Created attachment 1871007 [details]
CVO log file

Description of problem:

During minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False once it finds etcd RecentBackup not true so that upgrade is never started. Previously, CVO checked etcd RecentBackup if it’s not true, CVO set Failing=true, then etcd started to take backup. After backup has been taken, CVO set Failing to false and proceeded the upgrade.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-04-06T06:34:30Z"
    generation: 3
    name: version
    resourceVersion: "36529"
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  spec:
    channel: stable-4.10
    clusterID: c6e49e12-4a08-4795-beb9-fd819a14ea33
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4
      version: ""
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2022-04-06T06:34:31Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.10.7 not found in the "stable-4.10" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2022-04-06T07:14:51Z"
      message: 'Preconditions failed for payload loaded version="4.11.0-0.nightly-2022-03-29-152521" image="registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4": Precondition "EtcdRecentBackup" failed because of "ControllerStarted": '
      reason: PreconditionChecks
      status: "False"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      message: Done applying 4.10.7
      status: "True"
      type: Available
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2022-04-06T07:15:47Z"
      message: Cluster version is 4.10.7
      status: "False"
      type: Progressing
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      url: https://access.redhat.com/errata/RHBA-2022:1162
      version: 4.10.7
    history:
    - completionTime: "2022-04-06T06:55:47Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      startedTime: "2022-04-06T06:34:31Z"
      state: Completed
      verified: false
      version: 4.10.7
    observedGeneration: 3
    versionHash: o09_Mvm2ad0=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc get co/etcd -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-04-06T06:34:31Z"
  generation: 1
  name: etcd
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  resourceVersion: "29598"
  uid: b77ad218-07a7-4db5-b862-7d3df7954c36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      EtcdMembersDegraded: No unhealthy members found
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-04-06T06:56:30Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 7
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-04-06T06:41:55Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    reason: ControllerStarted
    status: Unknown
    type: RecentBackup
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
  versions:
  - name: raw-internal
    version: 4.10.7
  - name: etcd
    version: 4.10.7
  - name: operator
    version: 4.10.7

Version-Release number of the following components:
4.10.7

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.10 cluster
2. Upgrade to 4.11
# oc adm upgrade --allow-explicit-upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4

Actual results:
Upgrade exits because precondition check fails on etcd backup

Expected results:
CVO sets failing to true and waits for etcd backup

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
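
For anyone triaging a similar cluster, the RecentBackup condition shown in the etcd ClusterOperator above can be pulled out on its own. This is only a minimal sketch, assuming jq is installed; the function name recent_backup_status is made up for illustration and is not part of oc or CVO.

```shell
# Hypothetical helper: print "<status> <reason>" for the RecentBackup
# condition of a ClusterOperator JSON document supplied on stdin.
recent_backup_status() {
  jq -r '.status.conditions[]
         | select(.type == "RecentBackup")
         | .status + " " + (.reason // "")'
}

# Usage against a live cluster (not executed here):
#   oc get clusteroperator etcd -o json | recent_backup_status
```

Fed the etcd ClusterOperator from this report, it should print the Unknown status together with the ControllerStarted reason, which is the state that caused the EtcdRecentBackup precondition to fail.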
(In reply to Yang Yang from comment #0)
> During minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False once
> it finds etcd RecentBackup not true so that upgrade is never started.
> Previously, CVO checked etcd RecentBackup if it’s not true, CVO set
> Failing=true, then etcd started to take backup. After backup has been taken,
> CVO set Failing to false and proceeded the upgrade.
> ...
> Steps to Reproduce:
> 1. Install a 4.10 cluster
> 2. Upgrade to 4.11

This makes it a problem in the 4.10 CVO, probably introduced into 4.10.z by [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991
And we'll want etcd snapshots working again by the time we are recommending 4.10 -> 4.11 updates, so setting blocker+ on this 4.11.0-targeted bug.
It seems we need to wait for signed 4.12 builds to become available before we can do the verification.
This is blocker+, so we're committed to fixing before 4.11 GAs, which means we're unlikely to make update-graph changes based on this. Since update-graph changes are what UpgradeBlocker is for [1], I'm dropping the keyword.

[1]: https://github.com/openshift/enhancements/blob/bdf15e7a57a1f5a766e67c27c4ed9e0d03ef4bb4/enhancements/update/update-blocker-lifecycle/README.md
Verifying with 4.11.0-0.nightly-2022-05-06-215225

Steps are as below:

1. Install a cluster with 4.11.0-0.nightly-2022-05-06-215225

2. Overrides openshift-network-operator/network-operator
# oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps/v1"}]}}'

3. Upgrade to 4.11.0-0.nightly-2022-05-07-161754
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac

4. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-09T06:14:01Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-06-215225 not found in the "stable-4.11" channel
2022-05-09T06:14:01Z Upgradeable=False MultipleReasons: Cluster should not be upgraded between minor versions for multiple reasons: ClusterVersionOverridesSet,AdminAckRequired
* Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
* Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-09T06:14:01Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-09T06:51:11Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.11.0-0.nightly-2022-05-07-161754" image="registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac": Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
2022-05-09T06:34:01Z Available=True : Done applying 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:49:31Z Failing=False :
2022-05-09T06:49:46Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:50:47Z UpgradeableAdminAckRequired=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-09T06:50:47Z UpgradeableClusterVersionOverrides=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

Nice, we get ReleaseAccepted=False due to overrides

5. Remove overrides
# oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'
clusterversion.config.openshift.io/version patched

6. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-09T06:14:01Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-09T06:14:01Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-09T06:14:01Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-09T06:56:36Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-05-07-161754" image="registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac"
2022-05-09T06:34:01Z Available=True : Done applying 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:49:31Z Failing=False :
2022-05-09T06:56:36Z Progressing=True : Working towards 4.11.0-0.nightly-2022-05-07-161754: 105 of 795 done (13% complete)

Payload loaded and upgrade proceeds. Looks good to me.
Thanks to Trevor and Justin, finally we get a 4.12.

Verifying with 4.11.0-0.nightly-2022-05-07-161754

Steps to verify:

1. Install a 4.11 cluster

2. Upgrade to 4.12
# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26 --allow-explicit-upgrade

3. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:45:35Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26": Multiple precondition checks failed:
* Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
* Precondition "EtcdRecentBackup" failed because of "UpgradeBackupInProgress": RecentBackup: Backup pod phase: "Pending"
2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False :
2022-05-10T01:30:19Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-07-161754

ReleaseAccepted=False due to EtcdRecentBackup and AdminAckRequired

# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:45:35Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26": Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.
2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False :
2022-05-10T01:30:19Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-07-161754

EtcdRecentBackup precondition validation passed.

Then we manually provide the administrator acknowledgement
# oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.11-kube-1.25-api-removals-in-4.12":"true"}}' --type=merge
configmap/admin-acks patched

# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-0 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:47:37Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26"
2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False :
2022-05-10T01:47:37Z Progressing=True : Working towards 4.12.0-0.nightly-0: 9 of 795 done (1% complete)

AdminAck precondition validation passed. And upgrade proceeds. Looks good to me. Moving it to verified state.
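
The manual re-checks of the cv conditions above could be scripted roughly as follows. This is a sketch rather than anything CVO or QE ships: it assumes jq is installed, the function names release_accepted and wait_for_release_accepted are hypothetical, and the timeout and interval defaults are arbitrary illustrative values.

```shell
# Hypothetical helper: exit 0 when the ClusterVersion JSON on stdin
# reports ReleaseAccepted=True, non-zero otherwise.
release_accepted() {
  jq -e '.status.conditions
         | any(.type == "ReleaseAccepted" and .status == "True")' >/dev/null
}

# Hypothetical helper: poll the cluster until the target release is
# accepted or the timeout (in seconds) expires.
wait_for_release_accepted() {
  timeout=${1:-600}   # illustrative default, not taken from this bug
  interval=${2:-15}   # illustrative default, not taken from this bug
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if oc get clusterversion version -o json | release_accepted; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

With the fix in place, a loop like this should eventually return 0 once the etcd backup completes and any admin acknowledgement has been given; on an affected 4.10.z it would be expected to time out, since the precondition is never re-checked.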
Impact assessment

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
* Upgrades are impacted if the current cluster version is >= 4.10.8 and <= 4.10.14.
* Upgrades to 4.11 are especially impacted: as part of the upgrade an etcd backup must be taken, and because of this bug CVO does not wait for it.

What is the impact? Is it serious enough to warrant removing update recommendations?
* After an initial precondition check failure, CVO does not re-check the precondition.
* The upgrade does not proceed: CVO does not accept the new target release and continues to reconcile the current version.

How involved is remediation?
* The bug only impacts upgrades, so as long as the cluster stays on its current version there is nothing to remediate.
* To avoid the etcd backup issue, we suggest updating to 4.10.15 or later (which contains the fix) before updating to 4.11.z, because CVO does not check the etcd backup precondition for z-stream updates.
* If the cluster stays at ReleaseAccepted=False, address the condition, clear the upgrade with "$ oc adm upgrade --clear", and try the upgrade again.

Is this a regression?
* Yes. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=2064991 introduced it.
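
The affected range from the assessment above can be checked mechanically. The following is a minimal sketch in plain POSIX shell; the function name is_impacted is hypothetical, and it only understands plain x.y.z release versions (nightly or pre-release strings are reported as unknown rather than guessed at).

```shell
# Hypothetical helper: exit 0 if the given x.y.z version falls in the
# affected range 4.10.8 through 4.10.14, 1 if it does not, and 2 if the
# version string is not a plain x.y.z release.
is_impacted() {
  case $1 in
    *[!0-9.]*|'') return 2 ;;   # reject nightlies, suffixes, empty input
  esac
  old_ifs=$IFS
  IFS=.
  # Split on dots into positional parameters: major, minor, patch.
  set -- $1
  IFS=$old_ifs
  [ "$#" -eq 3 ] || return 2
  [ "$1" -eq 4 ] && [ "$2" -eq 10 ] && [ "$3" -ge 8 ] && [ "$3" -le 14 ]
}

# Usage against a live cluster (not executed here):
#   is_impacted "$(oc get clusterversion version -o jsonpath='{.status.desired.version}')"
```

If the check succeeds, the suggested path is a z-stream update to 4.10.15 or later before attempting 4.11.z.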
[1] landed a conditional risk for 4.10.14 to 4.11.0-fc.0, so I'm setting UpdateRecommendationsBlocked.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/2069
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069