Description of problem:
When upgrading a cluster from 4.8.12 to a 4.9 nightly with `oc adm upgrade --to-image=...`, CVO blocks the upgrade for about 5 minutes due to ErrorCheckingOperatorCompatibility, which is a rather long delay. When `oc adm upgrade --to-image=xxx` is used, the clusterversion's desired.version is initially empty. This causes OLM to report Upgradeable=False, which blocks the cluster upgrade. After a while, CVO resolves the version, OLM moves Upgradeable back to True, and CVO then moves Failing to False, but that takes about 5 minutes.

Version-Release number of selected component (if applicable):
4.8.12 to 4.9

How reproducible:
2/2

Steps to Reproduce:
1. Install a 4.8 cluster
2. Upgrade to 4.9
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5 --allow-explicit-upgrade

Actual results:
CVO blocks the upgrade for about 5 minutes.

Expected results:
The upgrade proceeds without the delay.

Additional info:
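Not part of the original report, but a quick way to watch the two fields involved here while the update is blocked (standard ClusterVersion/ClusterOperator field paths; adjust as needed):

# status.desired.version as resolved by CVO (empty while the symptom described above is present)
$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'

# OLM's own Upgradeable condition
$ oc get clusteroperator operator-lifecycle-manager -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}'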
Checking an upgrade from 4.9 to 4.10: after a while, CVO clears the OLM Upgradeable failure and proceeds with the upgrade, so the upgrade is not blocked indefinitely.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-09-22T01:58:55Z"
    generation: 2
    name: version
    resourceVersion: "46459"
    uid: 7aac81aa-01e2-49b3-b9f0-6e96f065ed9b
  spec:
    channel: stable-4.9
    clusterID: 2897ebe3-212d-4e26-ba4b-4ce967167d64
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      version: ""
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-09-22T02:22:12Z"
      message: Done applying 4.9.0-rc.1
      status: "True"
      type: Available
    - lastTransitionTime: "2021-09-22T03:36:42Z"
      message: 'Precondition "ClusterVersionUpgradeable" failed because of "ErrorCheckingOperatorCompatibility": Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: Encountered errors while checking compatibility with the next minor version of OpenShift: Desired release version missing from ClusterVersion'
      reason: UpgradePreconditionCheckFailed
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-09-22T03:36:21Z"
      message: 'Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe to apply this update'
      reason: UpgradePreconditionCheckFailed
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-09-22T01:58:55Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.9.0-rc.1 not found in the "stable-4.9" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-09-22T03:36:42Z"
      message: 'Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: Encountered errors while checking compatibility with the next minor version of OpenShift: Desired release version missing from ClusterVersion'
      reason: ErrorCheckingOperatorCompatibility
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      version: 4.10.0-0.nightly-2021-09-21-181111
    history:
    - completionTime: null
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      startedTime: "2021-09-22T03:36:21Z"
      state: Partial
      verified: true
      version: 4.10.0-0.nightly-2021-09-21-181111
    - completionTime: "2021-09-22T02:22:12Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:2cce76f4dc2400d3c374f76ac0aa4e481579fce293e732f0b27775b7218f2c8d
      startedTime: "2021-09-22T01:58:55Z"
      state: Completed
      verified: false
      version: 4.9.0-rc.1
    observedGeneration: 2
    versionHash: F-Tl07K3E1k=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

############ After a while, check again...
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-09-22T01:58:55Z"
    generation: 2
    name: version
    resourceVersion: "49930"
    uid: 7aac81aa-01e2-49b3-b9f0-6e96f065ed9b
  spec:
    channel: stable-4.9
    clusterID: 2897ebe3-212d-4e26-ba4b-4ce967167d64
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      version: ""
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-09-22T02:22:12Z"
      message: Done applying 4.9.0-rc.1
      status: "True"
      type: Available
    - lastTransitionTime: "2021-09-22T03:41:12Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-09-22T03:36:21Z"
      message: 'Working towards 4.10.0-0.nightly-2021-09-21-181111: 95 of 739 done (12% complete)'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-09-22T01:58:55Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.10.0-0.nightly-2021-09-21-181111 not found in the "stable-4.9" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      version: 4.10.0-0.nightly-2021-09-21-181111
    history:
    - completionTime: null
      image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
      startedTime: "2021-09-22T03:36:21Z"
      state: Partial
      verified: true
      version: 4.10.0-0.nightly-2021-09-21-181111
    - completionTime: "2021-09-22T02:22:12Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:2cce76f4dc2400d3c374f76ac0aa4e481579fce293e732f0b27775b7218f2c8d
      startedTime: "2021-09-22T01:58:55Z"
      state: Completed
      verified: false
      version: 4.9.0-rc.1
    observedGeneration: 2
    versionHash: A718pGr3uf8=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
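For reference (not from the original report), the condition transitions above can be listed compactly with a jsonpath range, which makes the Failing True-to-False window easier to eyeball:

$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.lastTransitionTime}{"\t"}{.type}{"="}{.status}{"\n"}{end}'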
Hit this issue when upgrading from 4.9 to 4.10 as well; raising the Priority/Severity. Details as follows:

[cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --allow-upgrade-with-warnings is bypassing: already upgrading.
  Reason: ImageVerificationFailed
  Message: Unable to apply registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5: the image may not be safe to use
Updating to release image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-17-210126   True        True          4m34s   Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe to apply this update

[cloud-user@preserve-olm-env jian]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2021-09-22T04:08:46Z"
  generation: 4
  name: version
  resourceVersion: "47959"
  uid: 3389b705-45d2-4a50-8eea-aa22249def23
spec:
  channel: stable-4.9
  clusterID: 2a995974-0127-42ad-a867-398aab50523b
  desiredUpdate:
    force: false
    image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
    version: ""
  upstream: https://amd64.ocp.releases.ci.openshift.org/graph
status:
  availableUpdates:
  - image: registry.ci.openshift.org/ocp/release@sha256:902addea15526d53d37e0343b233ca6ed0d9474613087fd867ffa8a9df3d78bc
    version: 4.9.0-0.nightly-2021-09-18-052905
  conditions:
  - lastTransitionTime: "2021-09-22T04:35:48Z"
    message: Done applying 4.9.0-0.nightly-2021-09-17-210126
    status: "True"
    type: Available
  - lastTransitionTime: "2021-09-22T05:44:03Z"
    message: "Multiple precondition checks failed:\n* Precondition \"ClusterVersionUpgradeable\" failed because of \"ErrorCheckingOperatorCompatibility\": Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: Encountered errors while checking compatibility with the next minor version of OpenShift: Desired release version missing from ClusterVersion\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": "
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2021-09-22T05:40:48Z"
    message: 'Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-09-22T05:39:57Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2021-09-22T05:43:24Z"
    message: 'Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: Encountered errors while checking compatibility with the next minor version of OpenShift: Desired release version missing from ClusterVersion'
    reason: ErrorCheckingOperatorCompatibility
    status: "False"
    type: Upgradeable
  desired:
    image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
    version: 4.10.0-0.nightly-2021-09-21-181111
  history:
  - completionTime: null
    image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
    startedTime: "2021-09-22T05:43:48Z"
    state: Partial
    verified: true
    version: 4.10.0-0.nightly-2021-09-21-181111
  - completionTime: "2021-09-22T05:43:48Z"
    image: registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5
    startedTime: "2021-09-22T05:40:48Z"
    state: Partial
    verified: false
    version: ""
  - completionTime: "2021-09-22T04:35:48Z"
    image: registry.ci.openshift.org/ocp/release@sha256:0cc74698c0c6ea9d8658a3c42761befcfe5f559dbbb0eb39a6705e4cc9e29c58
    startedTime: "2021-09-22T04:08:46Z"
    state: Completed
    verified: false
    version: 4.9.0-0.nightly-2021-09-17-210126
  observedGeneration: 4
  versionHash: NxpKvzavsQo=
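A quick way (not in the original comment) to summarize the update history above, which shows the earlier unverified attempt that stayed Partial with an empty version:

$ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.startedTime}{"\t"}{.state}{"\t"}{.verified}{"\t"}{.version}{"\n"}{end}'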
Yang, Jian,

In the first example (4.8 -> 4.9), it looks like there's a second issue preventing the upgrade: the given image has an invalid signature.

By design, OLM signals CVO to block upgrades when it can't determine the desired cluster version. When a specific image is given, OLM depends on CVO to resolve the version contained within that image (by looking at the clusterversion's status.desired.version field). In the meantime, OLM needs to signal that it can't determine compatibility due to a transient error -- the desired version isn't set yet -- so that it can prevent upgrades that would otherwise break the cluster.

In the second example (4.9 -> 4.10), it looks like OLM correctly re-evaluates upgradeability once the clusterversion's status.desired.version field is resolved. There seems to be some delay -- on the order of minutes -- but I'd need to see OLM's ClusterOperator to know whether that delay was in CVO or in OLM (I suspect CVO).

In the third example, it looks like the state is about to be made consistent.

Jian, did you happen to check this afterwards to see if it eventually resolved? Could you get the operator-lifecycle-manager ClusterOperator so we can see if OLM detected the desired version and re-evaluated correctly? My suspicion is that CVO is just slow in picking up changes to ClusterOperator resources.

I'm not sure this constitutes a bug since:
- The behavior looks correct (OLM doesn't allow upgrades when the cluster version isn't known to CVO yet)
- It looks like things are eventually consistent (upgrades proceed when they should)
- Cluster admins can always force an upgrade
- Updating directly to an image seems like it's already off of the recommended upgrade path (i.e. thar be dragons), but please correct me if I'm wrong here
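The requested data can be pulled directly; these commands are a suggestion, not part of the original exchange:

# OLM's ClusterOperator, including its Upgradeable condition
$ oc get clusteroperator operator-lifecycle-manager -o yaml

# or just its condition transitions:
$ oc get clusteroperator operator-lifecycle-manager -o jsonpath='{range .status.conditions[*]}{.lastTransitionTime}{"\t"}{.type}{"="}{.status}{"\n"}{end}'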
Thanks for looking into it!

> In the first example -- 4.8 -> 4.9 -- it looks like there's a second issue preventing the upgrade: the given image has an invalid signature.

Ack. The image I was upgrading to is not signed; it's not relevant to this bug.

> When a specific image is given, OLM depends on CVO to resolve the version contained within that image

Ack. CVO resolves the version and sets the clusterversion's status.desired.version field.

> There seems to be some delay -- on the order of minutes -- but I'd need to see OLM's ClusterOperator to know whether that delay was in CVO or in OLM (I suspect CVO).

Yeah, I think that's the key issue here. The delay is rather long and would make users think upgrading to this image is not supported, so it would be better to improve it.

- The behavior looks correct (OLM doesn't allow upgrades when the cluster version isn't known to CVO yet)
Ack.

- It looks like things are eventually consistent (upgrades proceed when they should)
Ack.

- Cluster admins can always force an upgrade
Ack.

- Updating directly to an image seems like it's already off of the recommended upgrade path
Yeah, but it's still the most frequently used way for QE to upgrade a cluster to a nightly build.
> Jian, did you happen to check this afterwards to see if it eventually resolved? Could you get the operator-lifecycle-manager ClusterOperator so we can see if OLM detected the desired version and re-evaluated correctly? My suspicion is that CVO is just slow in picking up changes to ClusterOperator resources.

Yes, CVO resolves the version and sets the clusterversion's status.desired.version field after a while, and OLM no longer blocks the upgrade.

[cloud-user@preserve-olm-env jian]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
...
  desired:
    image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
    version: 4.10.0-0.nightly-2021-09-21-181111

Sometimes it takes CVO a long time to resolve the version, about 5+ minutes; sometimes less (about 2 minutes). Transferring it to the CVO team for a look, thanks! Lowering the priority since it does resolve eventually.

The command used was:
[cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
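Since the delay varies (roughly 2 to 5+ minutes), a timestamped polling loop like the following (my own sketch, not part of the original comment) makes it easy to measure when status.desired.version actually gets populated:

$ while true; do
    printf '%s  desired.version=%s\n' "$(date -u +%H:%M:%SZ)" \
      "$(oc get clusterversion version -o jsonpath='{.status.desired.version}')"
    sleep 10
  done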
From comment #1, it takes around 5 minutes for the Failing condition to move from True to False. That is rather long; it would be better to improve it.

  - lastTransitionTime: "2021-09-22T03:36:42Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "ErrorCheckingOperatorCompatibility": Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: Encountered errors while checking compatibility with the next minor version of OpenShift: Desired release version missing from ClusterVersion'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing

  - lastTransitionTime: "2021-09-22T03:41:12Z"
    status: "False"
    type: Failing
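For what it's worth, the two transition times above work out to 4 minutes 30 seconds; a quick way to compute that (GNU date, my own addition):

$ start=$(date -d 2021-09-22T03:36:42Z +%s); end=$(date -d 2021-09-22T03:41:12Z +%s)
$ echo "$(( (end - start) / 60 ))m $(( (end - start) % 60 ))s"
4m 30s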
*** Bug 2072348 has been marked as a duplicate of this bug. ***
Verifying it with 4.12.0-0.nightly-2022-08-31-101631

1. Install a cluster with 4.12.0-0.nightly-2022-08-31-101631

2. Make MCO Upgradeable=False by cordoning all the nodes

# NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
# for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
node/yanyang-0901b-f97vx-master-0.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-master-1.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-master-2.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-a-pdw7r.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-b-xgp5w.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-c-llgkn.c.openshift-qe.internal cordoned

# oc adm upgrade
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-31-101631 not found in the "stable-4.11" channel

Cluster version is 4.12.0-0.nightly-2022-08-31-101631

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11

Fine, got Upgradeable=False.

3. Upgrade the cluster

# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6 --allow-explicit-upgrade

4. Uncordon all the nodes

# for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done
node/yanyang-0901b-f97vx-master-0.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-master-1.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-master-2.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-a-pdw7r.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-b-xgp5w.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-c-llgkn.c.openshift-qe.internal uncordoned

5. Get the clusterversion every 5 seconds

  - lastTransitionTime: "2022-09-01T06:04:31Z"
    message: 'Preconditions failed for payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6": Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating": Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
    reason: PreconditionChecks
    status: "False"
    type: ReleaseAccepted
  - lastTransitionTime: "2022-09-01T04:14:04Z"
    message: Done applying 4.12.0-0.nightly-2022-08-31-101631
    status: "True"
    type: Available
  - lastTransitionTime: "2022-09-01T04:13:55Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2022-09-01T04:14:04Z"
    message: Cluster version is 4.12.0-0.nightly-2022-08-31-101631
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-09-01T06:02:04Z"
    message: 'Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
...
  - lastTransitionTime: "2022-09-01T06:04:31Z"
    message: 'Preconditions failed for payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6": Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating": Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
    reason: PreconditionChecks
    status: "False"
    type: ReleaseAccepted
  - lastTransitionTime: "2022-09-01T04:14:04Z"
    message: Done applying 4.12.0-0.nightly-2022-08-31-101631
    status: "True"
    type: Available
  - lastTransitionTime: "2022-09-01T04:13:55Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2022-09-01T04:14:04Z"
    message: Cluster version is 4.12.0-0.nightly-2022-08-31-101631
    status: "False"
    type: Progressing
...
  - lastTransitionTime: "2022-09-01T06:10:01Z"
    message: Payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6" architecture="amd64"
    reason: PayloadLoaded
    status: "True"
    type: ReleaseAccepted
  - lastTransitionTime: "2022-09-01T04:14:04Z"
    message: Done applying 4.12.0-0.nightly-2022-08-31-101631
    status: "True"
    type: Available
  - lastTransitionTime: "2022-09-01T04:13:55Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2022-09-01T06:10:01Z"
    message: 'Working towards 4.11.2: 4 of 803 done (0% complete)'
    status: "True"
    type: Progressing

After the ClusterVersion's Upgradeable condition went to True, ReleaseAccepted went to True about 5 seconds later. The PR looks good.

BTW, there seems to be a delay between MCO and CVO; see:

# oc get co machine-config -oyaml
...snippet...
  - lastTransitionTime: "2022-09-01T06:07:19Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
...

At 06:07:19Z MCO's Upgradeable condition went to True, while CVO's Upgradeable condition did not go to True until around 06:10. Anyway, moving this to VERIFIED since the CVO fix works well.
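To pin down where that MCO-to-CVO lag sits, the two Upgradeable transition times can be compared directly (commands are my suggestion, not part of the verification run):

# Upgradeable as reported by the machine-config ClusterOperator
$ oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].lastTransitionTime}{"\n"}'

# Upgradeable as aggregated by CVO on the ClusterVersion
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].lastTransitionTime}{"\n"}'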
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399