Bug 2006611
| Summary: | CVO resolves the version takes a long time sometimes when upgrading via `--to-image` | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yang Yang <yanyang> |
| Component: | Cluster Version Operator | Assignee: | David Hurta <dhurta> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.8 | CC: | aos-bugs, bleanhar, dhurta, geliu, jialiu, jiazha, lmohanty |
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:46:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Yang Yang
2021-09-22 03:45:21 UTC
Checking an upgrade from 4.9 to 4.10: CVO first reports a failure on OLM's Upgradeable condition, but after a while it clears the failure and proceeds with the upgrade. So the upgrade is not blocked, only delayed.
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: "2021-09-22T01:58:55Z"
generation: 2
name: version
resourceVersion: "46459"
uid: 7aac81aa-01e2-49b3-b9f0-6e96f065ed9b
spec:
channel: stable-4.9
clusterID: 2897ebe3-212d-4e26-ba4b-4ce967167d64
desiredUpdate:
force: false
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: ""
status:
availableUpdates: null
conditions:
- lastTransitionTime: "2021-09-22T02:22:12Z"
message: Done applying 4.9.0-rc.1
status: "True"
type: Available
- lastTransitionTime: "2021-09-22T03:36:42Z"
message: 'Precondition "ClusterVersionUpgradeable" failed because of "ErrorCheckingOperatorCompatibility":
Cluster operator operator-lifecycle-manager should not be upgraded between
minor versions: Encountered errors while checking compatibility with the next
minor version of OpenShift: Desired release version missing from ClusterVersion'
reason: UpgradePreconditionCheckFailed
status: "True"
type: Failing
- lastTransitionTime: "2021-09-22T03:36:21Z"
message: 'Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be
safe to apply this update'
reason: UpgradePreconditionCheckFailed
status: "True"
type: Progressing
- lastTransitionTime: "2021-09-22T01:58:55Z"
message: 'Unable to retrieve available updates: currently reconciling cluster
version 4.9.0-rc.1 not found in the "stable-4.9" channel'
reason: VersionNotFound
status: "False"
type: RetrievedUpdates
- lastTransitionTime: "2021-09-22T03:36:42Z"
message: 'Cluster operator operator-lifecycle-manager should not be upgraded
between minor versions: Encountered errors while checking compatibility with
the next minor version of OpenShift: Desired release version missing from
ClusterVersion'
reason: ErrorCheckingOperatorCompatibility
status: "False"
type: Upgradeable
desired:
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: 4.10.0-0.nightly-2021-09-21-181111
history:
- completionTime: null
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
startedTime: "2021-09-22T03:36:21Z"
state: Partial
verified: true
version: 4.10.0-0.nightly-2021-09-21-181111
- completionTime: "2021-09-22T02:22:12Z"
image: quay.io/openshift-release-dev/ocp-release@sha256:2cce76f4dc2400d3c374f76ac0aa4e481579fce293e732f0b27775b7218f2c8d
startedTime: "2021-09-22T01:58:55Z"
state: Completed
verified: false
version: 4.9.0-rc.1
observedGeneration: 2
versionHash: F-Tl07K3E1k=
kind: List
metadata:
resourceVersion: ""
selfLink: ""
############
After a while, check again...
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: "2021-09-22T01:58:55Z"
generation: 2
name: version
resourceVersion: "49930"
uid: 7aac81aa-01e2-49b3-b9f0-6e96f065ed9b
spec:
channel: stable-4.9
clusterID: 2897ebe3-212d-4e26-ba4b-4ce967167d64
desiredUpdate:
force: false
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: ""
status:
availableUpdates: null
conditions:
- lastTransitionTime: "2021-09-22T02:22:12Z"
message: Done applying 4.9.0-rc.1
status: "True"
type: Available
- lastTransitionTime: "2021-09-22T03:41:12Z"
status: "False"
type: Failing
- lastTransitionTime: "2021-09-22T03:36:21Z"
message: 'Working towards 4.10.0-0.nightly-2021-09-21-181111: 95 of 739 done
(12% complete)'
status: "True"
type: Progressing
- lastTransitionTime: "2021-09-22T01:58:55Z"
message: 'Unable to retrieve available updates: currently reconciling cluster
version 4.10.0-0.nightly-2021-09-21-181111 not found in the "stable-4.9" channel'
reason: VersionNotFound
status: "False"
type: RetrievedUpdates
desired:
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: 4.10.0-0.nightly-2021-09-21-181111
history:
- completionTime: null
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
startedTime: "2021-09-22T03:36:21Z"
state: Partial
verified: true
version: 4.10.0-0.nightly-2021-09-21-181111
- completionTime: "2021-09-22T02:22:12Z"
image: quay.io/openshift-release-dev/ocp-release@sha256:2cce76f4dc2400d3c374f76ac0aa4e481579fce293e732f0b27775b7218f2c8d
startedTime: "2021-09-22T01:58:55Z"
state: Completed
verified: false
version: 4.9.0-rc.1
observedGeneration: 2
versionHash: A718pGr3uf8=
kind: List
metadata:
resourceVersion: ""
selfLink: ""
I hit this issue too when upgrading from 4.9 to 4.10, so I am raising the Priority/Severity. Details as follows:
[cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --allow-upgrade-with-warnings is bypassing: already upgrading.
Reason: ImageVerificationFailed
Message: Unable to apply registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5: the image may not be safe to use
Updating to release image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2021-09-17-210126 True True 4m34s Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe to apply this update
[cloud-user@preserve-olm-env jian]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: "2021-09-22T04:08:46Z"
generation: 4
name: version
resourceVersion: "47959"
uid: 3389b705-45d2-4a50-8eea-aa22249def23
spec:
channel: stable-4.9
clusterID: 2a995974-0127-42ad-a867-398aab50523b
desiredUpdate:
force: false
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: ""
upstream: https://amd64.ocp.releases.ci.openshift.org/graph
status:
availableUpdates:
- image: registry.ci.openshift.org/ocp/release@sha256:902addea15526d53d37e0343b233ca6ed0d9474613087fd867ffa8a9df3d78bc
version: 4.9.0-0.nightly-2021-09-18-052905
conditions:
- lastTransitionTime: "2021-09-22T04:35:48Z"
message: Done applying 4.9.0-0.nightly-2021-09-17-210126
status: "True"
type: Available
- lastTransitionTime: "2021-09-22T05:44:03Z"
message: "Multiple precondition checks failed:\n* Precondition \"ClusterVersionUpgradeable\"
failed because of \"ErrorCheckingOperatorCompatibility\": Cluster operator operator-lifecycle-manager
should not be upgraded between minor versions: Encountered errors while checking
compatibility with the next minor version of OpenShift: Desired release version
missing from ClusterVersion\n* Precondition \"EtcdRecentBackup\" failed because
of \"ControllerStarted\": "
reason: UpgradePreconditionCheckFailed
status: "True"
type: Failing
- lastTransitionTime: "2021-09-22T05:40:48Z"
message: 'Unable to apply 4.10.0-0.nightly-2021-09-21-181111: it may not be safe
to apply this update'
reason: UpgradePreconditionCheckFailed
status: "True"
type: Progressing
- lastTransitionTime: "2021-09-22T05:39:57Z"
status: "True"
type: RetrievedUpdates
- lastTransitionTime: "2021-09-22T05:43:24Z"
message: 'Cluster operator operator-lifecycle-manager should not be upgraded between
minor versions: Encountered errors while checking compatibility with the next
minor version of OpenShift: Desired release version missing from ClusterVersion'
reason: ErrorCheckingOperatorCompatibility
status: "False"
type: Upgradeable
desired:
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: 4.10.0-0.nightly-2021-09-21-181111
history:
- completionTime: null
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
startedTime: "2021-09-22T05:43:48Z"
state: Partial
verified: true
version: 4.10.0-0.nightly-2021-09-21-181111
- completionTime: "2021-09-22T05:43:48Z"
image: registry.ci.openshift.org/ocp/release@sha256:b8375a1c73d968d340dda2a8c38f6e417f1ff2d7facac579986a193a0e922be5
startedTime: "2021-09-22T05:40:48Z"
state: Partial
verified: false
version: ""
- completionTime: "2021-09-22T04:35:48Z"
image: registry.ci.openshift.org/ocp/release@sha256:0cc74698c0c6ea9d8658a3c42761befcfe5f559dbbb0eb39a6705e4cc9e29c58
startedTime: "2021-09-22T04:08:46Z"
state: Completed
verified: false
version: 4.9.0-0.nightly-2021-09-17-210126
observedGeneration: 4
versionHash: NxpKvzavsQo=
Yang, Jian,

In the first example -- 4.8 -> 4.9 -- it looks like there's a second issue preventing the upgrade: the given image has an invalid signature.

By design, OLM signals CVO to block upgrades when it can't determine the desired cluster version. When a specific image is given, OLM depends on CVO to resolve the version contained within that image (by looking at the clusterversion's status.desired.version field). In the meantime, OLM needs to signal that it can't determine compatibility due to a transitive error -- desired version isn't set yet -- so that it can prevent upgrades that would otherwise break the cluster.

In the second example -- 4.9 -> 4.10 -- it looks like OLM correctly re-evaluates upgradeability once the clusterversion's status.desired.version field is resolved. There seems to be some delay -- on the order of minutes -- but I'd need to see OLM's ClusterOperator to know whether that delay was in CVO or in OLM (I suspect CVO).

In the third example, it looks like the state is about to be made consistent. Jian, did you happen to check this afterwards to see if it eventually resolved? Could you get the operator-lifecycle-manager ClusterOperator so we can see if OLM detected the desired version and re-evaluated correctly? My suspicion is that CVO is just slow in picking up changes to ClusterOperator resources.

I'm not sure this constitutes a bug since:
- The behavior looks correct (OLM doesn't allow upgrades when the cluster version isn't known to CVO yet)
- It looks like things are eventually consistent (upgrade when they should)
- Cluster admins can always force an upgrade
- Updating directly to an image seems like it's already off of the recommended upgrade path (i.e. thar be dragons), but please correct me if I'm wrong here

Thanks for looking into it!

> In the first example -- 4.8 -> 4.9 -- it looks like there's a second issue preventing the upgrade: the given image has an invalid signature.
Ack. The image I was upgrading to is not signed. It's not relevant to the bug.

> When a specific image is given, OLM depends on CVO to resolve the version contained within that image
Ack. CVO resolves the version and sets the clusterversion's status.desired.version field.

> There seems to be some delay -- on the order of minutes -- but I'd need to see OLM's ClusterOperator to know whether that delay was in CVO or in OLM (I suspect CVO).
Yeah. I think that's the key issue here. The delay is a bit long and could make users think that upgrading to this image is not supported, so it would be better to improve it.

> - The behavior looks correct (OLM doesn't allow upgrades when the cluster version isn't known to CVO yet)
Ack.
> - It looks like things are eventually consistent (upgrade when they should)
Ack.
> - Cluster admins can always force an upgrade
Ack.
> - Updating directly to an image seems like it's already off of the recommended upgrade path
Yeah. But it's still the way QE most frequently upgrades a cluster to a nightly build.

> Jian, did you happen to check this afterwards to see if it eventually resolved? Could you get the operator-lifecycle-manager ClusterOperator so we can see if OLM detected the desired version and re-evaluated correctly? My suspicion is that CVO is just slow in picking up changes to ClusterOperator resources.
Yes. After a while the CVO resolves the version and sets the clusterversion's status.desired.version field, and OLM no longer blocks the upgrade.
[cloud-user@preserve-olm-env jian]$ oc get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
...
desired:
image: registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08
version: 4.10.0-0.nightly-2021-09-21-181111
Sometimes the CVO takes a long time to resolve the version (5+ minutes); other times it is quicker (about 2 minutes). Transferring this to the CVO team for a look, thanks!
Lowering the priority since it does resolve eventually.
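The resolution delay described above could be measured with a small polling helper. This is a minimal sketch, assuming `oc` is logged in to the cluster; the function name `wait_for_desired_version` and its "image matches and version is non-empty" test are illustrative assumptions, not CVO's or OLM's own logic.

```shell
# Sketch only: assumes `oc` is logged in to the cluster under test.
# wait_for_desired_version is a hypothetical helper, not part of oc or CVO.
#
# Usage: wait_for_desired_version TARGET_IMAGE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Succeeds once status.desired points at TARGET_IMAGE with a non-empty version.
wait_for_desired_version() {
  wfdv_target="$1"
  wfdv_timeout="${2:-600}"
  wfdv_interval="${3:-5}"
  wfdv_start="$(date +%s)"
  while :; do
    # Prints "<image> <version>"; the version part is empty while unresolved.
    wfdv_out="$(oc get clusterversion version \
      -o jsonpath='{.status.desired.image}{" "}{.status.desired.version}')"
    wfdv_image="${wfdv_out%% *}"
    wfdv_version="${wfdv_out#* }"
    wfdv_elapsed=$(( $(date +%s) - wfdv_start ))
    if [ "${wfdv_image}" = "${wfdv_target}" ] && [ -n "${wfdv_version}" ]; then
      echo "resolved ${wfdv_version} after ${wfdv_elapsed}s"
      return 0
    fi
    if [ "${wfdv_elapsed}" -ge "${wfdv_timeout}" ]; then
      echo "desired version still unresolved after ${wfdv_elapsed}s" >&2
      return 1
    fi
    sleep "${wfdv_interval}"
  done
}
```

Calling it right after `oc adm upgrade --to-image ...`, with the same pullspec as its first argument, reports roughly how many seconds passed before status.desired.version was filled in.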
[cloud-user@preserve-olm-env jian]$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:af7ac552a0b5b05ef05e710791e1803577e521d96ce61abafa48f9f0642c5a08 --allow-explicit-upgrade --allow-upgrade-with-warnings
From comment#1, it takes around 5 minutes for the Failing condition to move from True to False. That is a bit long; it would be better to improve it.
- lastTransitionTime: "2021-09-22T03:36:42Z"
  message: 'Precondition "ClusterVersionUpgradeable" failed because of "ErrorCheckingOperatorCompatibility":
    Cluster operator operator-lifecycle-manager should not be upgraded between
    minor versions: Encountered errors while checking compatibility with the next
    minor version of OpenShift: Desired release version missing from ClusterVersion'
  reason: UpgradePreconditionCheckFailed
  status: "True"
  type: Failing
- lastTransitionTime: "2021-09-22T03:41:12Z"
  status: "False"
  type: Failing
*** Bug 2072348 has been marked as a duplicate of this bug. ***
Verifying it with 4.12.0-0.nightly-2022-08-31-101631
1. Install a cluster with 4.12.0-0.nightly-2022-08-31-101631
2. Make MCO upgradeable=false by cordoning all the nodes
# NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
# for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
node/yanyang-0901b-f97vx-master-0.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-master-1.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-master-2.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-a-pdw7r.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-b-xgp5w.c.openshift-qe.internal cordoned
node/yanyang-0901b-f97vx-worker-c-llgkn.c.openshift-qe.internal cordoned
# oc adm upgrade
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-31-101631 not found in the "stable-4.11" channel
Cluster version is 4.12.0-0.nightly-2022-08-31-101631
Reason: PoolUpdating
Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
Fine, got Upgradeable=False
3. Upgrade the cluster
# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6 --allow-explicit-upgrade
4. Uncordon all the nodes
# for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done
node/yanyang-0901b-f97vx-master-0.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-master-1.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-master-2.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-a-pdw7r.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-b-xgp5w.c.openshift-qe.internal uncordoned
node/yanyang-0901b-f97vx-worker-c-llgkn.c.openshift-qe.internal uncordoned
5. Get clusterversion every 5 seconds
- lastTransitionTime: "2022-09-01T06:04:31Z"
message: 'Preconditions failed for payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6":
Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating":
Cluster operator machine-config should not be upgraded between minor versions:
One or more machine config pools are updating, please see `oc get mcp` for
further details'
reason: PreconditionChecks
status: "False"
type: ReleaseAccepted
- lastTransitionTime: "2022-09-01T04:14:04Z"
message: Done applying 4.12.0-0.nightly-2022-08-31-101631
status: "True"
type: Available
- lastTransitionTime: "2022-09-01T04:13:55Z"
status: "False"
type: Failing
- lastTransitionTime: "2022-09-01T04:14:04Z"
message: Cluster version is 4.12.0-0.nightly-2022-08-31-101631
status: "False"
type: Progressing
- lastTransitionTime: "2022-09-01T06:02:04Z"
message: 'Cluster operator machine-config should not be upgraded between minor
versions: One or more machine config pools are updating, please see `oc get
mcp` for further details'
reason: PoolUpdating
status: "False"
type: Upgradeable
...
- lastTransitionTime: "2022-09-01T06:04:31Z"
message: 'Preconditions failed for payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6":
Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating":
Cluster operator machine-config should not be upgraded between minor versions:
One or more machine config pools are updating, please see `oc get mcp` for
further details'
reason: PreconditionChecks
status: "False"
type: ReleaseAccepted
- lastTransitionTime: "2022-09-01T04:14:04Z"
message: Done applying 4.12.0-0.nightly-2022-08-31-101631
status: "True"
type: Available
- lastTransitionTime: "2022-09-01T04:13:55Z"
status: "False"
type: Failing
- lastTransitionTime: "2022-09-01T04:14:04Z"
message: Cluster version is 4.12.0-0.nightly-2022-08-31-101631
status: "False"
type: Progressing
...
- lastTransitionTime: "2022-09-01T06:10:01Z"
message: Payload loaded version="4.11.2" image="quay.io/openshift-release-dev/ocp-release@sha256:98c294a87ab2e794e313e2d7da8a63efcc01e3e850ad92281ae39b7e525696e6"
architecture="amd64"
reason: PayloadLoaded
status: "True"
type: ReleaseAccepted
- lastTransitionTime: "2022-09-01T04:14:04Z"
message: Done applying 4.12.0-0.nightly-2022-08-31-101631
status: "True"
type: Available
- lastTransitionTime: "2022-09-01T04:13:55Z"
status: "False"
type: Failing
- lastTransitionTime: "2022-09-01T06:10:01Z"
message: 'Working towards 4.11.2: 4 of 803 done (0% complete)'
status: "True"
type: Progressing
About 5 seconds after CV's Upgradeable went to True, ReleaseAccepted went to True as well. The PR looks good.
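The "get clusterversion every 5 seconds" step above can be scripted so that only condition changes are logged. The following is a sketch under the assumption that `oc` is logged in; `cv_condition_status` and `watch_cv_condition` are hypothetical helper names, and the 5-second default simply mirrors the interval used in the verification.

```shell
# Sketch only: helper names are hypothetical; assumes `oc` is logged in.

# Print the status ("True"/"False"/"Unknown") of one ClusterVersion condition.
cv_condition_status() {
  oc get clusterversion version \
    -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}"
}

# Sample a condition COUNT times, INTERVAL seconds apart, and print a
# timestamped line only when the observed status changes.
# Usage: watch_cv_condition TYPE [INTERVAL_SECONDS] [COUNT]
watch_cv_condition() {
  wcc_type="$1"
  wcc_interval="${2:-5}"
  wcc_count="${3:-120}"
  wcc_prev=""
  wcc_i=0
  while [ "${wcc_i}" -lt "${wcc_count}" ]; do
    wcc_now="$(cv_condition_status "${wcc_type}")"
    if [ "${wcc_now}" != "${wcc_prev}" ]; then
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${wcc_type}=${wcc_now}"
      wcc_prev="${wcc_now}"
    fi
    wcc_i=$((wcc_i + 1))
    # Skip the sleep after the final sample.
    if [ "${wcc_i}" -lt "${wcc_count}" ]; then
      sleep "${wcc_interval}"
    fi
  done
  return 0
}
```

Running `watch_cv_condition Upgradeable` and `watch_cv_condition ReleaseAccepted` in two terminals would show how far apart the two transitions are, within the sampling interval.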
BTW, there seems to be a delay between MCO and CVO; see:
# oc get co machine-config -oyaml
...snippet...
- lastTransitionTime: "2022-09-01T06:07:19Z"
reason: AsExpected
status: "True"
type: Upgradeable
...
At 06:07:19Z MCO's Upgradeable went to True, while CVO's Upgradeable didn't go to True until 06:10.
Anyway, moving it to verified since the CVO fix works well.
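The delays quoted in this bug can be computed directly from the lastTransitionTime values. A small sketch, assuming GNU `date` (the `-d` option; BSD/macOS `date` uses different flags); `transition_delay` is a hypothetical helper name.

```shell
# Sketch only: assumes GNU date, which parses RFC 3339 timestamps via -d.
# transition_delay is a hypothetical helper name.
#
# Usage: transition_delay EARLIER_TIMESTAMP LATER_TIMESTAMP
# Prints the number of seconds between the two timestamps.
transition_delay() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# MCO Upgradeable=True versus the CVO transition at 06:10:01Z in the verification
transition_delay "2022-09-01T06:07:19Z" "2022-09-01T06:10:01Z"
# Failing=True versus Failing=False in the original report
transition_delay "2021-09-22T03:36:42Z" "2021-09-22T03:41:12Z"
```

With the timestamps above this gives 162 seconds for the MCO-to-CVO gap (using the 06:10:01Z ReleaseAccepted transition as a stand-in for CVO's Upgradeable change) and 270 seconds for the Failing condition in the original report.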
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399