Description of problem:

In OCP versions 4.8 and older, using a GitOps workflow during a y-stream upgrade does not function correctly if you update your channel and desired version at the same time. For example, if I am on 4.7.22 and want to get to 4.8.x, I need to:

1. Change the channel from fast-4.7 to fast-4.8.
2. Change the desired version from 4.7.22 to 4.8.3 (after I have verified this is a valid path).

The CVO apparently doesn't recognize the channel change for a period of time, but will attempt the version check sooner, which leads to an error because the desired version isn't in the available-updates list yet:

  The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.8.3": when image is empty the update must be a previous version or an available update

How reproducible:

Always

Steps to Reproduce:

1. For example, if the cluster's current version is 4.7.21 and the channel is fast-4.7, attempt an update to version 4.8.2, changing the desired version and the channel (to fast-4.8) at the same time:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.2"}}]'

2. The cluster-version operator will complain "Stopped at 4.7.21: the cluster version is invalid":

  $ oc get clusterversion
  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.7.21    True        False         17m     Stopped at 4.7.21: the cluster version is invalid

  $ oc get clusterversion -o yaml
  - lastTransitionTime: "2021-08-06T16:13:35Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2021-08-06T17:55:27Z"
    message: 'Stopped at 4.7.21: the cluster version is invalid'
    reason: InvalidClusterVersion
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-08-06T18:12:01Z"
    message: 'The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.8.2": when image is empty the update must be a previous version or an available update'
    reason: InvalidClusterVersion
    status: "True"
    type: Invalid

Expected results:

The update should start to 4.8.2.
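For reference, the list that this validation is checking against is exposed in status.availableUpdates. A quick client-side check before patching (a sketch; "version" is the standard ClusterVersion object name):

  $ oc adm upgrade    # summarizes the updates recommended from the currently configured channel
  $ oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}{"\n"}'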
The relevant CVO code is very old, and goes back at least as far as 4.6 (our oldest supported version [1]). Setting Version back to 4.6 so folks mulling over backports don't have to wonder about that.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates
Reducing the severity to medium, as this will not block updates: breaking the single step down into two steps works as a workaround.
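A minimal sketch of that two-step workaround, reusing the channel and target from the description above: set the channel first, wait for the CVO to refresh availableUpdates, and only then set the desired version.

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}]'
  $ # wait until 4.8.2 shows up in 'oc adm upgrade' output, then:
  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.2"}}]'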
I had thought that update preconditions might have fallen into the same loop, but it turns out they are in a different loop, and we already poll the preconditions. Confirming in 4.8.5, by setting an override [1]:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.5
$ cat <<EOF >version-patch-first-override.yaml
> - op: add
>   path: /spec/overrides
>   value:
>   - kind: Deployment
>     group: apps/v1
>     name: network-operator
>     namespace: openshift-network-operator
>     unmanaged: true
> EOF
$ oc patch clusterversion version --type json -p "$(cat version-patch-first-override.yaml)"
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-08-23T17:33:38Z Upgradeable=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
$ oc adm upgrade channel candidate-4.8  # requires a 4.9+ oc binary
$ oc adm upgrade --to 4.8.6

Wait a bit for the download and preconditions. Then:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-08-23T17:28:53Z Available=True : Done applying 4.8.5
2021-08-23T17:40:16Z Failing=True UpgradePreconditionCheckFailed: Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
2021-08-23T17:39:51Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.8.6: it may not be safe to apply this update
2021-08-23T17:39:29Z RetrievedUpdates=True :
2021-08-23T17:33:38Z Upgradeable=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

And we are polling those conditions, with multiple PreconditionsFailed counts:

$ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.reason == "PreconditionsFailed") | .firstTimestamp + " " + (.count | tostring) + " " + .lastTimestamp + " " + .reason + ": " + .message'
2021-08-23T17:40:10Z 3 2021-08-23T17:45:05Z PreconditionsFailed: preconditions failed for payload loaded version="4.8.6" image="quay.io/openshift-release-dev/ocp-release@sha256:e64c04c41ae7717fff4b341987ac37c313045d4c3aa7bb8c6bfe8bf8540a5025" failures=Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

So this bug is just about polling "does the new desiredUpdate.version appear in availableUpdates, so we can get the associated pullspec?", not about polling in later target-acceptance steps.

[1]: https://github.com/openshift/enhancements/blob/f97876821d3bb506d28fee565271d3bebbbc682c/dev-guide/cluster-version-operator/dev/clusterversion.md#setting-objects-unmanaged
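For completeness, clearing that test override again afterward is just the inverse JSON-patch operation (a sketch, assuming the override above is the only entry in spec.overrides):

$ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'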
> So this bug is just about polling "does the new desiredUpdate.version appear in availableUpdates, so we can get the associated pullspec?", not about polling in later target-acceptance steps.

That's my understanding too.
Poking around with a cluster-bot 4.8.11 (so the channel is not set out of the box):

$ oc get -o json clusterversion version | jq '{spec: (.spec | {channel, desiredUpdate}) , status: (.status | {availableUpdates, conditions: ([.conditions[] | select(.type == "Failing" or .type == "RetrievedUpdates")])})}'
{
  "spec": {
    "channel": null,
    "desiredUpdate": null
  },
  "status": {
    "availableUpdates": null,
    "conditions": [
      {
        "lastTransitionTime": "2021-09-21T22:45:35Z",
        "status": "False",
        "type": "Failing"
      },
      {
        "lastTransitionTime": "2021-09-21T22:21:07Z",
        "message": "The update channel has not been configured.",
        "reason": "NoChannel",
        "status": "False",
        "type": "RetrievedUpdates"
      }
    ]
  }
}

Now set the channel and target release at the same time:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.12"}}]'

After a bit:

$ oc get -o json clusterversion version | jq '{spec: (.spec | {channel, desiredUpdate}) , status: (.status | {desired, availableUpdates, conditions: ([.conditions[] | select(.type == "Failing" or .type == "RetrievedUpdates" or .type == "Invalid")])})}'
{
  "spec": {
    "channel": "fast-4.8",
    "desiredUpdate": {
      "version": "4.8.12"
    }
  },
  "status": {
    "desired": {
      "image": "registry.ci.openshift.org/ocp/release@sha256:26f9da8c2567ddf15f917515008563db8b3c9e43120d3d22f9d00a16b0eb9b97",
      "url": "https://access.redhat.com/errata/RHBA-2021:3429",
      "version": "4.8.11"
    },
    "availableUpdates": null,
    "conditions": [
      {
        "lastTransitionTime": "2021-09-21T22:45:35Z",
        "status": "False",
        "type": "Failing"
      },
      {
        "lastTransitionTime": "2021-09-21T22:21:07Z",
        "message": "The update channel has not been configured.",
        "reason": "NoChannel",
        "status": "False",
        "type": "RetrievedUpdates"
      },
      {
        "lastTransitionTime": "2021-09-21T23:00:54Z",
        "message": "The cluster version is invalid: spec.desiredUpdate.version: Invalid value: \"4.8.12\": when image is empty the update must be a previous version or an available update",
        "reason": "InvalidClusterVersion",
        "status": "True",
        "type": "Invalid"
      }
    ]
  }
}

I had expected availableUpdates to get populated. What's going on with that?

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5c8745d67c-xblfx   1/1     Running   1          42m

Checking logs:

$ oc -n openshift-cluster-version logs cluster-version-operator-5c8745d67c-xblfx | grep -1 available | tail -n4
I0921 23:04:11.471613       1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (502.446µs)
I0921 23:04:11.471686       1 cvo.go:552] Started syncing available updates "openshift-cluster-version/version" (2021-09-21 23:04:11.471679877 +0000 UTC m=+1417.074284846)
I0921 23:04:11.471832       1 cvo.go:554] Finished syncing available updates "openshift-cluster-version/version" (146.66µs)
I0921 23:04:11.471901       1 cvo.go:574] Started syncing upgradeable "openshift-cluster-version/version" (2021-09-21 23:04:11.471894407 +0000 UTC m=+1417.074499388)

Aha, because Operator.availableUpdatesSync is calling ValidateClusterVersion [1], and failing when the given version is not listed in availableVersions yet [2]. We want to special-case that issue, and continue on to call syncAvailableUpdates [3] for any of the 'len(u.Version) > 0 && len(u.Image) == 0' cases.
However, availableUpdatesSync is not the only ValidateClusterVersion consumer:

$ git --no-pager grep 'func \|ValidateClusterVersion' | grep -B1 '[.]ValidateClusterVersion'
pkg/cvo/cvo.go:func (optr *Operator) sync(ctx context.Context, key string) error {
pkg/cvo/cvo.go:	errs := validation.ValidateClusterVersion(original)
pkg/cvo/cvo.go:func (optr *Operator) availableUpdatesSync(ctx context.Context, key string) error {
pkg/cvo/cvo.go:	if errs := validation.ValidateClusterVersion(config); len(errs) > 0 {
pkg/cvo/cvo.go:func (optr *Operator) upgradeableSync(ctx context.Context, key string) error {
pkg/cvo/cvo.go:	if errs := validation.ValidateClusterVersion(config); len(errs) > 0 {

I dunno why upgradeableSync feels the need for this guard. Blame says we've had it there since the function was created [4], but none of the checks seem particularly relevant to the Upgradeable collection (where they care about things like the cluster version, I'd expect them to care about status properties, not spec properties).

So I think a reasonable plan for this bug would be (see the sketch after this list):

* Drop the ValidateClusterVersion guard from upgradeableSync.
* Shift the 'len(u.Version) > 0 && len(u.Image) == 0' guard from ValidateClusterVersion to Operator.sync.

Then availableUpdatesSync will no longer trip over that guard, and we'll continue to poll the upstream (when a channel is set) to get fresh update recommendations. And Operator.sync will block on the inability to find the version in available updates (which it needs to do, because the caller didn't specify an image pullspec) until we eventually get an availableUpdates entry that matches the version, after which the update will begin.

Also, "Invalid" may deserve a more specific condition type name (OwnSpecInvalid?), and probably needs covering alerts and all that, like we give to Failing. Although it looks like reconciliation is continuing without issue while the CVO complains about the invalid spec:

$ date --utc --iso=m
2021-09-21T23:33+00:00
$ oc -n openshift-cluster-version logs cluster-version-operator-5c8745d67c-xblfx | grep 'Running sync.*in state\|Result of work' | tail -n2
I0921 23:30:57.096586       1 sync_worker.go:541] Running sync 4.8.11 (force=false) on generation 2 in state Reconciling at attempt 0
I0921 23:31:22.463061       1 task_graph.go:555] Result of work: []

[1]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/pkg/cvo/cvo.go#L592
[2]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/lib/validation/validation.go#L42
[3]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/pkg/cvo/cvo.go#L595
[4]: https://github.com/openshift/cluster-version-operator/commit/04528144feb7a8141801bce591fda43d65acc48a#diff-490d2318856a4a078992ebab5b3f70db6b2c074dee480aa0112dc7c52e37550eR463
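To make the intended post-fix behavior concrete, a sketch (hypothetical, assuming the plan above lands; versions reused from the experiment earlier in this comment): the combined patch stays in place, availableUpdates eventually gets populated because availableUpdatesSync no longer bails out, and Operator.sync starts the update once it can resolve the pullspec.

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.12"}}]'
$ # after the next available-updates sync, the target should resolve:
$ oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}{"\n"}'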
Reproduced it:

1. Install a 4.8 cluster

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.19    True        False         20h     Cluster version is 4.8.19

2. Patch to update the channel and desired version

# oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.9"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.9.6"}}]'

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.19    True        False         20h     Stopped at 4.8.19: the cluster version is invalid

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-11-09T08:56:50Z"
    generation: 2
    name: version
    resourceVersion: "473050"
    uid: 51dd6fbb-966e-41f1-bd15-05cbde4cd5ad
  spec:
    channel: candidate-4.9
    clusterID: 9331eba0-85a8-4a94-af81-739f89c70c97
    desiredUpdate:
      version: 4.9.6
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-11-09T09:22:55Z"
      message: Done applying 4.8.19
      status: "True"
      type: Available
    - lastTransitionTime: "2021-11-10T04:49:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-11-09T09:22:55Z"
      message: 'Stopped at 4.8.19: the cluster version is invalid'
      reason: InvalidClusterVersion
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-11-09T08:56:50Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.19 not found in the "stable-4.8" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-11-09T08:57:20Z"
      message: |
        Kubernetes 1.22 and therefore OpenShift 4.9 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6329921 for details and instructions.
      reason: AdminAckRequired
      status: "False"
      type: Upgradeable
    - lastTransitionTime: "2021-11-10T05:53:47Z"
      message: 'The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.9.6": when image is empty the update must be a previous version or an available update'
      reason: InvalidClusterVersion
      status: "True"
      type: Invalid
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:ac19c975be8b8a449dedcdd7520e970b1cc827e24042b8976bc0495da32c6b59
      url: https://access.redhat.com/errata/RHBA-2021:4109
      version: 4.8.19
    history:
    - completionTime: "2021-11-09T09:22:55Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:ac19c975be8b8a449dedcdd7520e970b1cc827e24042b8976bc0495da32c6b59
      startedTime: "2021-11-09T08:56:50Z"
      state: Completed
      verified: false
      version: 4.8.19
    observedGeneration: 1
    versionHash: oJVcBisP_Ao=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Verifying with:

# ./oc version
Client Version: 4.10.0-0.nightly-2021-11-09-181140
Server Version: 4.10.0-0.nightly-2021-11-09-181140
Kubernetes Version: v1.22.1+1b2affc

Patch to change the Cincinnati update-service upstream:

# ./oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

# ./oc adm upgrade
Cluster version is 4.10.0-0.nightly-2021-11-09-181140

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

# ./oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "nightly-4.10"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.10.0-0.nightly-2021-11-11-072405"}}]'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          40s     Unable to apply 4.10.0-0.nightly-2021-11-11-072405: the image may not be safe to use

ClusterVersion conditions:

  conditions:
  - lastTransitionTime: "2021-11-11T06:51:16Z"
    message: Done applying 4.10.0-0.nightly-2021-11-09-181140
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-11T13:33:46Z"
    message: 'The update cannot be verified: unable to locate a valid signature for one or more sources'
    reason: ImageVerificationFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2021-11-11T13:33:44Z"
    message: 'Unable to apply 4.10.0-0.nightly-2021-11-11-072405: the image may not be safe to use'
    reason: ImageVerificationFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-11-11T13:31:39Z"
    status: "True"
    type: RetrievedUpdates

The upgrade is no longer blocked by the invalid-version check (it now proceeds to the expected signature-verification failure for an unsigned nightly). Moving this to the verified state.
Hi Lala, with this fix the CVO supports upgrading by patching the channel and desired version at the same time. But from the oc client's perspective, if we run oc adm upgrade channel and oc adm upgrade --to back to back, the oc adm upgrade --to call prompts an error, because the available-updates list has not been resolved yet. Would you address this so that oc can change the channel and upgrade the cluster at the same time?

# ./oc adm upgrade channel nightly-4.10; ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-20-181820
warning: No channels known to be compatible with the current version "4.10.0-0.nightly-2021-11-20-143156"; unable to validate "nightly-4.10". Setting the update channel to "nightly-4.10" anyway.
error: No available updates, specify --to-image or wait for new updates to be available

Thanks.
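Until oc grows such support, one client-side workaround is to poll between the two commands. A rough sketch (illustrative only, not an official interface; it assumes the target is eventually published in the new channel, otherwise the loop never exits):

# ./oc adm upgrade channel nightly-4.10
# until ./oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}' | grep -q 4.10.0-0.nightly-2021-11-20-181820; do sleep 10; done
# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-20-181820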
If we wanted to adjust oc, I think that would be a separate ticket. The cluster-version operator is a long-running process, so it's a fairly low-level change to have it retry where it used to stick before. But oc calls are one-shot on the client side, and we probably don't want to teach it to retry in the expectation that the --to target being passed in will maybe soon show up as an available update. And there's currently no path through the "that's not an available update" guards around --to [1].

You could use --to-image today, possibly in conjunction with --allow-explicit-upgrade, which is the recommended approach for clusters where there is no upstream update service (or when folks are testing updates that are not recommended); see the sketch below. It's possible that oc could grow something like "when --to is not an available update and --allow-explicit-upgrade is set, then just set the desired target version, and don't worry about the lack of pullspec", but again, that's fiddly enough that I think it deserves its own, separate ticket.

[1]: https://github.com/openshift/oc/blob/b996c1021930d711ebf608f8c4c8ac77fecb1cbe/pkg/cli/admin/upgrade/upgrade.go#L241-L249
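For reference, a sketch of that --to-image route (the by-digest pullspec below is the 4.8.6 release image quoted earlier in this thread; substitute your actual target):

# ./oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:e64c04c41ae7717fff4b341987ac37c313045d4c3aa7bb8c6bfe8bf8540a5025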
Thanks Trevor. I totally agree that it's a separate ticket. Can we create a Jira ticket in the OTA project for further discussion?
If you want to, you can create a Bugzilla or a Jira ticket; either works for me. IMO this is a low-severity bug, because I do not expect users to run oc adm upgrade channel and oc adm upgrade --to in parallel.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056