Bug 1990635 - CVO does not recognize the channel change if desired version and channel changed at the same time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Lalatendu Mohanty
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks: 2055310 2055314
 
Reported: 2021-08-05 19:17 UTC by Lalatendu Mohanty
Modified: 2022-03-12 04:37 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2055310
Environment:
Last Closed: 2022-03-12 04:37:24 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 669 0 None open [WIP] Bug 1990635: Fixing the sync issue when desired version and channel changed at the same time 2021-10-07 05:16:32 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:37:46 UTC

Description Lalatendu Mohanty 2021-08-05 19:17:40 UTC
Description of problem:

In OCP versions 4.8 and older, using a GitOps workflow during a y-stream upgrade does not function correctly if you update your channel and desired version at the same time. For example, if I am on 4.7.22 and want to get to 4.8.x, I need to:

1. Change the channel from fast-4.7 to fast-4.8.
2. Change the desired version from 4.7.22 to 4.8.3 (after verifying this is a valid path).

The CVO apparently doesn't recognize the channel change for a period of time, but will attempt the version check sooner, which leads to an error because the desired version isn't in the available version list:

The cluster version is invalid: spec.desiredUpdate.version: Invalid
value: "4.8.3": when image is empty the update must be a previous
version or an available update
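
As a side note, one way to verify that the target is a recommended path before setting desiredUpdate is to list the updates the CVO has retrieved for the channel; a minimal sketch:

   $ oc adm upgrade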


How reproducible:

Always

Steps to Reproduce:
1. For example, with the cluster at version 4.7.21 on channel fast-4.7, attempt an update to version 4.8.2, changing the desired version and the channel (to fast-4.8) at the same time:

   $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.2"}}]'

2. The cluster-version operator will complain "Stopped at 4.7.21: the cluster version is invalid":

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.21    True        False         17m     Stopped at 4.7.21: the cluster version is invalid

$ oc get clusterversion -o yaml

    - lastTransitionTime: "2021-08-06T16:13:35Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-08-06T17:55:27Z"
      message: 'Stopped at 4.7.21: the cluster version is invalid'
      reason: InvalidClusterVersion
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-08-06T18:12:01Z"
      message: 'The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.8.2": when image is empty the update must be a previous version or an available update'
      reason: InvalidClusterVersion
      status: "True"
      type: Invalid


Expected results:

The update to 4.8.2 should start.

Comment 2 W. Trevor King 2021-08-06 03:56:47 UTC
The relevant CVO code is very old, and goes back at least as far as 4.6 (our oldest supported version [1]).  Setting Version back to 4.6 so folks mulling over backports don't have to wonder about that.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 3 Lalatendu Mohanty 2021-08-06 18:48:05 UTC
Reducing the severity to medium: this will not block updates, because breaking the single step into two steps (change the channel first, then set the desired version) works as a workaround; see the sketch below.
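
A minimal sketch of that two-step workaround, reusing the versions from the reproducer (the wait in the middle is the important part; exact timing varies):

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}]'
  # wait for RetrievedUpdates=True and for 4.8.2 to show up in status.availableUpdates
  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.2"}}]'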

Comment 4 W. Trevor King 2021-08-23 17:47:00 UTC
I had thought that update preconditions might have fallen into the same loop, but it turns out they are in a different loop, and we already poll the preconditions.  Confirming in 4.8.5, by setting an override [1]:

  $ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
  4.8.5
  $ cat <<EOF >version-patch-first-override.yaml
  > - op: add
  >   path: /spec/overrides
  >   value:
  >   - kind: Deployment
  >     group: apps/v1
  >     name: network-operator
  >     namespace: openshift-network-operator
  >     unmanaged: true
  > EOF
  $ oc patch clusterversion version --type json -p "$(cat version-patch-first-override.yaml)"
  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-08-23T17:33:38Z Upgradeable=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
  $ oc adm upgrade channel candidate-4.8  # requires a 4.9+ oc binary
  $ oc adm upgrade --to 4.8.6

Wait a bit for the download and preconditions. Then:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-08-23T17:28:53Z Available=True : Done applying 4.8.5
  2021-08-23T17:40:16Z Failing=True UpgradePreconditionCheckFailed: Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
  2021-08-23T17:39:51Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.8.6: it may not be safe to apply this update
  2021-08-23T17:39:29Z RetrievedUpdates=True : 
  2021-08-23T17:33:38Z Upgradeable=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

And we are polling those conditions, with multiple PreconditionsFailed counts:

  $ oc -n openshift-cluster-version get -o json events | jq -r '.items[] | select(.reason == "PreconditionsFailed") | .firstTimestamp + " " + (.count | tostring) + " " + .lastTimestamp + " " + .reason + ": " + .message'
  2021-08-23T17:40:10Z 3 2021-08-23T17:45:05Z PreconditionsFailed: preconditions failed for payload loaded version="4.8.6" image="quay.io/openshift-release-dev/ocp-release@sha256:e64c04c41ae7717fff4b341987ac37c313045d4c3aa7bb8c6bfe8bf8540a5025" failures=Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

So this bug is just about polling "does the new desiredUpdate.version appear in availableUpdates, so we can get the associated pullspec?", not about polling in later target-acceptance steps.

[1]: https://github.com/openshift/enhancements/blob/f97876821d3bb506d28fee565271d3bebbbc682c/dev-guide/cluster-version-operator/dev/clusterversion.md#setting-objects-unmanaged
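
To let future updates proceed after this experiment, the override can be cleared again; a sketch, assuming the override added above is the only entry in spec.overrides:

  $ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'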

Comment 5 Lalatendu Mohanty 2021-08-24 21:02:41 UTC
> So this bug is just about polling "does the new desiredUpdate.version appear in availableUpdates, so we can get the associated pullspec?", not about polling in later target-acceptance steps.

That's my understanding too.

Comment 6 W. Trevor King 2021-09-21 23:33:43 UTC
Poking around with a cluster-bot 4.8.11 cluster (so the channel is not set out of the box):

  $ oc get -o json clusterversion version | jq '{spec: (.spec | {channel, desiredUpdate}) , status: (.status | {availableUpdates, conditions: ([.conditions[] | select(.type == "Failing" or .type == "RetrievedUpdates")])})}'
  {
    "spec": {
      "channel": null,
      "desiredUpdate": null
    },
    "status": {
      "availableUpdates": null,
      "conditions": [
        {
          "lastTransitionTime": "2021-09-21T22:45:35Z",
          "status": "False",
          "type": "Failing"
        },
        {
          "lastTransitionTime": "2021-09-21T22:21:07Z",
          "message": "The update channel has not been configured.",
          "reason": "NoChannel",
          "status": "False",
          "type": "RetrievedUpdates"
        }
      ]
    }
  }

Now set the channel and target release at the same time:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.8"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.8.12"}}]'

After a bit:

  $ oc get -o json clusterversion version | jq '{spec: (.spec | {channel, desiredUpdate}) , status: (.status | {desired, availableUpdates, conditions: ([.conditions[] | select(.type == "Failing" or .type == "RetrievedUpdates" or .type == "Invalid")])})}'
  {
    "spec": {
      "channel": "fast-4.8",
      "desiredUpdate": {
        "version": "4.8.12"
      }
    },
    "status": {
      "desired": {
        "image": "registry.ci.openshift.org/ocp/release@sha256:26f9da8c2567ddf15f917515008563db8b3c9e43120d3d22f9d00a16b0eb9b97",
        "url": "https://access.redhat.com/errata/RHBA-2021:3429",
        "version": "4.8.11"
      },
      "availableUpdates": null,
      "conditions": [
        {
          "lastTransitionTime": "2021-09-21T22:45:35Z",
          "status": "False",
          "type": "Failing"
        },
        {
          "lastTransitionTime": "2021-09-21T22:21:07Z",
          "message": "The update channel has not been configured.",
          "reason": "NoChannel",
          "status": "False",
          "type": "RetrievedUpdates"
        },
        {
          "lastTransitionTime": "2021-09-21T23:00:54Z",
          "message": "The cluster version is invalid: spec.desiredUpdate.version: Invalid value: \"4.8.12\": when image is empty the update must be a previous version or an available update",
          "reason": "InvalidClusterVersion",
          "status": "True",
          "type": "Invalid"
        }
      ]
    }
  }

I had expected availableUpdates to get populated.  What's going on with that?

  $ oc -n openshift-cluster-version get pods
  NAME                                        READY   STATUS    RESTARTS   AGE
  cluster-version-operator-5c8745d67c-xblfx   1/1     Running   1          42m

Checking logs:

  $ oc -n openshift-cluster-version logs cluster-version-operator-5c8745d67c-xblfx | grep -1 available | tail -n4
  I0921 23:04:11.471613       1 cvo.go:483] Finished syncing cluster version "openshift-cluster-version/version" (502.446µs)
  I0921 23:04:11.471686       1 cvo.go:552] Started syncing available updates "openshift-cluster-version/version" (2021-09-21 23:04:11.471679877 +0000 UTC m=+1417.074284846)
  I0921 23:04:11.471832       1 cvo.go:554] Finished syncing available updates "openshift-cluster-version/version" (146.66µs)
  I0921 23:04:11.471901       1 cvo.go:574] Started syncing upgradeable "openshift-cluster-version/version" (2021-09-21 23:04:11.471894407 +0000 UTC m=+1417.074499388)

Aha, because Operator.availableUpdatesSync is calling ValidateClusterVersion [1], and failing when the given version is not yet listed in availableVersions [2].  We want to special-case that issue, and continue on to call syncAvailableUpdates [3] even for the 'len(u.Version) > 0 && len(u.Image) == 0' cases.

However availableUpdatesSync is not the only ValidateClusterVersion consumer:

  $ git --no-pager grep 'func \|ValidateClusterVersion' | grep -B1 '[.]ValidateClusterVersion'
  pkg/cvo/cvo.go:func (optr *Operator) sync(ctx context.Context, key string) error {
  pkg/cvo/cvo.go: errs := validation.ValidateClusterVersion(original)
  pkg/cvo/cvo.go:func (optr *Operator) availableUpdatesSync(ctx context.Context, key string) error {
  pkg/cvo/cvo.go: if errs := validation.ValidateClusterVersion(config); len(errs) > 0 {
  pkg/cvo/cvo.go:func (optr *Operator) upgradeableSync(ctx context.Context, key string) error {
  pkg/cvo/cvo.go: if errs := validation.ValidateClusterVersion(config); len(errs) > 0 {

I dunno why upgradeableSync feels the need for this guard.  Blame says we've had it there since the function was created [4], but none of the checks seem particularly relevant to the Upgradeable collection (where a check does care about the cluster version, I'd expect it to care about status properties, not spec properties).

So I think a reasonable plan for this bug would be:

* Drop the ValidateClusterVersion guard from upgradeableSync.
* Shift the 'len(u.Version) > 0 && len(u.Image) == 0' guard from ValidateClusterVersion to Operator.sync.

Then availableUpdatesSync will no longer trip over that guard, and we'll continue to poll the upstream (when a channel is set) to get fresh update recommendations.  And Operator.sync will block on the inability to find the version in available updates (which it needs to do because the caller didn't specify an image pullspec) until we eventually get an availableUpdates entry that matches the version, after which the update will begin.
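
A rough way to watch that happen from the outside once the fix lands (a sketch; these read-only checks are not part of the fix itself):

  $ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")].status}{"\n"}'
  $ oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}{"\n"}'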

Also, "Invalid" may deserve a more specific condition type name (OwnSpecInvalid?), and probably needs covering alerts and all that like we give to Failing.  Although it looks like reconciliation is continuing without issue while the CVO complains about the invalid spec:

  $ date --utc --iso=m
  2021-09-21T23:33+00:00
  $ oc -n openshift-cluster-version logs cluster-version-operator-5c8745d67c-xblfx | grep 'Running sync.*in state\|Result of work' | tail -n2
  I0921 23:30:57.096586       1 sync_worker.go:541] Running sync 4.8.11 (force=false) on generation 2 in state Reconciling at attempt 0
  I0921 23:31:22.463061       1 task_graph.go:555] Result of work: []

[1]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/pkg/cvo/cvo.go#L592
[2]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/lib/validation/validation.go#L42
[3]: https://github.com/openshift/cluster-version-operator/blob/e816c118ac608f131d24b28d617e91d9d5cc34a6/pkg/cvo/cvo.go#L595
[4]: https://github.com/openshift/cluster-version-operator/commit/04528144feb7a8141801bce591fda43d65acc48a#diff-490d2318856a4a078992ebab5b3f70db6b2c074dee480aa0112dc7c52e37550eR463

Comment 8 Yang Yang 2021-11-10 05:58:03 UTC
Reproduced it:

1. Install a 4.8 cluster
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.19    True        False         20h     Cluster version is 4.8.19

2. Patch to update the channel and desired version
# oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.9"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.9.6"}}]'

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.19    True        False         20h     Stopped at 4.8.19: the cluster version is invalid

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-11-09T08:56:50Z"
    generation: 2
    name: version
    resourceVersion: "473050"
    uid: 51dd6fbb-966e-41f1-bd15-05cbde4cd5ad
  spec:
    channel: candidate-4.9
    clusterID: 9331eba0-85a8-4a94-af81-739f89c70c97
    desiredUpdate:
      version: 4.9.6
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-11-09T09:22:55Z"
      message: Done applying 4.8.19
      status: "True"
      type: Available
    - lastTransitionTime: "2021-11-10T04:49:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2021-11-09T09:22:55Z"
      message: 'Stopped at 4.8.19: the cluster version is invalid'
      reason: InvalidClusterVersion
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-11-09T08:56:50Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster
        version 4.8.19 not found in the "stable-4.8" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-11-09T08:57:20Z"
      message: |
        Kubernetes 1.22 and therefore OpenShift 4.9 remove several APIs which require admin consideration. Please see
        the knowledge article https://access.redhat.com/articles/6329921 for details and instructions.
      reason: AdminAckRequired
      status: "False"
      type: Upgradeable
    - lastTransitionTime: "2021-11-10T05:53:47Z"
      message: 'The cluster version is invalid: spec.desiredUpdate.version: Invalid
        value: "4.9.6": when image is empty the update must be a previous version
        or an available update'
      reason: InvalidClusterVersion
      status: "True"
      type: Invalid
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:ac19c975be8b8a449dedcdd7520e970b1cc827e24042b8976bc0495da32c6b59
      url: https://access.redhat.com/errata/RHBA-2021:4109
      version: 4.8.19
    history:
    - completionTime: "2021-11-09T09:22:55Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:ac19c975be8b8a449dedcdd7520e970b1cc827e24042b8976bc0495da32c6b59
      startedTime: "2021-11-09T08:56:50Z"
      state: Completed
      verified: false
      version: 4.8.19
    observedGeneration: 1
    versionHash: oJVcBisP_Ao=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 10 Yang Yang 2021-11-11 13:42:27 UTC
Verifying with:
# ./oc version
Client Version: 4.10.0-0.nightly-2021-11-09-181140
Server Version: 4.10.0-0.nightly-2021-11-09-181140
Kubernetes Version: v1.22.1+1b2affc

Patch to change the Cincinnati (update service) upstream:
# ./oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

# ./oc adm upgrade 
Cluster version is 4.10.0-0.nightly-2021-11-09-181140

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.

# ./oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "nightly-4.10"}, {"op": "add", "path": "/spec/desiredUpdate", "value": {"version": "4.10.0-0.nightly-2021-11-11-072405"}}]'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          40s     Unable to apply 4.10.0-0.nightly-2021-11-11-072405: the image may not be safe to use

cv conditions:
conditions:
    - lastTransitionTime: "2021-11-11T06:51:16Z"
      message: Done applying 4.10.0-0.nightly-2021-11-09-181140
      status: "True"
      type: Available
    - lastTransitionTime: "2021-11-11T13:33:46Z"
      message: 'The update cannot be verified: unable to locate a valid signature
        for one or more sources'
      reason: ImageVerificationFailed
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-11-11T13:33:44Z"
      message: 'Unable to apply 4.10.0-0.nightly-2021-11-11-072405: the image may
        not be safe to use'
      reason: ImageVerificationFailed
      status: "True"
      type: Progressing
    - lastTransitionTime: "2021-11-11T13:31:39Z"
      status: "True"
      type: RetrievedUpdates


The upgrade is no longer blocked by the invalid-version error. Moving it to the verified state.

Comment 11 Yang Yang 2021-11-23 03:57:50 UTC
Hi Lala, 

With this fix, the CVO supports the upgrade when the channel and the desired version are patched at the same time. But from the oc client's perspective, if we run oc adm upgrade channel and oc adm upgrade --to in parallel, the oc adm upgrade --to prompts an error because the available-updates list has not been resolved yet. Would you address this so that oc can change the channel and upgrade the cluster at the same time?

# ./oc adm upgrade channel nightly-4.10; ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-20-181820
warning: No channels known to be compatible with the current version "4.10.0-0.nightly-2021-11-20-143156"; unable to validate "nightly-4.10". Setting the update channel to "nightly-4.10" anyway.
error: No available updates, specify --to-image or wait for new updates to be available

Thanks.

Comment 12 W. Trevor King 2021-11-23 05:02:51 UTC
If we wanted to adjust oc, I think that would be a separate ticket.  The cluster-version operator is a long-running process, so it's a fairly low-level change to have it retry where it used to stick before.  But oc calls are one-shot on the client side, and we probably don't want to teach it to retry in the expectation that the --to target being passed in will soon show up as an available update.  And there's currently no path through the "that's not an available update" guards around --to [1].  You could use --to-image today, possibly in conjunction with --allow-explicit-upgrade, which is the recommended approach for clusters where there is no upstream update service (or when folks are testing updates that are not recommended).  It's possible that oc could grow something like "when --to is not an available update and --allow-explicit-upgrade is set, just set the desired target version and don't worry about the lack of a pullspec", but again, that's fiddly enough that I think it deserves its own, separate ticket.

[1]: https://github.com/openshift/oc/blob/b996c1021930d711ebf608f8c4c8ac77fecb1cbe/pkg/cli/admin/upgrade/upgrade.go#L241-L249
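
For completeness, a minimal sketch of that --to-image path (the pullspec is a placeholder; it must be the digest pullspec of the release you actually want, looked up separately, e.g. via oc adm release info):

  $ oc adm upgrade --allow-explicit-upgrade --to-image <release-image-pullspec>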

Comment 13 Yang Yang 2021-11-23 07:23:34 UTC
Thanks, Trevor. I totally agree that it's a separate ticket. Can we create a Jira ticket in the OTA project for further discussion?

Comment 14 Lalatendu Mohanty 2021-12-02 14:35:19 UTC
If you want to, you can create a Bugzilla or a Jira; either works for me. IMO this is a low-severity bug, because I do not expect users to run oc adm upgrade channel and oc adm upgrade --to in parallel.

Comment 18 errata-xmlrpc 2022-03-12 04:37:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

