Bug 1822752
| Field | Value |
| --- | --- |
| Summary | cluster-version operator stops applying manifests when blocked by a precondition check |
| Product | OpenShift Container Platform |
| Reporter | W. Trevor King <wking> |
| Component | Cluster Version Operator |
| Assignee | Jack Ottofaro <jack.ottofaro> |
| Status | CLOSED ERRATA |
| QA Contact | liujia <jiajliu> |
| Severity | low |
| Docs Contact | |
| Priority | low |
| Version | 4.3.z |
| CC | aos-bugs, bleanhar, jack.ottofaro, jdee, lmohanty, mzali, pmahajan, yanyang |
| Target Milestone | --- |
| Keywords | Upgrades |
| Target Release | 4.11.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | Previously, if the CVO encountered an error while attempting to load a new release for an upgrade, such as a release verification failure, it would stop reconciling the current release's manifests. With this change, release loading is separate from reconciling, so the latter is no longer blocked by the former. A new condition, ReleaseAccepted, has also been introduced to report the status of the release load. |
| Story Points | --- |
| Clone Of | |
| | 1822922 2064991 (view as bug list) |
| Environment | |
| Last Closed | 2022-08-10 10:35:34 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1822922, 2064991 |
Description (W. Trevor King, 2020-04-09 18:51:34 UTC)
Also, without dipping into .status.history, there is nothing in ClusterVersion to show that we're actually still running 4.3.10:

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 3
  name: version
  resourceVersion: "27378"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:28:19Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:27:49Z"
    message: 'Unable to apply 4.3.11: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  history:
  - completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 3
  versionHash: vSLGMQhseGg=

Ok, so there is an unforced-ish way out:

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade
error: Already upgrading, pass --allow-upgrade-with-warnings to override.
  Reason: UpgradePreconditionCheckFailed
  Message: Unable to apply 4.3.11: it may not be safe to apply this update

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade --allow-upgrade-with-warnings
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 4
  name: version
  resourceVersion: "34618"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: ""
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:55:16Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:55:34Z"
    message: Cluster version is 4.3.10
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: 4.3.10
  history:
  - completionTime: "2020-04-09T18:55:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T18:55:16Z"
    state: Completed
    verified: false
    version: 4.3.10
  - completionTime: "2020-04-09T18:55:16Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 4
  versionHash: vSLGMQhseGg=

This should be backported to at least 4.3 when we fix this.

We also need this in 4.4.

We also need this backported to at least 4.3, and possibly as far back as we have supported 4.y.

Also, Lala and I are working through getting this implemented, so taking the assignee out of Abhinav's bucket.

Also in this space, acting to resolve a precondition issue (e.g. un-modifying SCCs as outlined in [1]) does seem to unstick the update (because the CVO is not polling the preconditions to see if they are still failing or not). Tested in 4.3.10 -> 4.3.11 by setting up the blocked update following comment 0 and then un-modifying the SCCs via [1] and then 'oc apply -f default-sccs.yaml'.
After that, the kube-apiserver ClusterOperator goes back to being happy quickly. The CVO takes a while to notice the change, but eventually does notice and begins the update:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-13T18:49:44Z RetrievedUpdates True
2020-04-13T19:11:59Z Available True Done applying 4.3.10
2020-04-13T19:15:44Z Progressing True Working towards 4.3.11: 24% complete
2020-04-13T19:19:10Z Failing False

Not sure what the delay is from the CVO side yet.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1821905#c22

Filling in some CVO implementation details:

* Synchronization happens via repeated syncOnce() attempts [1].
* syncOnce operates on work.Desired, performs validation and precondition checks, and then drops into parallel manifest application [2]. That makes it hard to address this issue in parallel while an existing release is being applied.

Looking into how Desired gets set:

* The operator's sync loop checks ClusterVersion looking for any changes to the spec [3].
* Update(...) is called with the desired update [4].
* Update cancels any previous workers [5].
* Update notifies the workers about new work [6].

I'm still a bit fuzzy on how the CVO deployment itself gets updated to run the new image. Seems related to CVOManifestDir vs. ReleaseManifestDir, but so far I haven't wrapped my head around the various layers of indirection.

[1]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L108-L126
[2]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L467-L543
[3]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/cvo.go#L465-L493
[4]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L204
[5]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L231
[6]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L235

This bug is important and we want to fix it, but we will probably not have time to close it out this sprint. I'm adding UpcomingSprint now, and we'll revisit next sprint.

[1] is a baby step in this direction, but is still waiting on review; UpcomingSprint.

[1]: https://github.com/openshift/cluster-version-operator/pull/349

cvo#349 landed :). Will get the next step in this direction up next sprint.

Sprint is over. Assorted Context work that landed in this sprint (e.g. [1,2]) moves us in a good direction, but there is still more work to do.

[1]: https://github.com/openshift/cluster-version-operator/pull/410
[2]: https://github.com/openshift/cluster-version-operator/pull/420

Work consolidating goroutine handling continues in [1]. Once that lands we may be close enough to move on this particular bug. But sprint ends today, and this won't happen by then ;).

[1]: https://github.com/openshift/cluster-version-operator/pull/424

Lala's been working on this, but no PR up yet. Hopefully during the next sprint.

I'm still optimistic that we will get this this sprint, but it's not strictly a 4.6 GA blocker, so moving to 4.7.

Lala's been working on this, but no PR up yet. Hopefully during the next sprint.

Comment 16 is still current. Not a blocker.
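To make the ordering problem concrete, here is a small, self-contained Go sketch of the flow the implementation notes above describe, and of the reordering proposed in the next comment. Every name in it (worker, startWorker, verify) is hypothetical; it is a toy model of the cancel-before-verify behavior, not the actual CVO sync worker.

```go
// syncorder.go: a toy model of the ordering described in the CVO notes above.
// All names are hypothetical; this is not CVO code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type worker struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// startWorker repeatedly "applies manifests" for the given release until cancelled.
func startWorker(release string) *worker {
	ctx, cancel := context.WithCancel(context.Background())
	w := &worker{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(w.done)
		for {
			select {
			case <-ctx.Done():
				return
			case <-time.After(200 * time.Millisecond):
				fmt.Println("reconciling manifests for", release)
			}
		}
	}()
	return w
}

// verify stands in for signature verification / precondition checks.
func verify(release string) error {
	time.Sleep(time.Second)
	return errors.New("unable to locate a valid signature for " + release)
}

func main() {
	current := startWorker("4.3.10")

	// Pre-fix ordering (roughly what the notes above describe): the old
	// worker is cancelled before the new release is verified, so nothing
	// is reconciled while verification keeps failing.
	current.cancel()
	<-current.done
	if err := verify("4.3.11"); err != nil {
		fmt.Println("blocked:", err)
		// Nothing is running here: 4.3.10 is no longer reconciled and
		// 4.3.11 was never started. This is the reported bug.
	}

	// Post-fix ordering (what this bug asks for): verify first, and only
	// swap workers once the new release has been accepted.
	current = startWorker("4.3.10")
	if err := verify("4.3.12"); err != nil {
		fmt.Println("release not accepted, keep reconciling 4.3.10:", err)
	} else {
		current.cancel()
		<-current.done
		startWorker("4.3.12")
	}
	time.Sleep(500 * time.Millisecond) // let the still-running worker print a line
	current.cancel()
}
```

The point of the sketch is only the ordering: once verification is moved ahead of the cancel, a verification failure leaves the old worker running, which is what the eventual fix (and the ReleaseAccepted condition mentioned in the Doc Text) reports.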
Clearing again, because we want to make sure the bot restores the flag...

Bot added it back :). Setting to blocker- to show that it's not a blocker.

I don't think Lala ever got a PR up. Moving back to NEW until someone has time to work on this.

We have not seen many complaints about the issue, hence reducing the severity of the issue to medium. We may move this to an RFE if we don't address it soon.

Moving back to NEW, because we can't assign this to the team.

Steps to reproduce, in a cluster-bot 4.9 cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.7.9

Update to a CI build, which won't be trusted by a signature 4.7.9 trusts:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0

CVO complains about not being able to update:

$ oc adm upgrade
info: An upgrade is in progress. Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
...

Wait 10 minutes... Confirm that it's been a while:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-05-10T20:15:32Z Available=True -: Done applying 4.7.9
2021-05-10T20:47:18Z RetrievedUpdates=True -: -
2021-05-10T20:51:21Z Failing=True ImageVerificationFailed: The update cannot be verified: unable to locate a valid signature for one or more sources
2021-05-10T20:51:21Z Progressing=True ImageVerificationFailed: Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
$ date --utc --iso=m
2021-05-10T21:02+00:00

Confirm that we haven't attempted to reconcile any manifests in the interim:

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-747f97b7d4-n99jm   1/1     Running   1          73m
$ oc -n openshift-cluster-version logs cluster-version-operator-747f97b7d4-n99jm | grep 'Running sync for.* of ' | tail -n1
I0510 20:48:31.392910       1 sync_worker.go:769] Running sync for role "openshift-marketplace/openshift-marketplace-metrics" (516 of 668)

The issue is that we currently:

1. See the user bump spec.desiredUpdate.
2. Cancel the old sync loop [1].
3. Try to verify the new release [2] (so during this time we are no longer reconciling the old release target).

Instead we want:

1. See the user bump spec.desiredUpdate.
2. Try to verify the new release (so during this time we continue to reconcile the old release target).
3. Cancel the old sync loop and start reconciling the new release target.

[1]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L245
[2]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L560

Reproduced the issues from the functional-test side on 4.10.0-0.nightly-2021-11-09-181140.

Scenario 1 (corresponding to the issue in the description):

1. Install a v4.10 cluster.
2. Patch cv with an internal upstream.
3. Pick an available version without a signature.
# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
Updating to 4.10.0-0.nightly-2021-11-10-212548

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Available Updates:
VERSION                              IMAGE
4.10.0-0.nightly-2021-11-10-212548   registry.ci.openshift.org/ocp/release@sha256:b15acfa35c303c15148e1032774c91df0b38ea2b3efee4d8c408777d64467c70

// Above shows the cluster is still at 4.10.0-0.nightly-2021-11-09-181140 and 4.10.0-0.nightly-2021-11-10-212548 is still in the available list.

4. Go on to upgrade to 4.10.0-0.nightly-2021-11-10-212548.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
info: Cluster is already at version 4.10.0-0.nightly-2021-11-10-212548
(// Here the information is not correct, which hints at a wrong cluster version.)

5. Let's go back to 4.10.0-0.nightly-2021-11-09-181140.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-09-181140
error: The update 4.10.0-0.nightly-2021-11-09-181140 is not one of the available updates: 4.10.0-0.nightly-2021-11-10-212548

6. Then we can cancel the upgrade. (// Could we mention the cancel option in the above message, in case users don't know how to get out of the predicament?)

# ./oc adm upgrade --clear
Cleared the update field, still at 4.10.0-0.nightly-2021-11-10-212548
(// The version is still not correct, and the message is still confusing, because the cancel actually happened and the cluster will go back to the original version.)

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          23m     Working towards 4.10.0-0.nightly-2021-11-09-181140: 363 of 756 done (48% complete)
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        False         4m12s   Cluster version is 4.10.0-0.nightly-2021-11-09-181140

Scenario 2 (corresponding to comment 25):

1. Install a v4.10 cluster.
2. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795       1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

3. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

4. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

5. Wait for 10 min and check that the deployment was not reconciled.
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795       1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

Expected action: during the blocked time, the CVO continues to reconcile the old release target, since the cluster is still at the original version.

Checked a cluster launched by cluster-bot: 4.10,openshift/cluster-version-operator#683

Scenario 2 - verified:

1. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:02:53.310469       1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

2. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                                               AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-0.test-2021-11-17-015750-ci-ln-pb6kg2k-latest   True        True          41s     Unable to apply registry.ci.openshift.org/ocp/release@sha256:1c905c1a02fd2f0ceb65c8e17dd2b6ed40e0b59d07707f8ba4c122b924068107: the image may not be safe to use

3. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

4. Wait for several minutes and check that the deployment was reconciled.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:11:33.593901       1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

So the issue for scenario 2 should be fixed in pr683 now. As for the issue for scenario 1, I think the fix should come from oc. Hi Jack, would you mind if I file a new bug to track the oc issue for scenario 1, or would you like to keep this bug tracking both issues?

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.

According to comment 28, mark the bug as "tested".

(In reply to liujia from comment #29)
> Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.
>
> According to comment 28, mark the bug as "tested".

Hi Jia Liu,

Yeah, opening the new bug is fine with me. I'm still working on the original issue but keep getting pulled to other tasks.

PR #683 was updated to WIP for further work, so the bug needs to be verified again. Remove `Tested` label.

We're going to target this for 4.11.

*** Bug 1826115 has been marked as a duplicate of this bug. ***

Tried to verify it on 4.11.0-0.nightly-2022-02-24-054925 following comment 28, but it seems the CVO's behavior changed in v4.11, so I could not verify it with the v4.10 steps.

1. Upgrade the cluster to an unsigned payload.
# ./oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb

2. Check the upgrade status with `oc adm upgrade`; it returns nothing about the upgrade status. (A regression?)

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-24-054925   True        False         25m     Cluster version is 4.11.0-0.nightly-2022-02-24-054925
# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-02-24-054925

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-02-24-054925 not found in the "stable-4.11" channel

3. Check further in the cv conditions to find that the image check failed and the initial upgrade is not in Progressing status.

- lastTransitionTime: "2022-03-02T04:51:42Z"
  message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
  reason: RetrievePayload
  status: "False"
  type: ReleaseAccepted
...
- lastTransitionTime: "2022-03-02T04:52:40Z"
  message: Cluster version is 4.11.0-0.nightly-2022-02-24-054925
  status: "False"
  type: Progressing

As shown above, the cluster does not seem to be in a blocked status for the precondition check.

@Jack Is this expected behavior in v4.11? If so, there will not be a blocked status for the cluster; once the precondition check fails, the update stops and the CVO goes back to reconcile status, right?

Anyway, let's continue with a regression test while it's in reconcile status.

1. Patch maxUnavailable of the marketplace-operator deployment.

# ./oc patch -n openshift-marketplace deployment/marketplace-operator --type=json -p '[{"op": "replace", "path": "/spec/strategy/rollingUpdate/maxUnavailable", "value": "50%"}]'
deployment.apps/marketplace-operator patched
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
2. Wait for several minutes and check that the resource is reconciled back to 25%.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-77479cd88b-pc455|grep 'Running sync for deployment.*openshift-marketplace'|tail -n5
I0302 06:45:04.369528       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:49:32.957162       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:54:05.395362       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:58:33.889240       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 07:03:06.329970       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)

Reconcile works well.

(In reply to liujia from comment #39)
> 2. Check the upgrade status with `oc adm upgrade`; it returns nothing about the upgrade status. (A regression?)
> ...
> - lastTransitionTime: "2022-03-02T04:51:42Z"
>   message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
>   reason: RetrievePayload
>   status: "False"
>   type: ReleaseAccepted

We should probably teach 'oc adm upgrade' to include the new ReleaseAccepted condition. I don't think that blocks us from verifying this CVO bug, though.

> Is this expected behavior in v4.11?

This is expected. Per comment 0 and comment 25, the issue this bug was aimed at was the CVO giving up on the current version once it began considering the new version. And your Deployment patch getting stomped shows that that bug has been fixed.

> If so, there will not be a blocked status for the cluster; once the precondition check fails, the update stops and the CVO goes back to reconcile status, right?

I'm having trouble parsing this. Can you rephrase?

> I'm having trouble parsing this. Can you rephrase?
Let me try to rephrase it in detail.
In v4.10, when trying to upgrade to an unsigned payload, the upgrade is blocked on the precondition check (with PROGRESSING=True), and the issue is that the CVO gives up syncing the Deployment while it keeps trying to reach the new version (PROGRESSING=True).
# ./oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2021-11-09-181140 True True 14m Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use
Now in v4.11's verification, we still tried to upgrade to an unsigned payload, but the upgrade gives up directly (with PROGRESSING=False) due to the precondition check. I'm not sure we can still call it a blocked status, because it looks like the CVO also gives up on the new version and stays at the current version.
# ./oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-02-24-054925 True False 25m Cluster version is 4.11.0-0.nightly-2022-02-24-054925
In this situation, I think it's the same as a normal cluster's reconcile when doing the Deployment sync. So I wonder if this verification looks good to you.
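Since this exchange turns on how the new ReleaseAccepted condition surfaces (it is not yet shown by `oc adm upgrade`, per the follow-up below), here is a minimal Go sketch of reading it directly from the ClusterVersion object. This is illustrative only, not code from the CVO or oc; the import paths and client calls assume the published openshift/client-go module, and the kubeconfig location is an assumption.

```go
// A minimal, illustrative reader for the ReleaseAccepted condition.
// Not CVO or oc code; paths assume the published openshift/client-go module.
package main

import (
	"context"
	"fmt"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The ClusterVersion object is a singleton named "version".
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// ReleaseAccepted=False means the CVO could not load (retrieve/verify) the
	// release named in spec.desiredUpdate; reconciliation of the current
	// release continues regardless, which is the behavior verified above.
	for _, c := range cv.Status.Conditions {
		if c.Type == "ReleaseAccepted" {
			fmt.Printf("ReleaseAccepted=%s reason=%s message=%s\n", c.Status, c.Reason, c.Message)
			return
		}
	}
	fmt.Println("no ReleaseAccepted condition found (pre-4.11 cluster?)")
}
```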
(In reply to liujia from comment #41)
> > I'm having trouble parsing this. Can you rephrase?
> Let me try to rephrase it in detail.
> ...
> In this situation, I think it's the same as a normal cluster's reconcile when doing the Deployment sync. So I wonder if this verification looks good to you.

With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.

> With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.

So we can say there is not any blocked-update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?

(In reply to liujia from comment #43)
> So we can say there is not any blocked-update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

Yes

> BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?

Created https://issues.redhat.com/browse/OTA-589

According to comment 39 and comment 44, move the bug to verified.

Case ocp-46017 added; remove tag.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069