For example, blocking on 4.3.10 -> 4.3.11 via the mechanism in bug 1821905:

$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
...no output...
$ oc patch scc privileged --type json -p '[{"op": "add", "path": "/users/-", "value": "kubeadmin"}]'
$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
2020-04-09T18:12:01Z kube-apiserver DefaultSecurityContextConstraints_Mutated
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.3"}]'
$ oc adm upgrade --to 4.3.11
Updating to 4.3.11
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-09T18:00:34Z Available True Done applying 4.3.10
2020-04-09T18:13:10Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-09T18:27:21Z RetrievedUpdates True
2020-04-09T18:27:49Z Progressing True Unable to apply 4.3.11: it may not be safe to apply this update
2020-04-09T18:28:19Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
$ oc get -o json clusterversion version | jq -r '.status.availableUpdates[].version'
4.3.11
$ oc adm upgrade --to 4.3.10
error: The update 4.3.10 is not one of the available updates: 4.3.11
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator")).metadata.name'
cluster-version-operator-7bbc4c5dcc-w287k
$ oc -n openshift-cluster-version get -o json pods | jq -r '.items[] | select(.metadata.name == "cluster-version-operator-7bbc4c5dcc-w287k").spec.containers[].image'
registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
$ oc image info -o json registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd | jq -r '.config.config.Labels["io.openshift.release"]'
4.3.10
$ oc -n openshift-cluster-version logs cluster-version-operator-7bbc4c5dcc-w287k
I0409 17:38:45.549400 1 start.go:19] ClusterVersionOperator v4.3.10-202003311428-dirty
...
I0409 18:26:05.631324 1 sync_worker.go:634] Done syncing for clusteroperator "service-ca" (388 of 498)
I0409 18:26:05.631365 1 task_graph.go:524] Graph is complete
I0409 18:26:05.631390 1 task_graph.go:587] No more work for 1
I0409 18:26:05.631402 1 task_graph.go:587] No more work for 0
I0409 18:26:05.631414 1 task_graph.go:603] Workers finished
I0409 18:26:05.631425 1 task_graph.go:611] Result of work: []
...
I0409 18:27:47.543958 1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 0
...
I0409 18:40:26.944180 1 sync_worker.go:471] Running sync 4.3.11 (force=false) on generation 3 in state Updating at attempt 4
I0409 18:40:26.944219 1 sync_worker.go:477] Loading payload
I0409 18:40:27.247572 1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/0vmj3337PDbtCKhSPEWayQ"
E0409 18:40:27.549821 1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

So that is the 4.3.10 CVO saying:

> They've asked me to update to 4.3.11, let me stop applying manifests and take a look at the preconditions. Oh no! A precondition check is failing! I will complain about it until it gets fixed, but in the meantime I will do nothing about manifest reconciliation and hope that nobody in the cluster is stomping on manifests which I'm supposed to be monitoring.

I think that's a bug, and that we want the CVO pod to continue to reconcile manifests while it works through update preconditions to vet the proposed target. It should also be possible for admins to say "ah, precondition failed, please forget I asked and return to the source version". As the "error: The update 4.3.10 is not one of the available updates: 4.3.11" shows, that is not currently possible either (without using --force or other risky stuff).
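One way to confirm that per-manifest reconciliation has actually stalled while the precondition keeps failing is to look at the timestamp of the CVO's most recent "Done syncing for ..." log line (a hedged example, reusing the pod name and log wording from the transcript above; the exact log format can differ between releases):

$ oc -n openshift-cluster-version logs cluster-version-operator-7bbc4c5dcc-w287k | grep 'Done syncing for' | tail -n1

If that timestamp stops advancing while Failing=True is set, the CVO is no longer reconciling the 4.3.10 manifests it is supposed to be managing.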
Also, without dipping into .status.history, there is nothing in ClusterVersion to show that we're actually still running 4.3.10:

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 3
  name: version
  resourceVersion: "27378"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:28:19Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:27:49Z"
    message: 'Unable to apply 4.3.11: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  history:
  - completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 3
  versionHash: vSLGMQhseGg=
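As a hedged aside (this is just a jq filter over the history shown above, not an official interface), the version the cluster is actually running can be pulled out by selecting the most recent Completed history entry, which for the state above reports 4.3.10:

$ oc get -o json clusterversion version | jq -r '[.status.history[] | select(.state == "Completed")][0].version'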
Ok, so there is an unforced-ish way out:

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade
error: Already upgrading, pass --allow-upgrade-with-warnings to override.

  Reason: UpgradePreconditionCheckFailed
  Message: Unable to apply 4.3.11: it may not be safe to apply this update

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade --allow-upgrade-with-warnings
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 4
  name: version
  resourceVersion: "34618"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: ""
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:55:16Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:55:34Z"
    message: Cluster version is 4.3.10
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: 4.3.10
  history:
  - completionTime: "2020-04-09T18:55:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T18:55:16Z"
    state: Completed
    verified: false
    version: 4.3.10
  - completionTime: "2020-04-09T18:55:16Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 4
  versionHash: vSLGMQhseGg=
This should be backported to at least 4.3 when we fix this.
We also need this in 4.4.
We also need this backported to at least 4.3, and possibly as far back as the oldest 4.y we still support. Also, Lala and I are working through getting this implemented, so I'm moving the assignment out of Abhinav's bucket.
Also in this space, acting to resolve a precondition issue (e.g. un-modifying the SCCs as outlined in [1]) does seem to unstick the update, because the CVO re-checks the preconditions on each retry attempt to see whether they are still failing. Tested in 4.3.10 -> 4.3.11 by setting up the blocked update following comment 0 and then un-modifying the SCCs with 'oc apply -f default-sccs.yaml', using the manifests from [1]. After that, the kube-apiserver ClusterOperator quickly goes back to being happy. The CVO takes a while to notice the change, but eventually does notice and begins the update:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-13T18:49:44Z RetrievedUpdates True
2020-04-13T19:11:59Z Available True Done applying 4.3.10
2020-04-13T19:15:44Z Progressing True Working towards 4.3.11: 24% complete
2020-04-13T19:19:10Z Failing False

Not sure yet what causes the delay on the CVO side.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1821905#c22
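For reference, the same clusteroperators query from comment 0 can be used to confirm that the Upgradeable=False condition has cleared after restoring the SCCs (a hedged example; once the defaults are back it should print nothing):

$ oc get -o json clusteroperators | jq -r '.items[] | .upgradeable = ([.status.conditions[] | select(.type == "Upgradeable")][0]) | select(.upgradeable.status == "False") | .upgradeable.lastTransitionTime + " " + .metadata.name + " " + .upgradeable.reason' | sort
...no output...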
Filling in some CVO implementation details:

* Synchronization happens via repeated syncOnce() attempts [1].
* syncOnce operates on work.Desired, performs validation and precondition checks, and then drops into parallel manifest application [2]. That makes it hard to keep reconciling the existing release in parallel while a proposed release is still being vetted.

Looking into how Desired gets set:

* The operator's sync loop checks ClusterVersion looking for any changes to the spec [3].
* Update(...) is called with the desired update [4].
* Update cancels any previous workers [5].
* Update notifies the workers about new work [6].

I'm still a bit fuzzy on how the CVO deployment itself gets updated to run the new image. It seems related to CVOManifestDir vs. ReleaseManifestDir, but so far I haven't wrapped my head around the various layers of indirection.

[1]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L108-L126
[2]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L467-L543
[3]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/cvo.go#L465-L493
[4]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L204
[5]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L231
[6]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L235
This bug is important and we want to fix it, but we will probably not have time to close it out this sprint. I'm adding UpcomingSprint now, and we'll revisit next sprint.
[1] is a baby step in this direction, but is still waiting on review; UpcomingSprint.

[1]: https://github.com/openshift/cluster-version-operator/pull/349
cvo#349 landed :). Will get the next step in this direction up next sprint.
Sprint is over. Assorted Context work that landed in this sprint (e.g. [1,2]) moves us in a good direction, but still more work to do.

[1]: https://github.com/openshift/cluster-version-operator/pull/410
[2]: https://github.com/openshift/cluster-version-operator/pull/420
Work consolidating goroutine handling continues in [1]. Once that lands, we may be close enough to move on this particular bug. But the sprint ends today, and this won't happen by then ;).

[1]: https://github.com/openshift/cluster-version-operator/pull/424
Lala's been working on this, but no PR up yet. Hopefully during the next sprint.
I'm still optimistic that we will get this done this sprint, but it's not strictly a 4.6 GA blocker, so I'm moving it to 4.7.
Comment 16 is still current.
Not a blocker.
Clearing again, because we want to make sure the bot restores the flag...
Bot added it back :). Setting to blocker- to show that it's not a blocker.
I don't think Lala ever got a PR up. Moving back to NEW until someone has time to work on this.
We have not seen many complaints about this issue, so I'm reducing its severity to medium.
We may move this to an RFE if we don't address it soon.
Moving back to NEW, because we can't assign this to the team.
Steps to reproduce, in a cluster-bot 4.7.9 cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.7.9

Update to a CI build, which won't be trusted by a signature 4.7.9 trusts:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0

The CVO complains about not being able to update:

$ oc adm upgrade
info: An upgrade is in progress. Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
...

Wait 10 minutes... Confirm that it's been a while:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-05-10T20:15:32Z Available=True -: Done applying 4.7.9
2021-05-10T20:47:18Z RetrievedUpdates=True -: -
2021-05-10T20:51:21Z Failing=True ImageVerificationFailed: The update cannot be verified: unable to locate a valid signature for one or more sources
2021-05-10T20:51:21Z Progressing=True ImageVerificationFailed: Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
$ date --utc --iso=m
2021-05-10T21:02+00:00

Confirm that we haven't attempted to reconcile any manifests in the interim:

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-747f97b7d4-n99jm   1/1     Running   1          73m
$ oc -n openshift-cluster-version logs cluster-version-operator-747f97b7d4-n99jm | grep 'Running sync for.* of ' | tail -n1
I0510 20:48:31.392910 1 sync_worker.go:769] Running sync for role "openshift-marketplace/openshift-marketplace-metrics" (516 of 668)

The issue is that we currently:

1. See the user bump spec.desiredUpdate.
2. Cancel the old sync loop [1].
3. Try to verify the new release [2] (so during this time we are no longer reconciling the old release target).

Instead we want:

1. See the user bump spec.desiredUpdate.
2. Try to verify the new release (so during this time we continue to reconcile the old release target).
3. Cancel the old sync loop and start reconciling the new release target.

[1]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L245
[2]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L560
Reproduced the issues from the function test side on 4.10.0-0.nightly-2021-11-09-181140.

Scenario 1 (corresponding to the issue in the description):

1. Install a v4.10 cluster.
2. Patch the ClusterVersion with an internal upstream.
3. Pick an available version without a signature.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
Updating to 4.10.0-0.nightly-2021-11-10-212548
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use
# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Available Updates:

VERSION                              IMAGE
4.10.0-0.nightly-2021-11-10-212548   registry.ci.openshift.org/ocp/release@sha256:b15acfa35c303c15148e1032774c91df0b38ea2b3efee4d8c408777d64467c70

// The above shows the cluster is still at 4.10.0-0.nightly-2021-11-09-181140, and 4.10.0-0.nightly-2021-11-10-212548 is still in the available list.

4. Go on to upgrade to 4.10.0-0.nightly-2021-11-10-212548 again.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
info: Cluster is already at version 4.10.0-0.nightly-2021-11-10-212548
(// This information is not correct; it hints at a wrong cluster version.)

5. Let's try to go back to 4.10.0-0.nightly-2021-11-09-181140.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-09-181140
error: The update 4.10.0-0.nightly-2021-11-09-181140 is not one of the available updates: 4.10.0-0.nightly-2021-11-10-212548

6. Then we can cancel the upgrade. (// Could we add the cancel procedure to the above error message, in case users don't know how to get out of this predicament?)

# ./oc adm upgrade --clear
Cleared the update field, still at 4.10.0-0.nightly-2021-11-10-212548
(// The version here is still not correct, and the message still confuses users, because the cancel actually happened and the cluster will go back to the original version.)
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          23m     Working towards 4.10.0-0.nightly-2021-11-09-181140: 363 of 756 done (48% complete)
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        False         4m12s   Cluster version is 4.10.0-0.nightly-2021-11-09-181140

Scenario 2 (corresponding to comment 25):

1. Install a v4.10 cluster.
2. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795 1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

3. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

4. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually (one way to do this with oc patch is shown at the end of this comment).

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

5. Wait for 10 minutes and check the deployment; it was not reconciled:
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795 1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

Expected behavior: during the blocked period, the CVO should continue to reconcile the old release target, since the cluster is still at the original version.
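For reference, one way to make the maxUnavailable change in scenario 2, step 4 (this is the same oc patch used in the later 4.11 verification below; the 50% value is only for this test):

# ./oc patch -n openshift-marketplace deployment/marketplace-operator --type=json -p '[{"op": "replace", "path": "/spec/strategy/rollingUpdate/maxUnavailable", "value": "50%"}]'
deployment.apps/marketplace-operator patched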
Checked the cluster launched by cluster-bot with: 4.10,openshift/cluster-version-operator#683

Scenario 2 - verified:

1. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:02:53.310469 1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

2. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                                               AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-0.test-2021-11-17-015750-ci-ln-pb6kg2k-latest   True        True          41s     Unable to apply registry.ci.openshift.org/ocp/release@sha256:1c905c1a02fd2f0ceb65c8e17dd2b6ed40e0b59d07707f8ba4c122b924068107: the image may not be safe to use

3. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

4. Wait for several minutes, then check that the deployment was reconciled.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:11:33.593901 1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

So the issue for scenario 2 should be fixed in PR #683 now.

As for the issue in scenario 1, I think the fix should come from oc. Hi Jack, would you mind if I file a new bug to track the oc issue for scenario 1, or would you like to keep this bug tracking both issues?
Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.

According to comment 28, marking the bug as "tested".
(In reply to liujia from comment #29)
> Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.
>
> According to comment 28, marking the bug as "tested".

Hi Jia Liu,

Yeah, opening the new bug is fine with me. I'm still working on the original issue but keep getting pulled to other tasks.
PR #683 was updated to WIP for further work, so the bug needs to be verified again. Removing the `Tested` label.
We're going to target this for 4.11.
*** Bug 1826115 has been marked as a duplicate of this bug. ***
Tried to verify it on 4.11.0-0.nightly-2022-02-24-054925 following comment 28, but it seems the CVO's behavior changed in v4.11, so I could not verify it with the v4.10 steps.

1. Upgrade the cluster to an unsigned payload.

# ./oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb

2. Check the upgrade status with `oc adm upgrade`, it returns nothing about the upgrade status. (a regression?)

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-24-054925   True        False         25m     Cluster version is 4.11.0-0.nightly-2022-02-24-054925
# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-02-24-054925

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-02-24-054925 not found in the "stable-4.11" channel

3. Check the ClusterVersion conditions in more detail: the image check failed and the initial upgrade is not in Progressing status.

- lastTransitionTime: "2022-03-02T04:51:42Z"
  message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
  reason: RetrievePayload
  status: "False"
  type: ReleaseAccepted
...
- lastTransitionTime: "2022-03-02T04:52:40Z"
  message: Cluster version is 4.11.0-0.nightly-2022-02-24-054925
  status: "False"
  type: Progressing

As above, the cluster does not seem to be in a blocked status for the precondition check.

@Jack Is this expected behavior in v4.11? If so, there will not be blocked status for the cluster. Once the precondition check fail, the update stop and back to reconcile status, right?

Anyway, let's continue with a regression test only, while the cluster is in reconciling status.

1. Patch maxUnavailable of the marketplace-operator deployment.

# ./oc patch -n openshift-marketplace deployment/marketplace-operator --type=json -p '[{"op": "replace", "path": "/spec/strategy/rollingUpdate/maxUnavailable", "value": "50%"}]'
deployment.apps/marketplace-operator patched
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
2. Wait for several minutes, then check that the resource is reconciled back to 25%.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-77479cd88b-pc455|grep 'Running sync for deployment.*openshift-marketplace'|tail -n5
I0302 06:45:04.369528 1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:49:32.957162 1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:54:05.395362 1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:58:33.889240 1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 07:03:06.329970 1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)

Reconciliation works well.
(In reply to liujia from comment #39)
> 2. Check the upgrade status with `oc adm upgrade`, it returns nothing about the upgrade status. (a regression?)
> ...
> - lastTransitionTime: "2022-03-02T04:51:42Z"
>   message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
>   reason: RetrievePayload
>   status: "False"
>   type: ReleaseAccepted

We should probably teach 'oc adm upgrade' to include the new ReleaseAccepted condition. I don't think that blocks us from verifying this CVO bug, though.

> Is this expected behavior in v4.11?

This is expected. Per comment 0 and comment 25, the issue this bug was aimed at was the CVO giving up on the current version once it began considering the new version. And your Deployment patch getting stomped shows that that bug has been fixed.

> If so, there will not be blocked status for the cluster. Once the precondition check fail, the update stop and back to reconcile status, right?

I'm having trouble parsing this. Can you rephrase?
> I'm having trouble parsing this. Can you rephrase?

Let me try to rephrase it in detail.

In v4.10, when trying to upgrade to an unsigned payload, the upgrade would be blocked on the precondition check (with PROGRESSING=True), and the issue was that the CVO gave up syncing the Deployment while it kept trying to reach the new version (PROGRESSING=True).

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

Now in this v4.11 verification, we again tried to upgrade to an unsigned payload, but the upgrade gives up directly (with PROGRESSING=False) due to the precondition check. I'm not sure we can still call this a blocked status, because it looks like the CVO also gives up on the new version and stays at the current version.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-24-054925   True        False         25m     Cluster version is 4.11.0-0.nightly-2022-02-24-054925

In this situation, I think it is the same as a normal cluster's reconciliation of the Deployment sync. So I wonder whether this verification looks good to you.
(In reply to liujia from comment #41)
> > I'm having trouble parsing this. Can you rephrase?
>
> Let me try to rephrase it in detail.
>
> In v4.10, when trying to upgrade to an unsigned payload, the upgrade would be blocked on the precondition check (with PROGRESSING=True), and the issue was that the CVO gave up syncing the Deployment while it kept trying to reach the new version (PROGRESSING=True).
>
> # ./oc get clusterversion
> NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use
>
> Now in this v4.11 verification, we again tried to upgrade to an unsigned payload, but the upgrade gives up directly (with PROGRESSING=False) due to the precondition check. I'm not sure we can still call this a blocked status, because it looks like the CVO also gives up on the new version and stays at the current version.
>
> # ./oc get clusterversion
> NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.11.0-0.nightly-2022-02-24-054925   True        False         25m     Cluster version is 4.11.0-0.nightly-2022-02-24-054925
>
> In this situation, I think it is the same as a normal cluster's reconciliation of the Deployment sync. So I wonder whether this verification looks good to you.

With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.
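Until `oc adm upgrade` learns to display it, a hedged way to inspect the new condition directly (just a jq filter over the ReleaseAccepted condition type shown in the verification output above; not an official interface):

$ oc get clusterversion version -o json | jq -r '.status.conditions[] | select(.type == "ReleaseAccepted") | .lastTransitionTime + " " + .status + " " + (.reason // "-") + ": " + (.message // "-")'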
> With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.

So we can say there is no longer any blocked update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?
(In reply to liujia from comment #43)
> > With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.
>
> So we can say there is no longer any blocked update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

Yes

> BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?

Created https://issues.redhat.com/browse/OTA-589
According to comment 39 and comment 44, moving the bug to VERIFIED.
Test case OCP-46017 added; removing the tag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069