Bug 1822752
| Field | Value |
| --- | --- |
| Summary | cluster-version operator stops applying manifests when blocked by a precondition check |
| Product | OpenShift Container Platform |
| Reporter | W. Trevor King <wking> |
| Component | Cluster Version Operator |
| Assignee | Jack Ottofaro <jack.ottofaro> |
| Status | CLOSED ERRATA |
| QA Contact | liujia <jiajliu> |
| Severity | low |
| Docs Contact | |
| Priority | low |
| Version | 4.3.z |
| CC | aos-bugs, bleanhar, jack.ottofaro, jdee, lmohanty, mzali, pmahajan, yanyang |
| Target Milestone | --- |
| Keywords | Upgrades |
| Target Release | 4.11.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | Previously, if the CVO encountered an error while attempting to load a new release for an upgrade, such as a release verification failure, it would stop reconciling the current release's manifests. With this change, release loading is separate from reconciling, so the latter is no longer blocked by the former. A new condition, ReleaseAccepted, has also been introduced to report the status of the release load. |
| Story Points | --- |
| Clone Of | |
| | 1822922 2064991 (view as bug list) |
| Environment | |
| Last Closed | 2022-08-10 10:35:34 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1822922, 2064991 |
Description (W. Trevor King, 2020-04-09 18:51:34 UTC)
Also, without dipping into .status.history, there is nothing in ClusterVersion to show that we're actually still running 4.3.10:

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 3
  name: version
  resourceVersion: "27378"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:28:19Z"
    message: 'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:27:49Z"
    message: 'Unable to apply 4.3.11: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  history:
  - completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 3
  versionHash: vSLGMQhseGg=

Ok, so there is an unforced-ish way out:

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade
error: Already upgrading, pass --allow-upgrade-with-warnings to override.
  Reason: UpgradePreconditionCheckFailed
  Message: Unable to apply 4.3.11: it may not be safe to apply this update

$ oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd --allow-explicit-upgrade --allow-upgrade-with-warnings
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd

$ oc get clusterversion -o yaml version
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2020-04-09T17:35:43Z"
  generation: 4
  name: version
  resourceVersion: "34618"
  selfLink: /apis/config.openshift.io/v1/clusterversions/version
  uid: 4ab0e9a9-bf0b-4ca7-b08a-51c803b5b1da
spec:
  channel: candidate-4.3
  clusterID: c42dd7e1-...
  desiredUpdate:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: ""
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
status:
  availableUpdates:
  - force: false
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    version: 4.3.11
  conditions:
  - lastTransitionTime: "2020-04-09T18:00:34Z"
    message: Done applying 4.3.10
    status: "True"
    type: Available
  - lastTransitionTime: "2020-04-09T18:55:16Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2020-04-09T18:55:34Z"
    message: Cluster version is 4.3.10
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-04-09T18:27:21Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2020-04-09T18:13:10Z"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'
    reason: DefaultSecurityContextConstraints_Mutated
    status: "False"
    type: Upgradeable
  desired:
    force: false
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    version: 4.3.10
  history:
  - completionTime: "2020-04-09T18:55:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T18:55:16Z"
    state: Completed
    verified: false
    version: 4.3.10
  - completionTime: "2020-04-09T18:55:16Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:ec07f30d2659d3e279b16055331fc9c3c0ba99f313e5026fddb5a7b2d54c6eb6
    startedTime: "2020-04-09T18:27:49Z"
    state: Partial
    verified: true
    version: 4.3.11
  - completionTime: "2020-04-09T18:00:34Z"
    image: registry.svc.ci.openshift.org/ocp/release@sha256:edb4364367cff4f751ffdc032bc830a469548f998127b523047a8dd518c472cd
    startedTime: "2020-04-09T17:35:48Z"
    state: Completed
    verified: false
    version: 4.3.10
  observedGeneration: 4
  versionHash: vSLGMQhseGg=

This should be backported to at least 4.3 when we fix this.

We also need this in 4.4.

We also need this backported to at least 4.3, and possibly as far back as we have supported 4.y.

Also, Lala and I are working through getting this implemented, so taking the assignee out of Abhinav's bucket.

Also in this space, acting to resolve a precondition issue (e.g. un-modifying SCCs as outlined in [1]) does seem to unstick the update (because the CVO is not polling the preconditions to see if they are still failing or not). Tested in 4.3.10 -> 4.3.11 by setting up the blocked update following comment 0 and then un-modifying the SCCs via [1] and then 'oc apply -f default-sccs.yaml'.
After that, the kube-apiserver ClusterOperator goes back to being happy quickly. The CVO takes a while to notice the change, but eventually does notice and begins the update:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-13T18:49:44Z RetrievedUpdates True
2020-04-13T19:11:59Z Available True Done applying 4.3.10
2020-04-13T19:15:44Z Progressing True Working towards 4.3.11: 24% complete
2020-04-13T19:19:10Z Failing False

Not sure what the delay is from the CVO side yet.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1821905#c22

Filling in some CVO implementation details:

* Synchronization happens via repeated syncOnce() attempts [1].
* syncOnce operates on work.Desired, performs validation and precondition checks, and then drops into parallel manifest application [2]. That makes it hard to address this issue in parallel while an existing release is being applied.

Looking into how Desired gets set:

* The operator's sync loop checks ClusterVersion looking for any changes to the spec [3].
* Update(...) is called with the desired update [4].
* Update cancels any previous workers [5].
* Update notifies the workers about new work [6].

I'm still a bit fuzzy on how the CVO deployment itself gets updated to run the new image. Seems related to CVOManifestDir vs. ReleaseManifestDir, but so far I haven't wrapped my head around the various layers of indirection.

[1]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L108-L126
[2]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L467-L543
[3]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/cvo.go#L465-L493
[4]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L204
[5]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L231
[6]: https://github.com/openshift/cluster-version-operator/blob/21c4c353ca47a5c9e82940c2599c3649d1b7cb02/pkg/cvo/sync_worker.go#L235

This bug is important and we want to fix it, but we will probably not have time to close it out this sprint. I'm adding UpcomingSprint now, and we'll revisit next sprint.

[1] is a baby step in this direction, but is still waiting on review; UpcomingSprint.

[1]: https://github.com/openshift/cluster-version-operator/pull/349

cvo#349 landed :). Will get the next step in this direction up next sprint.

Sprint is over. Assorted Context work that landed in this sprint (e.g. [1,2]) moves us in a good direction, but there is still more work to do.

[1]: https://github.com/openshift/cluster-version-operator/pull/410
[2]: https://github.com/openshift/cluster-version-operator/pull/420

Work consolidating goroutine handling continues in [1]. Once that lands we may be close enough to move on this particular bug. But sprint ends today, and this won't happen by then ;).

[1]: https://github.com/openshift/cluster-version-operator/pull/424

Lala's been working on this, but no PR up yet. Hopefully during the next sprint.

I'm still optimistic that we will get this this sprint, but it's not strictly a 4.6 GA blocker, so moving to 4.7.

Lala's been working on this, but no PR up yet. Hopefully during the next sprint.

Comment 16 is still current. Not a blocker.
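To make the ordering problem concrete, here is a small, self-contained Go sketch of the flow the implementation notes above describe, and of the reordering proposed in the next comment. Every name in it (worker, startWorker, verify) is hypothetical; it is a toy model of the cancel-before-verify behavior, not the actual CVO sync worker.

```go
// syncorder.go: a toy model of the ordering described in the CVO notes above.
// All names are hypothetical; this is not CVO code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type worker struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// startWorker repeatedly "applies manifests" for the given release until cancelled.
func startWorker(release string) *worker {
	ctx, cancel := context.WithCancel(context.Background())
	w := &worker{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(w.done)
		for {
			select {
			case <-ctx.Done():
				return
			case <-time.After(200 * time.Millisecond):
				fmt.Println("reconciling manifests for", release)
			}
		}
	}()
	return w
}

// verify stands in for signature verification / precondition checks.
func verify(release string) error {
	time.Sleep(time.Second)
	return errors.New("unable to locate a valid signature for " + release)
}

func main() {
	current := startWorker("4.3.10")

	// Pre-fix ordering (roughly what the notes above describe): the old
	// worker is cancelled before the new release is verified, so nothing
	// is reconciled while verification keeps failing.
	current.cancel()
	<-current.done
	if err := verify("4.3.11"); err != nil {
		fmt.Println("blocked:", err)
		// Nothing is running here: 4.3.10 is no longer reconciled and
		// 4.3.11 was never started. This is the reported bug.
	}

	// Post-fix ordering (what this bug asks for): verify first, and only
	// swap workers once the new release has been accepted.
	current = startWorker("4.3.10")
	if err := verify("4.3.12"); err != nil {
		fmt.Println("release not accepted, keep reconciling 4.3.10:", err)
	} else {
		current.cancel()
		<-current.done
		startWorker("4.3.12")
	}
	time.Sleep(500 * time.Millisecond) // let the still-running worker print a line
	current.cancel()
}
```

The point of the sketch is only the ordering: once verification is moved ahead of the cancel, a verification failure leaves the old worker running, which is what the eventual fix (and the ReleaseAccepted condition mentioned in the Doc Text) reports.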
Clearing again, because we want to make sure the bot restores the flag...

Bot added it back :). Setting to blocker- to show that it's not a blocker.

I don't think Lala ever got a PR up. Moving back to NEW until someone has time to work on this.

We have not seen many complaints about the issue, hence reducing the severity of the issue to medium. We may move this to an RFE if we don't address it soon.

Moving back to NEW, because we can't assign this to the team.

Steps to reproduce, in a cluster-bot 4.9 cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.7.9

Update to a CI build, which won't be trusted by a signature 4.7.9 trusts:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0

CVO complains about not being able to update:

$ oc adm upgrade
info: An upgrade is in progress. Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
...

Wait 10 minutes... Confirm that it's been a while:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-05-10T20:15:32Z Available=True -: Done applying 4.7.9
2021-05-10T20:47:18Z RetrievedUpdates=True -: -
2021-05-10T20:51:21Z Failing=True ImageVerificationFailed: The update cannot be verified: unable to locate a valid signature for one or more sources
2021-05-10T20:51:21Z Progressing=True ImageVerificationFailed: Unable to apply registry.ci.openshift.org/ocp/release@sha256:3bbe996c56a84a904d6d20da76078a39b44bb6fc13478545fe6e98e38c2144a0: the image may not be safe to use
$ date --utc --iso=m
2021-05-10T21:02+00:00

Confirm that we haven't attempted to reconcile any manifests in the interim:

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-747f97b7d4-n99jm   1/1     Running   1          73m
$ oc -n openshift-cluster-version logs cluster-version-operator-747f97b7d4-n99jm | grep 'Running sync for.* of ' | tail -n1
I0510 20:48:31.392910       1 sync_worker.go:769] Running sync for role "openshift-marketplace/openshift-marketplace-metrics" (516 of 668)

The issue is that we currently:

1. See the user bump spec.desiredUpdate.
2. Cancel the old sync loop [1].
3. Try to verify the new release [2] (so during this time we are no longer reconciling the old release target).

Instead we want:

1. See the user bump spec.desiredUpdate.
2. Try to verify the new release (so during this time we continue to reconcile the old release target).
3. Cancel the old sync loop and start reconciling the new release target.

[1]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L245
[2]: https://github.com/openshift/cluster-version-operator/blob/86db02a657e2101270873d625efab9c1490c6f25/pkg/cvo/sync_worker.go#L560

Reproduced the issues from the functional-test side on 4.10.0-0.nightly-2021-11-09-181140.

Scenario 1 (corresponding to the issue in the description):

1. Install a v4.10 cluster.
2. Patch cv with an internal upstream.
3. Pick an available version without a signature.
# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
Updating to 4.10.0-0.nightly-2021-11-10-212548

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.10
Available Updates:
VERSION                              IMAGE
4.10.0-0.nightly-2021-11-10-212548   registry.ci.openshift.org/ocp/release@sha256:b15acfa35c303c15148e1032774c91df0b38ea2b3efee4d8c408777d64467c70

// Above shows the cluster is still at 4.10.0-0.nightly-2021-11-09-181140 and 4.10.0-0.nightly-2021-11-10-212548 is still in the available list.

4. Go on to upgrade to 4.10.0-0.nightly-2021-11-10-212548.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-10-212548
info: Cluster is already at version 4.10.0-0.nightly-2021-11-10-212548
(// Here the information is not correct, which hints at a wrong cluster version.)

5. Let's go back to 4.10.0-0.nightly-2021-11-09-181140.

# ./oc adm upgrade --to 4.10.0-0.nightly-2021-11-09-181140
error: The update 4.10.0-0.nightly-2021-11-09-181140 is not one of the available updates: 4.10.0-0.nightly-2021-11-10-212548

6. Then we can cancel the upgrade. (// Could we mention the cancel option in the above message, in case users don't know how to get out of the predicament?)

# ./oc adm upgrade --clear
Cleared the update field, still at 4.10.0-0.nightly-2021-11-10-212548
(// The version is still not correct, and the message is still confusing, because the cancel actually happened and the cluster will go back to the original version.)

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          23m     Working towards 4.10.0-0.nightly-2021-11-09-181140: 363 of 756 done (48% complete)
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        False         4m12s   Cluster version is 4.10.0-0.nightly-2021-11-09-181140

Scenario 2 (corresponding to comment 25):

1. Install a v4.10 cluster.
2. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795       1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

3. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-09-181140   True        True          14m     Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use

4. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

5. Wait for 10 min and check that the deployment was not reconciled.
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
# ./oc logs cluster-version-operator-5d4fc6b786-wsc8x| grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1111 08:31:04.024795       1 sync_worker.go:753] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

Expected action: during the blocked time, the CVO continues to reconcile the old release target, since the cluster is still at the original version.

Checked a cluster launched by cluster-bot: 4.10,openshift/cluster-version-operator#683

Scenario 2 - verified:

1. Check the current marketplace-operator deployment.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:02:53.310469       1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

2. Upgrade it to an unsigned payload.

# ./oc get clusterversion
NAME      VERSION                                               AVAILABLE   PROGRESSING   SINCE   STATUS
version   0.0.1-0.test-2021-11-17-015750-ci-ln-pb6kg2k-latest   True        True          41s     Unable to apply registry.ci.openshift.org/ocp/release@sha256:1c905c1a02fd2f0ceb65c8e17dd2b6ed40e0b59d07707f8ba4c122b924068107: the image may not be safe to use

3. During the above blocked status, change maxUnavailable of the marketplace-operator deployment manually.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

4. Wait for several minutes and check that the deployment was reconciled.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc logs cluster-version-operator-6cd5c78c4b-gnnmg|grep 'Running sync for deployment.*openshift-marketplace'|tail -n1
I1117 03:11:33.593901       1 sync_worker.go:907] Running sync for deployment "openshift-marketplace/marketplace-operator" (584 of 756)

So the issue for scenario 2 should be fixed in pr683 now. As for the issue for scenario 1, I think the fix should come from oc. Hi Jack, would you mind if I file a new bug to track the oc issue for scenario 1, or would you like to keep this bug tracking both issues?

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.

According to comment 28, mark the bug as "tested".

(In reply to liujia from comment #29)
> Filed https://bugzilla.redhat.com/show_bug.cgi?id=2024398 to track the oc adm upgrade issue.
>
> According to comment 28, mark the bug as "tested".

Hi Jia Liu,

Yeah, opening the new bug is fine with me. I'm still working on the original issue but keep getting pulled to other tasks.

PR #683 was updated to WIP for further work, so the bug needs to be verified again. Remove `Tested` label.

We're going to target this for 4.11.

*** Bug 1826115 has been marked as a duplicate of this bug. ***

Tried to verify it on 4.11.0-0.nightly-2022-02-24-054925 following comment 28, but it seems the CVO's behavior changed in v4.11, so I could not verify it with the v4.10 steps.

1. Upgrade the cluster to an unsigned payload.
# ./oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb

2. Check the upgrade status with `oc adm upgrade`; it returns nothing about the upgrade status. (A regression?)

# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-24-054925   True        False         25m     Cluster version is 4.11.0-0.nightly-2022-02-24-054925
# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-02-24-054925

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-02-24-054925 not found in the "stable-4.11" channel

3. Check further in the cv conditions to find that the image check failed and the initial upgrade is not in Progressing status.

- lastTransitionTime: "2022-03-02T04:51:42Z"
  message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
  reason: RetrievePayload
  status: "False"
  type: ReleaseAccepted
...
- lastTransitionTime: "2022-03-02T04:52:40Z"
  message: Cluster version is 4.11.0-0.nightly-2022-02-24-054925
  status: "False"
  type: Progressing

As shown above, the cluster does not seem to be in a blocked status for the precondition check.

@Jack Is this expected behavior in v4.11? If so, there will not be a blocked status for the cluster; once the precondition check fails, the update stops and the CVO goes back to reconcile status, right?

Anyway, let's continue with a regression test while it's in reconcile status.

1. Patch maxUnavailable of the marketplace-operator deployment.

# ./oc patch -n openshift-marketplace deployment/marketplace-operator --type=json -p '[{"op": "replace", "path": "/spec/strategy/rollingUpdate/maxUnavailable", "value": "50%"}]'
deployment.apps/marketplace-operator patched
# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}
2. Wait for several minutes and check that the resource is reconciled back to 25%.

# ./oc -n openshift-marketplace get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-77479cd88b-pc455|grep 'Running sync for deployment.*openshift-marketplace'|tail -n5
I0302 06:45:04.369528       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:49:32.957162       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:54:05.395362       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 06:58:33.889240       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)
I0302 07:03:06.329970       1 sync_worker.go:824] Running sync for deployment "openshift-marketplace/marketplace-operator" (600 of 772)

Reconcile works well.

(In reply to liujia from comment #39)
> 2. Check the upgrade status with `oc adm upgrade`; it returns nothing about the upgrade status. (A regression?)
> ...
> - lastTransitionTime: "2022-03-02T04:51:42Z"
>   message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:3db6917025ee058bcdbe2a754b4ce702a8cde739d92c8735239f2757a32a4feb" failure=The update cannot be verified: unable to locate a valid signature for one or more sources'
>   reason: RetrievePayload
>   status: "False"
>   type: ReleaseAccepted

We should probably teach 'oc adm upgrade' to include the new ReleaseAccepted condition. I don't think that blocks us from verifying this CVO bug, though.

> Is this expected behavior in v4.11?

This is expected. Per comment 0 and comment 25, the issue this bug was aimed at was the CVO giving up on the current version once it began considering the new version. And your Deployment patch getting stomped shows that that bug has been fixed.

> If so, there will not be a blocked status for the cluster; once the precondition check fails, the update stops and the CVO goes back to reconcile status, right?

I'm having trouble parsing this. Can you rephrase?

> I'm having trouble parsing this. Can you rephrase?
Let me try to rephrase it in detail.
In v4.10, when trying to upgrade to an unsigned payload, the upgrade is blocked on the precondition check (with PROGRESSING=True), and the issue is that the CVO gives up syncing the Deployment while it keeps trying to reach the new version (PROGRESSING=True).
# ./oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2021-11-09-181140 True True 14m Unable to apply 4.10.0-0.nightly-2021-11-10-212548: the image may not be safe to use
Now in v4.11's verification, we still tried to upgrade to an unsigned payload, but the upgrade gives up directly (with PROGRESSING=False) due to the precondition check. I'm not sure we can still call it a blocked status, because it looks like the CVO also gives up on the new version and stays at the current version.
# ./oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-02-24-054925 True False 25m Cluster version is 4.11.0-0.nightly-2022-02-24-054925
In this situation, I think it's the same as a normal cluster's reconcile when doing the Deployment sync. So I wonder if this verification looks good to you.
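Since this exchange turns on how the new ReleaseAccepted condition surfaces (it is not yet shown by `oc adm upgrade`, per the follow-up below), here is a minimal Go sketch of reading it directly from the ClusterVersion object. This is illustrative only, not code from the CVO or oc; the import paths and client calls assume the published openshift/client-go module, and the kubeconfig location is an assumption.

```go
// A minimal, illustrative reader for the ReleaseAccepted condition.
// Not CVO or oc code; paths assume the published openshift/client-go module.
package main

import (
	"context"
	"fmt"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The ClusterVersion object is a singleton named "version".
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// ReleaseAccepted=False means the CVO could not load (retrieve/verify) the
	// release named in spec.desiredUpdate; reconciliation of the current
	// release continues regardless, which is the behavior verified above.
	for _, c := range cv.Status.Conditions {
		if c.Type == "ReleaseAccepted" {
			fmt.Printf("ReleaseAccepted=%s reason=%s message=%s\n", c.Status, c.Reason, c.Message)
			return
		}
	}
	fmt.Println("no ReleaseAccepted condition found (pre-4.11 cluster?)")
}
```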
(In reply to liujia from comment #41)
> > I'm having trouble parsing this. Can you rephrase?
> Let me try to rephrase it in detail.
> ...
> In this situation, I think it's the same as a normal cluster's reconcile when doing the Deployment sync. So I wonder if this verification looks good to you.

With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.

> With the new logic and the new ReleaseAccepted condition, the CVO never really attempts to do an upgrade, since it fails loading the desired release. So we only set the ReleaseAccepted condition to indicate that the desired release load failed. That's also why the desired release does not show up in the history.

So we can say there is not any blocked-update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?

(In reply to liujia from comment #43)
> So we can say there is not any blocked-update status, right? And the verification in comment 39 is based on a ReleaseAccepted=False condition. Is that OK to verify the issue?

Yes

> BTW, about the new ReleaseAccepted condition: we currently cannot get its status from `oc adm upgrade`. Is that already tracked somewhere?

Created https://issues.redhat.com/browse/OTA-589

According to comment 39 and comment 44, move the bug to verified.

Case ocp-46017 added; remove tag.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069