Description of problem:

Due to the change in https://bugzilla.redhat.com/show_bug.cgi?id=2064991, when an upgrade to an unsigned payload is triggered, the CVO never actually attempts the upgrade because it fails to load the desired release. As a result, users are no longer blocked from triggering a new upgrade even if they do not run `oc adm upgrade --clear` (before that change, they were). However, if spec.overrides is set in this situation, the CVO does not stop reconciling the unmanaged resource as expected.

Version-Release number of the following components:
4.10.0-0.nightly-2022-05-24-195211 (not reproducible on 4.11.0-0.nightly-2022-05-20-213928)

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.10.0-0.nightly-2022-05-24-195211 cluster and trigger an upgrade to a 4.11 unsigned payload (no upgrade happens, as expected).

2. While the cluster is in the ReleaseAccepted=False condition, set spec.overrides as usual:

# ./oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps"}]}}'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion -ojson|jq .items[].spec.overrides
[
  {
    "group": "apps",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]

# ./oc adm upgrade
Cluster version is 4.10.0-0.nightly-2022-05-24-195211

Upgradeable=False
  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

ReleaseAccepted=False
  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:181f0d8e34498e1ab875ed209076437e7d692038bd8c618476eced3c7f34d65c" failure=The update cannot be verified: unable to locate a valid signature for one or more sources
...

According to the output above, deployment/network-operator should now be outside the CVO's control.

3. Check that the CVO still syncs deployment/network-operator along with the managed resource deployment/machine-api-operator (unexpected):

# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:00:48.414120 1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:04:05.766488 1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-machine-api/machine-api-operator\""|tail -n2
I0525 08:01:16.673722 1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
I0525 08:04:34.079841 1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
4. Update deployment/network-operator to check it further.

Before the update:

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
I0525 08:07:23.225479 1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

Edit maxUnavailable to 50%:

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

Wait several minutes (<5 min) and check that the CVO restores the value (unexpected):

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}

# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:07:23.225479 1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:10:40.616986 1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

Actual results:
Setting overrides does not stop the CVO from reconciling the unmanaged resource.

Expected results:
The CVO should stop syncing the unmanaged resource.
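For context, the expected behaviour is that any manifest matching a spec.overrides entry with unmanaged: true is excluded from the CVO's reconcile loop. The Go snippet below is only a minimal illustration of that matching rule, written against the fields shown in the spec.overrides JSON above; the Override type and isUnmanaged helper are hypothetical names, not the actual cluster-version-operator code.

package main

import "fmt"

// Override mirrors the fields used in spec.overrides above
// (group, kind, namespace, name, unmanaged); the type name is illustrative.
type Override struct {
	Group     string
	Kind      string
	Namespace string
	Name      string
	Unmanaged bool
}

// isUnmanaged reports whether the manifest identified by group/kind/namespace/name
// matches an override entry with Unmanaged set, i.e. whether the sync loop should skip it.
func isUnmanaged(overrides []Override, group, kind, namespace, name string) bool {
	for _, o := range overrides {
		if o.Unmanaged && o.Group == group && o.Kind == kind &&
			o.Namespace == namespace && o.Name == name {
			return true
		}
	}
	return false
}

func main() {
	overrides := []Override{{
		Group:     "apps",
		Kind:      "Deployment",
		Namespace: "openshift-network-operator",
		Name:      "network-operator",
		Unmanaged: true,
	}}

	// With the override in place, the network-operator Deployment should be skipped,
	// while the machine-api-operator Deployment remains managed.
	fmt.Println(isUnmanaged(overrides, "apps", "Deployment",
		"openshift-network-operator", "network-operator")) // true  -> skip sync
	fmt.Println(isUnmanaged(overrides, "apps", "Deployment",
		"openshift-machine-api", "machine-api-operator")) // false -> keep syncing
}

In the failure mode reported above, the CVO behaves as if this kind of check is never applied to the network-operator Deployment while the cluster sits in ReleaseAccepted=False.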
Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=2080429; maybe we need a backport?
(In reply to liujia from comment #2)
> Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=2080429;
> maybe we need a backport?

I believe you are correct; a backport of the relevant piece of https://github.com/openshift/cluster-version-operator/pull/770 is needed.
Verified on 4.10.0-0.nightly-2022-06-07-181847.

After setting spec.overrides while the cluster is in the ReleaseAccepted=False condition:

Before the update:

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": 1
}

Update the deployment with maxUnavailable set to 50%:

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}

Wait several minutes (around 10 min) and check that the CVO does not sync the unmanaged resource:

# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:4944