2090150 – setting overrides while ReleaseAccepted=False does not stop unmanaged resource reconcile

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Jack Ottofaro
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:	2080429
Blocks:

Reported:	2022-05-25 09:04 UTC by liujia
Modified:	2022-08-11 12:37 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-06-13 14:38:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 782	0	None	open	Bug 2090150: pkg/cvo/sync_worker.go: Save overrides	2022-05-25 16:54:03 UTC
Red Hat Product Errata	RHBA-2022:4944	0	None	None	None	2022-06-13 14:39:07 UTC

Description liujia 2022-05-25 09:04:46 UTC

Description of problem:
Due to the change in https://bugzilla.redhat.com/show_bug.cgi?id=2064991, when an upgrade to an unsigned payload is triggered, the CVO never actually attempts the upgrade, because it fails to load the desired release. As a result, users are no longer blocked from triggering a new upgrade even if they do not run `oc adm upgrade --clear` (before that change, they were blocked).

However, if spec.overrides is set in this situation, the CVO does not stop reconciling the unmanaged resource as expected.

Version-Release number of the following components:
4.10.0-0.nightly-2022-05-24-195211.
The problem does not reproduce on 4.11.0-0.nightly-2022-05-20-213928.

How reproducible:
always

Steps to Reproduce:
1. Install cluster 4.10.0-0.nightly-2022-05-24-195211 and trigger an upgrade to a v4.11 unsigned payload (no upgrade will happen, as expected).

2. While the cluster is in this ReleaseAccepted=False condition, set spec.overrides as before.
# ./oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps"}]}}'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion -ojson|jq .items[].spec.overrides
[
  {
    "group": "apps",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]

# ./oc adm upgrade
Cluster version is 4.10.0-0.nightly-2022-05-24-195211

Upgradeable=False

  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:181f0d8e34498e1ab875ed209076437e7d692038bd8c618476eced3c7f34d65c" failure=The update cannot be verified: unable to locate a valid signature for one or more sources
...

According to the output above, deployment/network-operator should no longer be under the CVO's control.

3. Check that the CVO still syncs deployment/network-operator along with the managed resource deployment/machine-api-operator (unexpected).
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:00:48.414120       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:04:05.766488       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)


# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-machine-api/machine-api-operator\""|tail -n2
I0525 08:01:16.673722       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
I0525 08:04:34.079841       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)

4. Try updating deployment/network-operator to check this further.
Before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

Edit maxUnavailable to 50%:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

Wait several minutes (<5 min), then check that the CVO has restored the original value (unexpected).
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:10:40.616986       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
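
The log checks in steps 3 and 4 can be condensed into a small helper that reports the gap, in seconds, between consecutive "Done syncing" entries for one deployment. This is only a sketch: `sync_intervals` is a hypothetical name, it assumes the log line layout shown above (time of day in the second field), and it does not handle logs that cross midnight.

```shell
# sync_intervals NAMESPACE/NAME
# Hypothetical helper (not part of oc or the CVO): reads CVO log lines on
# stdin and prints the seconds between consecutive "Done syncing" entries
# for the given deployment. Assumes the second field is HH:MM:SS.micros.
sync_intervals() {
  awk -v res="$1" '
    index($0, "Done syncing for deployment \"" res "\"") {
      split($2, t, /[:.]/)                    # t[1]=HH, t[2]=MM, t[3]=SS
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (seen) print now - prev              # gap since the previous sync
      prev = now
      seen = 1
    }'
}

# Example, reusing the command from the steps above:
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc \
#   | sync_intervals openshift-network-operator/network-operator
```

For the two network-operator lines quoted above this prints 197, meaning the overridden deployment is still being reconciled a little over every three minutes. Empty output over a long enough log window would suggest the resource is no longer synced.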


Actual results:
Setting overrides does not stop reconciliation of the unmanaged resource.

Expected results:
The CVO should stop syncing the unmanaged resource.


Comment 2 liujia 2022-05-25 09:15:47 UTC

This seems related to https://bugzilla.redhat.com/show_bug.cgi?id=2080429; maybe we need a backport?

Comment 3 Jack Ottofaro 2022-05-25 15:44:53 UTC

(In reply to liujia from comment #2)
> Seems related with https://bugzilla.redhat.com/show_bug.cgi?id=2080429,
> maybe we need a backport?

I believe you are correct; a backport of the relevant piece of https://github.com/openshift/cluster-version-operator/pull/770 is needed.

Comment 6 liujia 2022-06-08 04:45:23 UTC

Verified on 4.10.0-0.nightly-2022-06-07-181847

Set spec.overrides while the cluster is in the ReleaseAccepted=False state, then checked the deployment as follows.

Before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": 1
}

Update the deployment, setting maxUnavailable to 50%:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}

Wait several minutes (around 10 min), then check that the CVO has not synced the unmanaged resource.
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}
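
The manual "edit, wait, re-check" verification above could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used for verification: `watch_for_revert` is a hypothetical helper, and the `jq` path in the usage comment extends the one shown above to select only `maxUnavailable`.

```shell
# watch_for_revert EDITED TRIES DELAY CMD [ARGS...]
# Hypothetical helper: runs CMD up to TRIES times, DELAY seconds apart.
# Returns 0 as soon as CMD's output differs from EDITED (the value was
# changed back), or 1 if the edited value survived every check.
watch_for_revert() {
  edited=$1
  tries=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    current=$("$@")
    if [ "$current" != "$edited" ]; then
      echo "reverted to: $current"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still: $edited"
  return 1
}

# Example (assumed invocation): poll every 30 s for about 10 minutes.
# watch_for_revert '"50%"' 20 30 sh -c \
#   './oc -n openshift-network-operator get deployment -ojson | jq .items[].spec.strategy.rollingUpdate.maxUnavailable'
```

With the fix in place the helper would be expected to report `still: "50%"` and return 1. On the affected build from the original report it should report a revert within the first few minutes.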

Comment 9 errata-xmlrpc 2022-06-13 14:38:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4944
