Bug 2090150 - setting overrides while ReleaseAccepted=False does not stop unmanaged resource reconcile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.z
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On: 2080429
Blocks:
Reported: 2022-05-25 09:04 UTC by liujia
Modified: 2022-08-11 12:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-13 14:38:56 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 782 | 0 | None | open | Bug 2090150: pkg/cvo/sync_worker.go: Save overrides | 2022-05-25 16:54:03 UTC
Red Hat Product Errata | RHBA-2022:4944 | 0 | None | None | None | 2022-06-13 14:39:07 UTC

Description liujia 2022-05-25 09:04:46 UTC
Description of problem:
Due to the change in https://bugzilla.redhat.com/show_bug.cgi?id=2064991, when an upgrade to an unsigned payload is triggered, the CVO never actually attempts the upgrade because it fails to load the desired release. As a result, users are no longer blocked from triggering a new upgrade even if they do not run `oc adm upgrade --clear` first (before that change, they were blocked).

However, if spec.overrides is set in this situation, the CVO does not stop reconciling the unmanaged resource as expected.

Version-Release number of the following components:
4.10.0-0.nightly-2022-05-24-195211
The issue does not reproduce on 4.11.0-0.nightly-2022-05-20-213928.

How reproducible:
always

Steps to Reproduce:
1. Install a 4.10.0-0.nightly-2022-05-24-195211 cluster and trigger an upgrade to an unsigned v4.11 payload (no upgrade happens, as expected).

2. While the cluster is in this ReleaseAccepted=False condition, set spec.overrides as before.
# ./oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps"}]}}'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion -ojson|jq .items[].spec.overrides
[
  {
    "group": "apps",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]

# ./oc adm upgrade
Cluster version is 4.10.0-0.nightly-2022-05-24-195211

Upgradeable=False

  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:181f0d8e34498e1ab875ed209076437e7d692038bd8c618476eced3c7f34d65c" failure=The update cannot be verified: unable to locate a valid signature for one or more sources
...

According to the output above, deployment/network-operator should now be outside the CVO's control.

3. Check that the CVO still syncs deployment/network-operator, along with the managed resource deployment/machine-api-operator (unexpected).
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:00:48.414120       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:04:05.766488       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)


# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-machine-api/machine-api-operator\""|tail -n2
I0525 08:01:16.673722       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
I0525 08:04:34.079841       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
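
To compare how often the CVO reconciles each resource, the gap between the last two "Done syncing" entries can be computed from the log timestamps. The following is a minimal sketch and not part of the original report: the function name `sync_gap_seconds` is hypothetical, the log is expected on stdin, and both entries are assumed to fall on the same day.

```shell
# Hypothetical helper: print the number of seconds between the last two
# "Done syncing" log entries for a resource, e.g.
#   'deployment "openshift-network-operator/network-operator"'
# Assumes klog-style lines (Immdd hh:mm:ss.uuuuuu ...) from the same day.
sync_gap_seconds() {
  grep -F "Done syncing for $1" | tail -n 2 | awk '
    {
      split($2, t, ":")                 # $2 is hh:mm:ss.uuuuuu
      secs[NR] = t[1] * 3600 + t[2] * 60 + int(t[3])
    }
    END {
      if (NR < 2) exit 1                # two entries are needed to measure a gap
      print secs[2] - secs[1]
    }'
}
```

For example, piping the pod log into `sync_gap_seconds 'deployment "openshift-network-operator/network-operator"'` yields 197 for the two entries shown above, and 198 for the machine-api-operator entries, so the resource marked unmanaged is still being synced on the same cycle as a managed one.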

4. Update deployment/network-operator to check further.
Before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

Edit maxUnavailable to 50%
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

Wait several minutes (<5 mins), then check that the CVO has restored the original value (unexpected).
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:10:40.616986       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
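
The manual "edit, wait, re-check" sequence in step 4 can be scripted as a polling loop. This is only a sketch under stated assumptions: `wait_for_revert` is a hypothetical helper, not an `oc` feature, and it simply re-runs whatever command it is given until the output differs from the value that was set or the timeout expires. The interval must be a positive number of seconds.

```shell
# Hypothetical helper: poll a command until its output differs from the
# value we set, or until the timeout expires.
# Usage: wait_for_revert <value-we-set> <timeout-secs> <interval-secs> <command...>
# Returns 0 and prints the new value if it changed, 1 if the value held
# for the whole timeout, 2 if the command itself failed.
wait_for_revert() {
  expected=$1 timeout=$2 interval=$3
  shift 3
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    current=$("$@") || return 2
    if [ "$current" != "$expected" ]; then
      echo "$current"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

A possible invocation, using the standard `-o jsonpath` output option, is `wait_for_revert "50%" 600 30 oc -n openshift-network-operator get deployment network-operator -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}'`. On an affected build the call would be expected to return 0 once the CVO restores the old value; if the override is honored, it should return 1 after the timeout.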


Actual results:
Setting overrides does not stop the CVO from reconciling the unmanaged resource.

Expected results:
The CVO should stop syncing the unmanaged resource.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 liujia 2022-05-25 09:15:47 UTC
Seems related with https://bugzilla.redhat.com/show_bug.cgi?id=2080429, maybe we need a backport?

Comment 3 Jack Ottofaro 2022-05-25 15:44:53 UTC
(In reply to liujia from comment #2)
> Seems related with https://bugzilla.redhat.com/show_bug.cgi?id=2080429,
> maybe we need a backport?

I believe you are correct; a backport of the relevant piece of https://github.com/openshift/cluster-version-operator/pull/770 is needed.

Comment 6 liujia 2022-06-08 04:45:23 UTC
Verified on 4.10.0-0.nightly-2022-06-07-181847

The following was checked after setting spec.overrides while the cluster was in ReleaseAccepted=False status.

Before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": 1
}

Update the deployment, setting maxUnavailable to 50%:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}

Wait several minutes (around 10 mins), then check that the CVO does not sync the unmanaged resource.
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}

Comment 9 errata-xmlrpc 2022-06-13 14:38:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4944

