Bug 2090150

Summary: setting overrides while ReleaseAccepted=False does not stop unmanaged resource reconcile
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.10
Target Release: 4.10.z
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: liujia <jiajliu>
Assignee: Jack Ottofaro <jack.ottofaro>
QA Contact: liujia <jiajliu>
CC: aos-team-ota, jack.ottofaro
Type: Bug
Last Closed: 2022-06-13 14:38:56 UTC
Bug Depends On: 2080429

Description liujia 2022-05-25 09:04:46 UTC
Description of problem:
Due to the change in https://bugzilla.redhat.com/show_bug.cgi?id=2064991, when an upgrade to an unsigned payload is triggered, the CVO never actually attempts the upgrade because it fails to load the desired release. As a result, users are no longer blocked from triggering a new upgrade even if they do not run `oc adm upgrade --clear` (before that change, they were blocked).

However, if spec.overrides is set in this situation, the CVO does not stop reconciling the unmanaged resource as expected.
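
The expected behaviour can be pictured as a filter applied before each sync pass: any manifest that matches an override with unmanaged=true is skipped. The sketch below is illustrative shell only, not the CVO's actual Go implementation. The group/kind/namespace/name key format, the function name managed_only, and the overrides file are assumptions made for this example.

```shell
# Hypothetical key format, one per line: group/kind/namespace/name
# Reads manifest keys on stdin and prints only those NOT listed in the
# overrides file, i.e. the resources that should still be reconciled.
managed_only() {
  overrides_file=$1
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # "|| true" keeps the exit status 0 when every manifest is overridden.
  grep -vxF -f "$overrides_file" || true
}
```

With an overrides file containing apps/Deployment/openshift-network-operator/network-operator, the network-operator deployment would be dropped from the list while machine-api-operator would remain.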

Version-Release number of the following components:
4.10.0-0.nightly-2022-05-24-195211.
The issue does not reproduce on 4.11.0-0.nightly-2022-05-20-213928.

How reproducible:
always

Steps to Reproduce:
1. Install a 4.10.0-0.nightly-2022-05-24-195211 cluster and trigger an upgrade to an unsigned v4.11 payload (no upgrade happens, as expected).

2. While the cluster is in this ReleaseAccepted=False condition, set spec.overrides in the usual way:
# ./oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps"}]}}'
clusterversion.config.openshift.io/version patched

# ./oc get clusterversion -ojson|jq .items[].spec.overrides
[
  {
    "group": "apps",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]

# ./oc adm upgrade
Cluster version is 4.10.0-0.nightly-2022-05-24-195211

Upgradeable=False

  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:181f0d8e34498e1ab875ed209076437e7d692038bd8c618476eced3c7f34d65c" failure=The update cannot be verified: unable to locate a valid signature for one or more sources
...

According to the above, deployment/network-operator should now be out of the CVO's control.

3. Check that the CVO still syncs deployment/network-operator alongside the managed resource deployment/machine-api-operator (unexpected).
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:00:48.414120       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:04:05.766488       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)


# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-machine-api/machine-api-operator\""|tail -n2
I0525 08:01:16.673722       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)
I0525 08:04:34.079841       1 sync_worker.go:840] Done syncing for deployment "openshift-machine-api/machine-api-operator" (204 of 771)

4. Update deployment/network-operator to check further.
Before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)

Edit maxUnavailable to 50%:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "50%"
}

Wait several minutes (<5 mins) and check that the CVO restores the original value (unexpected).
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": "25%",
  "maxUnavailable": "25%"
}
# ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc|grep "Done syncing for deployment \"openshift-network-operator/network-operator\""|tail -n2
I0525 08:07:23.225479       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
I0525 08:10:40.616986       1 sync_worker.go:840] Done syncing for deployment "openshift-network-operator/network-operator" (620 of 771)
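
The log checks in steps 3 and 4 can be condensed into a small helper that reports how many seconds passed between the last two syncs of a resource. This is a minimal sketch: the function name sync_interval_seconds is hypothetical, and it assumes the klog line layout shown above (time of day in the second field) and that both lines fall on the same day.

```shell
# Reads CVO log lines on stdin and prints the whole seconds between the
# last two "Done syncing" entries for the given namespace/name.
# Assumption: field 2 is HH:MM:SS.micros and no midnight rollover occurs.
sync_interval_seconds() {
  resource=$1
  grep -F "Done syncing for deployment \"$resource\"" |
    tail -n 2 |
    awk '{
      split($2, t, ":")                          # t[1]=HH, t[2]=MM, t[3]=SS.micros
      secs[NR] = t[1] * 3600 + t[2] * 60 + int(t[3])
    }
    END {
      if (NR < 2) exit 1                         # fewer than two syncs logged
      print secs[2] - secs[1]
    }'
}

# Example (pod name taken from this report):
#   ./oc -n openshift-cluster-version logs cluster-version-operator-744c695494-whhdc \
#     | sync_interval_seconds openshift-network-operator/network-operator
```

For the two network-operator lines in step 3 (08:00:48 and 08:04:05) this prints 197, which fits the observation that the edited value was restored in under 5 minutes.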


Actual results:
Setting overrides does not stop reconciliation of the unmanaged resource.

Expected results:
The CVO should stop syncing the unmanaged resource.


Comment 2 liujia 2022-05-25 09:15:47 UTC
Seems related to https://bugzilla.redhat.com/show_bug.cgi?id=2080429, maybe we need a backport?

Comment 3 Jack Ottofaro 2022-05-25 15:44:53 UTC
(In reply to liujia from comment #2)
> Seems related with https://bugzilla.redhat.com/show_bug.cgi?id=2080429,
> maybe we need a backport?

I believe you are correct; a backport of the relevant piece of https://github.com/openshift/cluster-version-operator/pull/770 is needed.

Comment 6 liujia 2022-06-08 04:45:23 UTC
Verified on 4.10.0-0.nightly-2022-06-07-181847

Checked after setting spec.overrides while the cluster was in ReleaseAccepted=False status.

before the update:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": 1
}

Update the deployment, setting maxUnavailable to 50%:
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}

Wait several minutes (around 10 mins) and check that the CVO does not sync the unmanaged resource.
# ./oc -n openshift-network-operator get deployment -ojson|jq .items[].spec.strategy.rollingUpdate
{
  "maxSurge": 0,
  "maxUnavailable": "50%"
}
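
The manual wait-and-recheck above could be automated with a small polling loop. The following is a sketch under stated assumptions: value_holds and get_max_unavailable are hypothetical names, and the 20 x 30 s schedule is just one way to cover roughly 10 minutes.

```shell
# Polls a command and reports whether its output stays at the expected value.
# Usage: value_holds EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 if the value held for every poll, 1 if it changed,
# 2 if the command itself failed.
value_holds() {
  expected=$1; attempts=$2; delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    actual=$("$@") || return 2
    if [ "$actual" != "$expected" ]; then
      echo "reverted to: $actual"
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still: $expected"
  return 0
}

# Example (assumed wrapper around the deployment from this report):
#   get_max_unavailable() {
#     ./oc -n openshift-network-operator get deployment network-operator \
#       -o jsonpath='{.spec.strategy.rollingUpdate.maxUnavailable}'
#   }
#   value_holds "50%" 20 30 get_max_unavailable
```

On a fixed build the loop would be expected to finish with "still: 50%"; on an affected build it would report the reverted value as soon as the CVO restores it.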

Comment 9 errata-xmlrpc 2022-06-13 14:38:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4944