Bug 2037168

Summary: IBM-specific Deployment manifest for package-server-manager should be excluded on non-IBM cluster-profiles
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: OLM Assignee: Kevin Rizza <krizza>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium    
Version: 4.9   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-10 16:37:09 UTC Type: Bug

Description W. Trevor King 2022-01-05 06:49:04 UTC
From [1]:

  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    include.release.openshift.io/ibm-cloud-managed: "true"

You want ibm-cloud-managed in that IBM-specific manifest, but you don't want the other two, because they're covered by the sibling, non-IBM manifest [2].  You should at least drop self-managed-high-availability from the IBM-specific manifest, so that the cluster-version operator (CVO) on self-managed-high-availability clusters does not try to reconcile both the IBM-specific and non-IBM manifests for the same Deployment.

Depending on how much you want to clean up, you can also drop the unused single-node-developer profile across the board; see [3].
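
As a rough illustration of the overlap described above, here is a minimal Python sketch (not CVO code). It assumes the CVO selects a manifest for a cluster profile when the manifest carries `include.release.openshift.io/<profile>: "true"`. The helper names `profiles` and `overlapping_profiles` are hypothetical, and the `after` data models the proposed cleanup rather than any merged change.

```python
PREFIX = "include.release.openshift.io/"


def profiles(annotations):
    """Return the set of cluster profiles a manifest opts into."""
    return {
        key[len(PREFIX):]
        for key, value in annotations.items()
        if key.startswith(PREFIX) and value == "true"
    }


def overlapping_profiles(manifests):
    """Return profiles claimed by more than one manifest of the same resource.

    `manifests` maps a manifest filename to its metadata.annotations dict.
    """
    claimed = {}
    for name, annotations in manifests.items():
        for profile in profiles(annotations):
            claimed.setdefault(profile, []).append(name)
    return {p: sorted(names) for p, names in claimed.items() if len(names) > 1}


NON_IBM = "0000_50_olm_06-psm-operator.deployment.yaml"
IBM = "0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml"

# Annotations as quoted in this bug for the two package-server-manager manifests.
before = {
    NON_IBM: {
        PREFIX + "self-managed-high-availability": "true",
        PREFIX + "single-node-developer": "true",
    },
    IBM: {
        PREFIX + "self-managed-high-availability": "true",
        PREFIX + "single-node-developer": "true",
        PREFIX + "ibm-cloud-managed": "true",
    },
}

# Proposed state (assumption): the IBM manifest keeps only its own profile.
after = dict(before)
after[IBM] = {PREFIX + "ibm-cloud-managed": "true"}

if __name__ == "__main__":
    print("before:", overlapping_profiles(before))
    print("after: ", overlapping_profiles(after))
```

With the `before` data, both non-IBM profiles are claimed by both manifests; with the `after` data, no profile is claimed twice.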

Seems like this affects 4.9 too, and a backport is probably worth the trouble:

$ git grep include.release.openshift.io/self-managed-high-availability origin/release-4.9 -- manifests/ | grep ibm
origin/release-4.9:manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml:    include.release.openshift.io/self-managed-high-availability: "true"

The backport would avoid the CVO flapping the nodeSelector, which differs between the two manifests:

$ git checkout origin/release-4.9
$ git --no-pager log -1 --oneline
5fc4c78bb (HEAD, origin/release-4.9) Merge pull request #215 from dinhxuanvu/upgrade-delay-4.9
$ diff -u manifests/0000_50_olm_06-psm-operator.deployment.yaml manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
--- manifests/0000_50_olm_06-psm-operator.deployment.yaml       2022-01-04 22:34:58.219169459 -0800
+++ manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml     2022-01-04 22:34:58.219169459 -0800
@@ -8,6 +8,7 @@
   annotations:
     include.release.openshift.io/self-managed-high-availability: "true"
     include.release.openshift.io/single-node-developer: "true"
+    include.release.openshift.io/ibm-cloud-managed: "true"
 spec:
   strategy:
     type: RollingUpdate
@@ -64,7 +65,6 @@
           terminationMessagePolicy: FallbackToLogsOnError
       nodeSelector:
         kubernetes.io/os: linux
-        node-role.kubernetes.io/master: ""
       tolerations:
         - effect: NoSchedule
           key: node-role.kubernetes.io/master

Poking at recent 4.9 CI [4]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-6f8b969579-q8dx4_cluster-version-operator.log | grep 'Running sync.*in state\|openshift-operator-lifecycle-manager/package-server-manager' | tail
I0104 07:12:57.829476       1 sync_worker.go:542] Running sync 4.9.0-0.nightly-2022-01-04-060802 (force=false) on generation 2 in state Reconciling at attempt 0
I0104 07:13:25.186757       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:13:25.286909       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:13:25.286941       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:13:25.384516       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:16:16.647386       1 sync_worker.go:542] Running sync 4.9.0-0.nightly-2022-01-04-060802 (force=false) on generation 2 in state Reconciling at attempt 0
I0104 07:16:44.002400       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:16:44.102762       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:16:44.102795       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:16:44.204445       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)

So you're not currently getting CVO contention, because our nodeSelector merge strategy is "require the cluster to contain everything in the manifest, but do not remove unrecognized entries" [5].  Still, assuming that the 4.9 CVO will never become stricter about nodeSelector reconciliation is brittle, and asking the CVO to reconcile the same Deployment twice in each sync cycle (entries 547 and 548 of 737 above) is inefficient.
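
To make the difference concrete, here is a minimal Python sketch of the two strategies. `ensure_node_selector` paraphrases the quoted merge behavior; it is not a port of the Go code in [5]. `replace_node_selector` is a hypothetical stricter strategy, included only to show what flapping would look like.

```python
def ensure_node_selector(existing, required):
    """Additive merge: require every manifest entry, keep unrecognized ones.

    Returns (merged, modified).
    """
    merged = dict(existing)
    modified = False
    for key, value in required.items():
        if key not in merged or merged[key] != value:
            merged[key] = value
            modified = True
    return merged, modified


def replace_node_selector(existing, required):
    """Hypothetical strict strategy: the manifest fully replaces the selector."""
    return dict(required), existing != required


# nodeSelectors from the two manifests in the diff above.
non_ibm = {"kubernetes.io/os": "linux", "node-role.kubernetes.io/master": ""}
ibm = {"kubernetes.io/os": "linux"}


def count_writes(strategy, cycles=3):
    """Apply both manifests once per cycle, starting from the non-IBM state."""
    cluster = dict(non_ibm)
    writes = 0
    for _ in range(cycles):
        for manifest in (non_ibm, ibm):
            cluster, modified = strategy(cluster, manifest)
            writes += int(modified)
    return writes


if __name__ == "__main__":
    print("additive merge writes:", count_writes(ensure_node_selector))
    print("strict replace writes:", count_writes(replace_node_selector))
```

Under the additive merge the IBM manifest never removes the master selector, so neither manifest triggers a write. Under the strict variant each manifest would undo the other on every pass.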

[1]: https://github.com/openshift/operator-framework-olm/blame/ca5d761a86bd1556b7bea1250fcd7a02f2fff337/manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml#L9-L10
[2]: https://github.com/openshift/operator-framework-olm/blob/ca5d761a86bd1556b7bea1250fcd7a02f2fff337/manifests/0000_50_olm_06-psm-operator.deployment.yaml#L9-L10
[3]: https://github.com/openshift/cluster-version-operator/pull/685
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408
[5]: https://github.com/openshift/cluster-version-operator/blob/a14f4e2b87e04d6b81aaa55890be088281f5a550/lib/resourcemerge/core.go#L50

Comment 3 Jian Zhang 2022-01-11 08:00:59 UTC
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-144202   True        False         8h      Cluster version is 4.10.0-0.nightly-2022-01-10-144202

The `single-node-developer` annotation has been removed from PSM, and the Deployment now carries only the `self-managed-high-availability` profile annotation, as follows:

[cloud-user@preserve-olm-env jian]$ oc get deployment package-server-manager -o=jsonpath='{.metadata.annotations}'
{"deployment.kubernetes.io/revision":"1","include.release.openshift.io/self-managed-high-availability":"true"}

[cloud-user@preserve-olm-env jian]$ oc get deployment packageserver -o=jsonpath='{.metadata.annotations}'
{"deployment.kubernetes.io/revision":"1"}
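
The annotation check above can also be scripted. This is a small Python sketch that assumes the jsonpath output is valid JSON, as it is in the two outputs pasted here; the function name `profile_annotations` is made up for this example.

```python
import json

PREFIX = "include.release.openshift.io/"


def profile_annotations(raw):
    """Return the sorted cluster-profile names found in an annotations JSON blob."""
    annotations = json.loads(raw)
    return sorted(key[len(PREFIX):] for key in annotations if key.startswith(PREFIX))


# Outputs copied from the two `oc get deployment` commands above.
psm = '{"deployment.kubernetes.io/revision":"1","include.release.openshift.io/self-managed-high-availability":"true"}'
packageserver = '{"deployment.kubernetes.io/revision":"1"}'

if __name__ == "__main__":
    print("package-server-manager:", profile_annotations(psm))
    print("packageserver:", profile_annotations(packageserver))
```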


[cloud-user@preserve-olm-env jian]$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws/1480634289883189248/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76dfccdf84-bsfpx_cluster-version-operator.log | grep 'Running sync.*in state\|openshift-operator-lifecycle-manager/package-server-manager' | tail
I0110 21:01:49.045384       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:05:11.528899       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:05:39.847068       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:05:39.939473       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:09:02.425954       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:09:30.680512       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:09:30.780506       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:12:53.266564       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:13:21.572470       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:13:21.671508       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
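
For anyone repeating this check, the per-cycle count can be pulled out of a CVO log with a short Python sketch like the one below. The regular expressions are written against the log lines quoted in this bug and are an assumption about the format, which may differ in other CVO versions; `syncs_per_cycle` is a hypothetical helper.

```python
import re

# Start of a sync cycle, e.g. "Running sync 4.10.0-0.ci-... (force=false) on generation 2 in state ..."
CYCLE = re.compile(r"Running sync \S+ \(force=\w+\) on generation \d+ in state")
# Start of a per-manifest task, e.g. 'Running sync for deployment "ns/name" (573 of 766)'
TASK = re.compile(r'Running sync for deployment "([^"]+)" \((\d+) of (\d+)\)')


def syncs_per_cycle(lines, target):
    """Return one count per sync cycle of how often `target` started syncing.

    Task lines seen before the first cycle header are ignored.
    """
    counts = []
    for line in lines:
        if CYCLE.search(line):
            counts.append(0)
            continue
        match = TASK.search(line)
        if match and match.group(1) == target and counts:
            counts[-1] += 1
    return counts


if __name__ == "__main__":
    import sys

    target = "openshift-operator-lifecycle-manager/package-server-manager"
    # Usage: curl -s <log URL> | python3 this_script.py
    print(syncs_per_cycle(sys.stdin, target))
```

Fed the 4.9 log from the description, this should report 2 for each complete cycle, and 1 for the 4.10 log above.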

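The CVO log above shows package-server-manager synced once per sync cycle (573 of 766), instead of twice as in 4.9. Looks good to me; marking as verified.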

Comment 6 errata-xmlrpc 2022-03-10 16:37:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056