Bug 2037168 - IBM-specific Deployment manifest for package-server-manager should be excluded on non-IBM cluster-profiles
Summary: IBM-specific Deployment manifest for package-server-manager should be excluded on non-IBM cluster-profiles
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-05 06:49 UTC by W. Trevor King
Modified: 2022-03-10 16:37 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:37:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift operator-framework-olm pull 238 0 None open Bug 2037168: Remove incorrect cvo annotations 2022-01-05 15:38:38 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:37:23 UTC

Description W. Trevor King 2022-01-05 06:49:04 UTC
From [1]:

  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    include.release.openshift.io/ibm-cloud-managed: "true"

You want ibm-cloud-managed in that IBM-specific manifest, but not the other two annotations, because those profiles are already covered by the sibling, non-IBM manifest [2].  You should at least drop self-managed-high-availability from the IBM-specific manifest, so that the self-managed-high-availability cluster-version operator does not try to reconcile both the IBM-specific and non-IBM manifests for that one Deployment simultaneously.

Depending on how much you want to clean up, you can also drop the unused single-node-developer profile across the board; see [3].
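Concretely, the suggested cleanup would leave the IBM-specific manifest declaring only the profile it is meant to serve (a sketch of the intended end state, not the merged change):

```yaml
# manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
# Only the IBM profile opts in; the self-managed and single-node profiles
# are served by the sibling non-IBM manifest instead.
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
```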

Seems like this affects 4.9 too, and a backport is probably worth the trouble:

$ git grep include.release.openshift.io/self-managed-high-availability origin/release-4.9 -- manifests/ | grep ibm
origin/release-4.9:manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml:    include.release.openshift.io/self-managed-high-availability: "true"

to avoid the CVO flapping the nodeSelector:

$ git checkout origin/release-4.9
$ git --no-pager log -1 --oneline
5fc4c78bb (HEAD, origin/release-4.9) Merge pull request #215 from dinhxuanvu/upgrade-delay-4.9
$ diff -u manifests/0000_50_olm_06-psm-operator.deployment.yaml manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
--- manifests/0000_50_olm_06-psm-operator.deployment.yaml       2022-01-04 22:34:58.219169459 -0800
+++ manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml     2022-01-04 22:34:58.219169459 -0800
@@ -8,6 +8,7 @@
   annotations:
     include.release.openshift.io/self-managed-high-availability: "true"
     include.release.openshift.io/single-node-developer: "true"
+    include.release.openshift.io/ibm-cloud-managed: "true"
 spec:
   strategy:
     type: RollingUpdate
@@ -64,7 +65,6 @@
           terminationMessagePolicy: FallbackToLogsOnError
       nodeSelector:
         kubernetes.io/os: linux
-        node-role.kubernetes.io/master: ""
       tolerations:
         - effect: NoSchedule
           key: node-role.kubernetes.io/master

Poking at recent 4.9 CI [4]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-6f8b969579-q8dx4_cluster-version-operator.log | grep 'Running sync.*in state\|openshift-operator-lifecycle-manager/package-server-manager' | tail
I0104 07:12:57.829476       1 sync_worker.go:542] Running sync 4.9.0-0.nightly-2022-01-04-060802 (force=false) on generation 2 in state Reconciling at attempt 0
I0104 07:13:25.186757       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:13:25.286909       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:13:25.286941       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:13:25.384516       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:16:16.647386       1 sync_worker.go:542] Running sync 4.9.0-0.nightly-2022-01-04-060802 (force=false) on generation 2 in state Reconciling at attempt 0
I0104 07:16:44.002400       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:16:44.102762       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (547 of 737)
I0104 07:16:44.102795       1 sync_worker.go:753] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)
I0104 07:16:44.204445       1 sync_worker.go:765] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (548 of 737)

So you're currently not actually getting CVO contention because our nodeSelector merge strategy is "require the cluster to contain everything in the manifest, but do not remove unrecognized entries" [5].  But still, assuming that 4.9 CVO will never become more strict about nodeSelector reconciliation is brittle, and asking the CVO to reconcile the same Deployment twice in each sync cycle isn't very efficient.
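The merge strategy in [5] can be illustrated with a small sketch (not the actual CVO code; `ensureNodeSelector` is a hypothetical name): keys from the manifest's nodeSelector are required to be present with the manifest's values, but keys the manifest does not mention are left in place rather than removed.

```go
package main

import "fmt"

// ensureNodeSelector sketches the CVO's "require everything in the manifest,
// but do not remove unrecognized entries" merge strategy for nodeSelector.
// It returns the merged selector and whether anything had to change.
func ensureNodeSelector(existing, required map[string]string) (map[string]string, bool) {
	modified := false
	merged := map[string]string{}
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range required {
		if cur, ok := merged[k]; !ok || cur != v {
			merged[k] = v
			modified = true
		}
	}
	return merged, modified
}

func main() {
	// In-cluster state after the non-IBM manifest was applied: it includes
	// the master node-role selector.
	inCluster := map[string]string{
		"kubernetes.io/os":               "linux",
		"node-role.kubernetes.io/master": "",
	}
	// Reconciling the IBM-specific manifest, which omits the master
	// selector, makes no change: the extra key is simply not removed.
	merged, modified := ensureNodeSelector(inCluster, map[string]string{"kubernetes.io/os": "linux"})
	fmt.Println(modified, len(merged)) // no modification; both keys survive
}
```

This is why the two overlapping manifests merely cost an extra reconcile per sync cycle today instead of producing visible nodeSelector flapping.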

[1]: https://github.com/openshift/operator-framework-olm/blame/ca5d761a86bd1556b7bea1250fcd7a02f2fff337/manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml#L9-L10
[2]: https://github.com/openshift/operator-framework-olm/blob/ca5d761a86bd1556b7bea1250fcd7a02f2fff337/manifests/0000_50_olm_06-psm-operator.deployment.yaml#L9-L10
[3]: https://github.com/openshift/cluster-version-operator/pull/685
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws/1478247345723281408
[5]: https://github.com/openshift/cluster-version-operator/blob/a14f4e2b87e04d6b81aaa55890be088281f5a550/lib/resourcemerge/core.go#L50

Comment 3 Jian Zhang 2022-01-11 08:00:59 UTC
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-10-144202   True        False         8h      Cluster version is 4.10.0-0.nightly-2022-01-10-144202

The `single-node-developer` and `self-managed-high-availability` annotations for PSM have been removed, as shown below:

[cloud-user@preserve-olm-env jian]$ oc get deployment package-server-manager -o=jsonpath='{.metadata.annotations}'
{"deployment.kubernetes.io/revision":"1","include.release.openshift.io/self-managed-high-availability":"true"}

[cloud-user@preserve-olm-env jian]$ oc get deployment packageserver -o=jsonpath='{.metadata.annotations}'
{"deployment.kubernetes.io/revision":"1"}


[cloud-user@preserve-olm-env jian]$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws/1480634289883189248/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76dfccdf84-bsfpx_cluster-version-operator.log | grep 'Running sync.*in state\|openshift-operator-lifecycle-manager/package-server-manager' | tail
I0110 21:01:49.045384       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:05:11.528899       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:05:39.847068       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:05:39.939473       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:09:02.425954       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:09:30.680512       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:09:30.780506       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:12:53.266564       1 sync_worker.go:546] Running sync 4.10.0-0.ci-2022-01-10-042939 (force=false) on generation 2 in state Reconciling at attempt 0
I0110 21:13:21.572470       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)
I0110 21:13:21.671508       1 sync_worker.go:771] Done syncing for deployment "openshift-operator-lifecycle-manager/package-server-manager" (573 of 766)

Looks good to me; verifying it.

Comment 6 errata-xmlrpc 2022-03-10 16:37:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

