Description of problem:

From https://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html:

tl;dr: Folks may want to set relatedObjects in their ClusterOperator manifests to help must-gather when their operator fails to come up.

Over a year ago, the cluster-version operator began pre-creating ClusterOperator manifests [1] to move us from:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must-gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

to:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must-gather uses ClusterOperators for discovering important stuff, finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to troubleshoot the issue.

We've recently tweaked that pre-creation to avoid some ClusterOperatorDown and ClusterOperatorDegraded alerts on updates [2], which got me looking into this space again. Auditing 4.8.0-fc.4:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-fc.4-x86_64
$ for X in manifests/*.yaml; do yaml2json < "${X}" | jq -r '.[] | select(.kind == "ClusterOperator") | (.status.relatedObjects // [] | length | tostring) as $r | $r + " " + .metadata.name'; done | sort -n
0 authentication
0 baremetal
0 cloud-credential
0 cluster-autoscaler
0 config-operator
0 console
0 csi-snapshot-controller
0 dns
0 image-registry
0 ingress
0 insights
0 machine-api
0 machine-approver
0 marketplace
0 monitoring
0 network
0 node-tuning
0 openshift-apiserver
0 openshift-controller-manager
0 openshift-samples
0 operator-lifecycle-manager
0 operator-lifecycle-manager-catalog
0 operator-lifecycle-manager-packageserver
0 service-ca
0 storage
4 kube-storage-version-migrator
5 etcd
7 kube-controller-manager
7 kube-scheduler
7 machine-config
11 kube-apiserver

The ClusterOperators without relatedObjects in their manifests are vulnerable to must-gathers that lack the information needed to debug why they are failing to install. That's not a big deal; I don't hear all that many reports about 2nd-level operator pods failing to come up. But it's a pretty straightforward change to include some static references in your ClusterOperator manifest just in case. It doesn't have to be complete; obviously, if you have any references with dynamic names, those will have to wait on your operator. But things like namespaces and such have static names and can be rolled into the manifest like [3] (a rough sketch follows below).

Much more important is having your relatedObjects filled out by your operator, because the CVO's pre-creation happens only at create-time. You want to make sure your operator is actively managing this property, both for the "adding/removing a related object on update" use-case and for the "stomping crazy admins messing with your ClusterOperator status" use-case, and because you want your references to be complete enough for must-gather to be able to find your resources. For example, I recently noticed that the storage ClusterOperator is not referencing its ClusterRoleBindings [4].
It's harder to audit whether coverage is complete, but it's worth thinking over if you do decide to revisit your relatedObjects.

Cheers,
Trevor

[1]: https://github.com/openshift/cluster-version-operator/pull/318
[2]: https://github.com/openshift/cluster-version-operator/pull/553
[3]: https://github.com/openshift/cluster-etcd-operator/blob/5fc1bdc7b666f499b775a627da97b4bc536e2211/manifests/0000_12_etcd-operator_07_clusteroperator.yaml#L16-L31
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1961317

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
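For illustration, here is a minimal sketch of a pre-created ClusterOperator manifest carrying static relatedObjects, loosely following the etcd pattern referenced in [3]. The operator name, namespace, annotation, and deployment reference below are placeholders, not taken from any real payload:

# Illustrative sketch only; "example" and "openshift-example" are placeholder names.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: example
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
spec: {}
status:
  relatedObjects:
  # Statically named resources can be listed up front so must-gather can find
  # them even if the operator Deployment never comes up.
  - group: ""
    name: openshift-example
    resource: namespaces
  - group: apps
    name: example-operator
    namespace: openshift-example
    resource: deployments

References with dynamically generated names cannot be expressed this way and still need to be populated by the running operator.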
Adding the relatedObjects directly to the baremetal ClusterOperator's manifest should take care of this. At this time, it seems unnecessary to backport this fix to older releases, so the fix is currently targeted only at 4.11.
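As a rough sketch, that manifest addition could look something like the following, mirroring the relatedObjects the operator itself reports at runtime (see the must-gather output below); the exact content belongs to the cluster-baremetal-operator manifest, so treat this as illustrative:

# Sketch only; the resource list mirrors the runtime relatedObjects shown in
# the must-gather output below.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: baremetal
spec: {}
status:
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    resource: provisioning
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: hostfirmwaresettings
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: firmwareschemas
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: preprovisioningimages
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: bmceventsubscriptions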
You can look at a non-baremetal CI run to see whether the "baremetal" ClusterOperator has relatedObjects set.
It is not clear from the above output if /home/kni/manifests/0000_31_cluster-baremetal-operator_07_clusteroperator.cr.yaml was indeed extracted from the image.

----------------------------------------------------------------------------------------------------------------------------------------------------

I extracted must-gather.tar from this image: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws/1526797967619526656/. Note that this is a non-baremetal image. Then I proceeded to extract the baremetal ClusterOperator from this image:

-------------------------------------------------------------------------------
[sdasu@sdasu Downloads]$ cat quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5aba938c92897f86ce705a3b38a677e4e8bb4a4e900da50250dc7d977d74e3fb/cluster-scoped-resources/config.openshift.io/clusteroperators/baremetal.yaml
---
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    capability.openshift.io/name: baremetal
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-05-18T05:44:10Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:capability.openshift.io/name: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
          f:include.release.openshift.io/self-managed-high-availability: {}
          f:include.release.openshift.io/single-node-developer: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"3106ebed-d1da-4d12-bd89-eb3d740ab598"}: {}
      f:spec: {}
    manager: Go-http-client
    operation: Update
    time: "2022-05-18T05:44:10Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:extension: {}
        f:relatedObjects: {}
    manager: Go-http-client
    operation: Update
    subresource: status
    time: "2022-05-18T05:44:10Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:versions: {}
    manager: cluster-baremetal-operator
    operation: Update
    subresource: status
    time: "2022-05-18T05:47:03Z"
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 3106ebed-d1da-4d12-bd89-eb3d740ab598
  resourceVersion: "8363"
  uid: 521cffda-afd5-4110-93e2-74ed50211acf
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    message: Operational
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    message: Nothing to do on this Platform
    reason: UnsupportedPlatform
    status: "True"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: hostfirmwaresettings
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: firmwareschemas
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: preprovisioningimages
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: bmceventsubscriptions
  versions:
  - name: operator
    version: 4.11.0-0.nightly-2022-05-18-053037
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069