Bug 1961844 - baremetal ClusterOperator installed by CVO does not have relatedObjects
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: sdasu
QA Contact: Eldar Weiss
Depends On:
Reported: 2021-05-18 19:56 UTC by sdasu
Modified: 2022-08-10 10:36 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-08-10 10:36:25 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-baremetal-operator pull 255 | 0 | None | open | Bug 1961844: Adding baremetal ClusterOperator relatedObjects directly to its manifest | 2022-03-28 20:59:49 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:36:35 UTC

Description sdasu 2021-05-18 19:56:39 UTC
Description of problem:
From https://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html:

tl;dr: Folks may want to set relatedObjects in their ClusterOperator
manifests to help must-gather when their operator fails to come up.

Over a year ago, the cluster-version operator began pre-creating
ClusterOperator manifests [1] to move us from:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

to:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator namespace,
   and allows us to trouble-shoot the issue.

We've recently tweaked that to avoid some ClusterOperatorDown and
ClusterOperatorDegraded on updates [2], which got me looking into this
space again.  Auditing 4.8.0-fc.4:

$ oc adm release extract --to manifests
$ for X in manifests/*.yaml; do yaml2json < "${X}" | jq -r '.[] |
select(.kind == "ClusterOperator") | (.status.relatedObjects // [] |
length | tostring) as $r | $r + " " + .metadata.name'; done | sort -n
0 authentication
0 baremetal
0 cloud-credential
0 cluster-autoscaler
0 config-operator
0 console
0 csi-snapshot-controller
0 dns
0 image-registry
0 ingress
0 insights
0 machine-api
0 machine-approver
0 marketplace
0 monitoring
0 network
0 node-tuning
0 openshift-apiserver
0 openshift-controller-manager
0 openshift-samples
0 operator-lifecycle-manager
0 operator-lifecycle-manager-catalog
0 operator-lifecycle-manager-packageserver
0 service-ca
0 storage
4 kube-storage-version-migrator
5 etcd
7 kube-controller-manager
7 kube-scheduler
7 machine-config
11 kube-apiserver

The ClusterOperators without relatedObjects in their manifests are
vulnerable to must-gathers which lack information needed to debug why
they are failing to install.  That's not a big deal; I don't hear all
that many reports about 2nd-level operator pods failing to come up.
But it's a pretty straightforward change to include some static
references in your ClusterOperator manifests just in case.  It doesn't
have to be complete; obviously if you have any references with dynamic
names, those will have to wait on your operator.  But things like
namespaces and such have static names and can be rolled into the
manifest like [3].
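
A minimal sketch of what such static references could look like for the baremetal operator. The entries are a subset of the relatedObjects that show up in the must-gather output in comment 9; the output path is a hypothetical name chosen for this example, and the snippet is not a copy of the manifest that eventually merged.

```shell
# Write an illustrative ClusterOperator manifest with static relatedObjects.
# The path below is a made-up example location, not the real manifest name.
manifest="${TMPDIR:-/tmp}/clusteroperator-baremetal.example.yaml"

cat > "$manifest" <<'EOF'
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: baremetal
  annotations:
    capability.openshift.io/name: baremetal
    include.release.openshift.io/self-managed-high-availability: "true"
spec: {}
status:
  relatedObjects:
  # Only references with static names belong here; anything with a
  # dynamic name has to be added later by the operator itself.
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
EOF
```

With something like this in the release payload, the ClusterOperator that the CVO pre-creates already points must-gather at the operator's namespace before the operator pod has ever run.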

Much more important is having your relatedObjects filled out by your
operator, because the CVO's pre-creation is only at create-time.  You
want to make sure your operator is actively managing this property,
both for the "adding/removing a related object on update" use-case and
for the "stomping crazy admins messing with your ClusterOperator
status" use-case.  And because you want your references to be complete
enough for must-gather to be able to find your resources.  For
example, I recently noticed that the storage ClusterOperator is not
referencing its ClusterRoleBindings [4].  It's harder to audit whether
coverage is complete, but it's worth thinking over if you do decide to
revisit your relatedObjects.
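
One rough way to spot gaps like the storage one is to list the resource types you expect and check which of them are absent from a ClusterOperator dump. The sketch below is only an assumption about how such a check might be scripted: the function name is hypothetical, and it relies on "resource:" appearing only inside relatedObjects entries, which holds for the ClusterOperator YAML shown in this bug.

```shell
# Print every expected resource type that is NOT referenced by a
# "resource: <name>" line in the given ClusterOperator YAML file.
# Usage: missing_related_resources <file> <resource>...
missing_related_resources() {
  co_file=$1
  shift
  for res in "$@"; do
    # relatedObjects entries each carry one "resource: <plural>" line
    if ! grep -Eq "^[[:space:]]+resource:[[:space:]]+${res}[[:space:]]*\$" "$co_file"; then
      printf '%s\n' "$res"
    fi
  done
}
```

For example, `missing_related_resources storage.yaml namespaces clusterrolebindings` would print `clusterrolebindings` if that type is not referenced. It cannot tell you what the expected list should be; that still needs someone who knows the operator.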


[1]: https://github.com/openshift/cluster-version-operator/pull/318
[2]: https://github.com/openshift/cluster-version-operator/pull/553
[3]: https://github.com/openshift/cluster-etcd-operator/blob/5fc1bdc7b666f499b775a627da97b4bc536e2211/manifests/0000_12_etcd-operator_07_clusteroperator.yaml#L16-L31
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1961317

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 3 sdasu 2022-03-25 15:56:19 UTC
Adding the relatedObjects directly to the baremetal ClusterOperator's manifest should take care of this.
At this time, it seems unnecessary to backport this fix to older releases, so it is currently targeted only at 4.11.

Comment 7 sdasu 2022-05-13 20:52:17 UTC
You can look at a non-baremetal CI run to see if the "baremetal" ClusterOperator has relatedObjects set.
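
As a sketch of that check, assuming the must-gather from the CI run has been unpacked and the ClusterOperator sits at cluster-scoped-resources/config.openshift.io/clusteroperators/baremetal.yaml: the helper below (a hypothetical name) counts the entries under status.relatedObjects using only awk. It assumes the usual oc YAML layout, with "relatedObjects:" indented two spaces and each entry starting with "  - ".

```shell
# Count the entries under status.relatedObjects in a ClusterOperator YAML file.
# Usage: count_related_objects <path-to-clusteroperator.yaml>
count_related_objects() {
  awk '
    /^  relatedObjects:/ { in_block = 1; next }  # start of the list
    in_block && /^  - /  { n++; next }           # each "  - " starts one entry
    in_block && /^    /  { next }                # further keys of the same entry
    in_block             { in_block = 0 }        # any shallower line ends the list
    END { print n + 0 }
  ' "$1"
}
```

A result of 0 means the ClusterOperator has no relatedObjects set. The "f:relatedObjects" key inside managedFields is indented deeper and is not counted.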

Comment 9 sdasu 2022-05-23 15:19:56 UTC
It is not clear from the above output if /home/kni/manifests/0000_31_cluster-baremetal-operator_07_clusteroperator.cr.yaml was indeed extracted from the image.

I extracted must-gather.tar from this image: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws/1526797967619526656/. Note that this is a non-baremetal image.

Then I proceeded to extract the baremetal ClusterOperator from this image:

[sdasu@sdasu Downloads]$ cat quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5aba938c92897f86ce705a3b38a677e4e8bb4a4e900da50250dc7d977d74e3fb/cluster-scoped-resources/config.openshift.io/clusteroperators/baremetal.yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    capability.openshift.io/name: baremetal
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-05-18T05:44:10Z"
  generation: 1
  managedFields:
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:capability.openshift.io/name: {}
          f:exclude.release.openshift.io/internal-openshift-hosted: {}
          f:include.release.openshift.io/self-managed-high-availability: {}
          f:include.release.openshift.io/single-node-developer: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"3106ebed-d1da-4d12-bd89-eb3d740ab598"}: {}
      f:spec: {}
    manager: Go-http-client
    operation: Update
    time: "2022-05-18T05:44:10Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:extension: {}
        f:relatedObjects: {}
    manager: Go-http-client
    operation: Update
    subresource: status
    time: "2022-05-18T05:44:10Z"
  - apiVersion: config.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
        f:versions: {}
    manager: cluster-baremetal-operator
    operation: Update
    subresource: status
    time: "2022-05-18T05:47:03Z"
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 3106ebed-d1da-4d12-bd89-eb3d740ab598
  resourceVersion: "8363"
  uid: 521cffda-afd5-4110-93e2-74ed50211acf
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    message: Operational
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-05-18T05:47:03Z"
    message: Nothing to do on this Platform
    reason: UnsupportedPlatform
    status: "True"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: hostfirmwaresettings
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: firmwareschemas
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: preprovisioningimages
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: bmceventsubscriptions
  versions:
  - name: operator
    version: 4.11.0-0.nightly-2022-05-18-053037

Comment 12 errata-xmlrpc 2022-08-10 10:36:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

