Version: 4.9.0-0.nightly-2021-08-07-175228
Platform: baremetal
Please specify: IPI

What happened?
Following these instructions: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155&form_mode=edit
On a running OCP deployment, the provisioning CR is deleted:

oc delete provisioning provisioning-configuration

The metal3 pods are removed, but the baremetal cluster operator remains Available.

What did you expect to happen?
The metal3 pods are removed and the baremetal cluster operator switches to "Available: False".

How to reproduce it (as minimally and precisely as possible)?
oc delete provisioning provisioning-configuration

Must Gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1991641_must-gather.tar.gz
The "baremetal" cluster operator shows the state of cluster-baremetal-operator (CBO) and not the metal3 pod. So, this is working as expected.
@sdasu The issue is that when the provisioning CR is deleted, we were expecting the CBO to move to Available: False. This is indeed the behavior in OCP 4.8:

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal   4.8.0-0.nightly-2021-08-05-031749   False   False   False   14s
[kni@provisionhost-0-0 ~]$

In OCP 4.9, the CBO remains in state 'Available: True':

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal   4.9.0-0.nightly-2021-08-07-175228   True   False   False   26h
[kni@provisionhost-0-0 ~]$
Looking at the 4.8 code https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L210-L226 the only reason I can see for it going into Available=False is that the ClusterOperator is deleted when we delete the provisioning CR (because there is a controllerRef), and it is then recreated with the defaults, which are Available=False.

Some thoughts:
- should we even set the controller ref?
- the check we do here https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L142 is wrong, because ClusterVersion puts in a reference like the following (see the sketch below):

ownerReferences:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  name: version
  uid: f5167872-0bc2-45a0-8d74-b5077c69280a
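To illustrate that point, here is a minimal Go sketch (not the actual CBO code; the helper names hasAnyOwner and hasControllerOwner are hypothetical) of why a "does the ClusterOperator already have an owner?" style check misfires once ClusterVersion has stamped its own, non-controller ownerReference onto the object:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasAnyOwner is the kind of check that misfires: it reports true as soon as
// the ClusterVersion operator has added its plain ownerReference.
func hasAnyOwner(obj metav1.Object) bool {
	return len(obj.GetOwnerReferences()) > 0
}

// hasControllerOwner only reports true for a controller reference
// (Controller: true), e.g. one pointing at the Provisioning CR.
func hasControllerOwner(obj metav1.Object) bool {
	return metav1.GetControllerOf(obj) != nil
}

func main() {
	// An ObjectMeta resembling the baremetal ClusterOperator after the
	// ClusterVersion operator has set its (non-controller) ownerReference.
	co := &metav1.ObjectMeta{
		Name: "baremetal",
		OwnerReferences: []metav1.OwnerReference{{
			APIVersion: "config.openshift.io/v1",
			Kind:       "ClusterVersion",
			Name:       "version",
			// Controller is left unset, so this is not a controller reference.
		}},
	}
	fmt.Println(hasAnyOwner(co))        // true  -- looks like "already owned"
	fmt.Println(hasControllerOwner(co)) // false -- no Provisioning controllerRef present
}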
@awolff can you point us to why you expect the CBO to move to "Available: False" when you delete the provisioning CR? I agree with @sdasu that the clusterOperator conditions refer to the state of the operator, not what it is managing. The cluster version operator watches these conditions and decides whether or not to upgrade the operator...
Note: the PR I posted is more of a cleanup/clarification of our current state. Potentially we need a fix in 4.8 to set Available=True.
Hi @asalkeld. I had gotten the expected behavior from this test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155 (unfortunately the person who wrote it is no longer at Red Hat) and from the behavior I saw in version 4.8. Is there anywhere else I could see this documented?
Hi @awolff, see this PR and the underlying docs: https://github.com/openshift/api/pull/1000
Version - Cluster version is 4.10.0-0.nightly-2021-10-02-095441
---------------------------------------------------------------

I checked in 4.10; in my opinion everything remains unchanged, just as it was in 4.9. The question is whether this is what we expected or whether it is still a bug.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-02-095441   True        False         3m56s   Cluster version is 4.10.0-0.nightly-2021-10-02-095441
[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal   4.10.0-0.nightly-2021-10-02-095441   True   False   False   33m
We see now that if we use '-o yaml', we get more info:

oc get clusteroperator baremetal -o yaml
...
- lastTransitionTime: "2021-10-04T12:09:09Z"
  message: Provisioning CR not found
  reason: ProvisioningCRNotFound
  status: "True"
  type: Available
...

To me the 'reason' sounds like an error message and can therefore be misleading. I'm wondering if it should be set to something in the spirit of "reason: DeployComplete. Waiting for provisioning CR". Or perhaps even the 'message' is enough to convey the fact that the provisioning CR is missing?
We are not changing the reason, but we can change the message to say "Waiting for Provisioning CR".
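For reference, a minimal Go sketch (the helper name availableCondition is hypothetical; this is not necessarily how CBO structures it) of building the Available condition with that message, mirroring the values seen in the verification below:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// availableCondition builds the baremetal ClusterOperator's Available
// condition; when the Provisioning CR is absent the operator stays
// Available=True and the reason/message explain why it is idle.
func availableCondition(provisioningExists bool) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if !provisioningExists {
		cond.Reason = "WaitingForProvisioningCR"
		cond.Message = "Waiting for Provisioning CR"
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", availableCondition(false))
}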
Tested on: 4.10.0-0.nightly-2021-11-24-030137

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-11-24T09:40:13Z"
  generation: 1
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d20bc9c2-a8a9-4deb-b74e-3075d8ccf20a
  resourceVersion: "55888"
  uid: 070eff11-8029-45f0-9880-833b06afe508
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    reason: WaitingForProvisioningCR
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-24T10:02:37Z"
    message: Waiting for Provisioning CR
    reason: WaitingForProvisioningCR
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "False"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  versions:
  - name: operator
    version: 4.10.0-0.nightl
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056