Bug 1991641
| Summary: | Baremetal Cluster Operator still Available After Delete Provisioning | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adina Wolff <awolff> |
| Component: | Installer | Assignee: | sdasu |
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Adina Wolff <awolff> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | amalykhi, lshilin, prabinov, sdasu, wking |
| Version: | 4.9 | Keywords: | Reopened, Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-12 04:37:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2023604 | | |
| Bug Blocks: | 2011824 | | |
Description

Adina Wolff 2021-08-09 15:04:00 UTC
The "baremetal" cluster operator shows the state of cluster-baremetal-operator (CBO) and not the metal3 pod. So, this is working as expected.

@sdasu The issue is that when the Provisioning CR is deleted, we were expecting the CBO to move to Available: False. This is indeed the behavior in OCP 4.8:

```
[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep bare
baremetal   4.8.0-0.nightly-2021-08-05-031749   False   False   False   14s
```

In OCP 4.9, CBO remains in state 'Available: True':

```
[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep bare
baremetal   4.9.0-0.nightly-2021-08-07-175228   True   False   False   26h
```

Looking at the 4.8 code (https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L210-L226), the only reason I can see for it going into Available=False is that the ClusterOperator is deleted when we delete the Provisioning CR (as there is a controllerRef), and then it gets reset to defaults, which is Available=False.

Some thoughts:
- Should we even set the controller ref?
- The check we do here (https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L142) is wrong, as ClusterVersion puts in a reference like:

```yaml
ownerReferences:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  name: version
  uid: f5167872-0bc2-45a0-8d74-b5077c69280a
```

@awolff Can you point out to us why you believe that you are "expecting the CBO to move to Available: False" when you delete the Provisioning CR? I agree with @sdasu that the ClusterOperator conditions refer to the state of the operator, not of what it is managing.
The cluster-version operator watches these conditions and decides whether or not to upgrade the operator.

Note: the PR I posted is more of a cleanup/clarification of our current state. Potentially we need a fix in 4.8 to set Available=True.

Hi @asalkeld. I had gotten the expected behavior from this test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155 (unfortunately the person who wrote it is no longer at Red Hat), and based on the behavior I saw in version 4.8. Is there anywhere else I could see it documented?

Hi @awolff, see this PR and the underlying docs: https://github.com/openshift/api/pull/1000

Version: Cluster version is 4.10.0-0.nightly-2021-10-02-095441

I checked in 4.10; in my opinion everything remained unchanged, just as it was in 4.9. The question is whether this is what we expected, or whether it is still a bug.

```
[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE    STATUS
version   4.10.0-0.nightly-2021-10-02-095441   True        False         3m56s    Cluster version is 4.10.0-0.nightly-2021-10-02-095441
[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator | grep bare
baremetal   4.10.0-0.nightly-2021-10-02-095441   True   False   False   33m
```

We see now that if we use '-o yaml', we get more info:

```
oc get clusteroperator baremetal -o yaml
. . . .
- lastTransitionTime: "2021-10-04T12:09:09Z"
  message: Provisioning CR not found
  reason: ProvisioningCRNotFound
  status: "True"
  type: Available
...
```

To me the 'reason' sounds like an error message and can therefore be misleading. I'm wondering if it should be set to something in the spirit of "reason: DeployComplete. Waiting for provisioning CR". Or perhaps even the 'message' is enough to convey the fact that the Provisioning CR is missing?
Not changing the reason, but we can change the message to say "Waiting for Provisioning CR".

Tested on: 4.10.0-0.nightly-2021-11-24-030137

```yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-11-24T09:40:13Z"
  generation: 1
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d20bc9c2-a8a9-4deb-b74e-3075d8ccf20a
  resourceVersion: "55888"
  uid: 070eff11-8029-45f0-9880-833b06afe508
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    reason: WaitingForProvisioningCR
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-24T10:02:37Z"
    message: Waiting for Provisioning CR
    reason: WaitingForProvisioningCR
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "False"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  versions:
  - name: operator
    version: 4.10.0-0.nightl
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056