Bug 1991641

Summary: Baremetal Cluster Operator still Available After Delete Provisioning
Product: OpenShift Container Platform
Component: Installer (sub component: OpenShift on Bare Metal IPI)
Reporter: Adina Wolff <awolff>
Assignee: sdasu
QA Contact: Adina Wolff <awolff>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: amalykhi, lshilin, prabinov, sdasu, wking
Version: 4.9
Keywords: Reopened, Triaged
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Linux
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-03-12 04:37:24 UTC
Bug Depends On: 2023604
Bug Blocks: 2011824

Description Adina Wolff 2021-08-09 15:04:00 UTC
Version:
4.9.0-0.nightly-2021-08-07-175228

Platform:

baremetal

Please specify:
IPI

What happened?

Following these instructions:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155&form_mode=edit

On a running OCP deployment, the provisioning CR is deleted:
oc delete provisioning provisioning-configuration

The metal3 pods are removed, but the baremetal cluster operator remains available.


What did you expect to happen?

The metal3 pods are removed, and the baremetal cluster operator switches to "Available: False".

How to reproduce it (as minimally and precisely as possible)?

oc delete provisioning provisioning-configuration

Must Gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1991641_must-gather.tar.gz

Comment 1 sdasu 2021-08-10 16:12:30 UTC
The "baremetal" cluster operator shows the state of cluster-baremetal-operator (CBO) and not the metal3 pod. So, this is working as expected.

Comment 2 Adina Wolff 2021-08-11 14:13:26 UTC
@sdasu 
The issue is that when the provisioning CR is deleted, we were expecting the CBO to move to Available: False. 
This is indeed the behavior in OCP 4.8:

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.8.0-0.nightly-2021-08-05-031749   False       False         False      14s
[kni@provisionhost-0-0 ~]$ 


In OCP 4.9, CBO remains in state 'Available: True'

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.9.0-0.nightly-2021-08-07-175228   True        False         False      26h     
[kni@provisionhost-0-0 ~]$

Comment 3 Angus Salkeld 2021-08-16 22:38:52 UTC
Looking at the 4.8 code https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L210-L226 the only
reason I can see for it going into Available=False is that the ClusterOperator is deleted when we delete the provisioning CR (as there is a controllerRef) and then
it gets recreated with defaults, which include Available=False.

Some thoughts:
- should we even set the controller ref?
- the check we do here https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L142 is wrong as ClusterVersion puts in a reference like:
 ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: f5167872-0bc2-45a0-8d74-b5077c69280a
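As a minimal sketch of that ownership check: the heredoc below stands in for the metadata a live cluster would return (on a real cluster, something like `oc get clusteroperator baremetal -o jsonpath='{.metadata.ownerReferences[0].kind}'` would fetch it directly); the extraction logic is illustrative, not CBO's actual code.

```shell
# Sketch only: a heredoc stands in for the metadata that
#   oc get clusteroperator baremetal -o yaml
# would return on a live cluster.
meta=$(cat <<'EOF'
ownerReferences:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  name: version
EOF
)
# Pull out the owner's kind; with the reference above, the check should
# expect ClusterVersion here rather than a controllerRef CBO set itself.
owner_kind=$(printf '%s\n' "$meta" | sed -n 's/^ *kind: //p')
echo "owner kind: $owner_kind"
```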

Comment 4 Angus Salkeld 2021-08-16 23:55:47 UTC
@awolff can you point out to us why you believe that you are "expecting the CBO to move to Available: False" when you delete the provisioning CR?

I agree with @sdasu that the ClusterOperator conditions refer to the state of the operator, not of what it is managing.
The Cluster Version Operator watches these conditions and decides whether or not to upgrade the operator...
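As an illustrative sketch of reading the condition the CVO consumes: the heredoc stands in for live `oc get clusteroperator baremetal -o yaml` output (field names follow the conditions shown elsewhere in this bug), and the grep/sed pipeline is just one way to pull the value out of a saved dump.

```shell
# Sketch only: on a live cluster the conditions would come from
#   oc get clusteroperator baremetal -o yaml
# Here a heredoc stands in for that output.
co_yaml=$(cat <<'EOF'
status:
  conditions:
  - status: "True"
    type: Available
  - status: "False"
    type: Progressing
  - status: "False"
    type: Degraded
EOF
)
# Find the line preceding 'type: Available' and strip it down to the
# quoted status value.
available=$(printf '%s\n' "$co_yaml" | grep -B1 'type: Available' \
  | sed -n 's/.*status: "\(.*\)"/\1/p')
echo "Available=$available"
```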

Comment 5 Angus Salkeld 2021-08-17 00:16:46 UTC
Note: the PR I posted is more of a cleanup/clarification of our current state.
Potentially we need a fix in 4.8 to set Available=True.

Comment 6 Adina Wolff 2021-08-17 06:55:49 UTC
Hi @asalkeld. I had gotten the expected behavior from this test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155 (unfortunately the person who wrote it is no longer at Red Hat), and from the behavior I saw in version 4.8.
Is there anywhere else I could see it documented?

Comment 7 Angus Salkeld 2021-09-15 04:15:01 UTC
Hi @awolff see this PR and the underlying docs https://github.com/openshift/api/pull/1000

Comment 9 Polina Rabinovich 2021-10-04 12:51:04 UTC
Version - Cluster version is 4.10.0-0.nightly-2021-10-02-095441
---------------------------------------------------------------
I checked in 4.10; in my opinion everything remains unchanged, just as it was in 4.9. The question is whether this is what we expected, or whether it is still a bug.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-02-095441   True        False         3m56s   Cluster version is 4.10.0-0.nightly-2021-10-02-095441

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted

[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.10.0-0.nightly-2021-10-02-095441   True        False         False      33m

Comment 10 Adina Wolff 2021-10-07 08:07:54 UTC
We see now that if we use '-o yaml', we get more info:

oc get clusteroperator baremetal -o yaml
...
  - lastTransitionTime: "2021-10-04T12:09:09Z"
    message: Provisioning CR not found
    reason: ProvisioningCRNotFound
    status: "True"
    type: Available

...

To me the 'reason' sounds like an error message and can therefore be misleading.
I'm wondering if it should be set to something in the spirit of:
"reason: DeployComplete. Waiting for provisioning CR"
Or perhaps even the 'message' is enough to convey the fact that the provisioning CR is missing?
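A small sketch of pulling both fields out for comparison: the heredoc reproduces the condition fragment quoted above (on a live cluster, a jsonpath query such as `{.status.conditions[?(@.type=="Available")].reason}` would do the same); the sed expressions are illustrative.

```shell
# Sketch: extract the Available condition's reason and message from a
# saved dump; the heredoc reproduces the fragment quoted in this comment.
cond=$(cat <<'EOF'
- lastTransitionTime: "2021-10-04T12:09:09Z"
  message: Provisioning CR not found
  reason: ProvisioningCRNotFound
  status: "True"
  type: Available
EOF
)
reason=$(printf '%s\n' "$cond" | sed -n 's/^ *reason: //p')
message=$(printf '%s\n' "$cond" | sed -n 's/^ *message: //p')
echo "reason: $reason"
echo "message: $message"
```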

Comment 11 sdasu 2021-11-08 18:04:50 UTC
Not changing the reason but can change the message to say "Waiting for Provisioning CR".

Comment 14 Adina Wolff 2021-11-24 11:58:20 UTC
Tested on: 4.10.0-0.nightly-2021-11-24-030137

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-11-24T09:40:13Z"
  generation: 1
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d20bc9c2-a8a9-4deb-b74e-3075d8ccf20a
  resourceVersion: "55888"
  uid: 070eff11-8029-45f0-9880-833b06afe508
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    reason: WaitingForProvisioningCR
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-24T10:02:37Z"
    message: Waiting for Provisioning CR
    reason: WaitingForProvisioningCR
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "False"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  versions:
  - name: operator
    version: 4.10.0-0.nightl

Comment 18 errata-xmlrpc 2022-03-12 04:37:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056