Bug 1991641 - Baremetal Cluster Operator still Available After Delete Provisioning
Summary: Baremetal Cluster Operator still Available After Delete Provisioning
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Linux
Target Milestone: ---
Target Release: 4.10.0
Assignee: sdasu
QA Contact: Adina Wolff
Depends On: 2023604
Blocks: 2011824
Reported: 2021-08-09 15:04 UTC by Adina Wolff
Modified: 2022-03-12 04:37 UTC (History)
5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2022-03-12 04:37:24 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift cluster-baremetal-operator pull 191 0 None None None 2021-08-16 23:49:19 UTC
Github openshift cluster-baremetal-operator pull 214 0 None open Bug 1991641: Fix CO message when Provisioning CR is not present 2021-11-08 18:17:54 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:37:46 UTC

Internal Links: 2011824

Description Adina Wolff 2021-08-09 15:04:00 UTC



Please specify:

What happened?

Following these instructions:

On a running OCP deployment, the Provisioning CR is deleted:
oc delete provisioning provisioning-configuration

The metal3 pods are removed, but the baremetal cluster operator remains Available.

What did you expect to happen?

The metal3 pods are removed, and the baremetal cluster operator switches to "Available: False".

How to reproduce it (as minimally and precisely as possible)?

oc delete provisioning provisioning-configuration

Must Gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1991641_must-gather.tar.gz

Comment 1 sdasu 2021-08-10 16:12:30 UTC
The "baremetal" cluster operator shows the state of cluster-baremetal-operator (CBO) and not the metal3 pod. So, this is working as expected.

Comment 2 Adina Wolff 2021-08-11 14:13:26 UTC
The issue is that when the provisioning CR is deleted, we were expecting the CBO to move to Available: False. 
This is indeed the behavior in OCP 4.8:

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.8.0-0.nightly-2021-08-05-031749   False       False         False      14s
[kni@provisionhost-0-0 ~]$ 

In OCP 4.9, CBO remains in state 'Available: True'

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted
[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.9.0-0.nightly-2021-08-07-175228   True        False         False      26h     
[kni@provisionhost-0-0 ~]$

Comment 3 Angus Salkeld 2021-08-16 22:38:52 UTC
Looking at the 4.8 code (https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L210-L226), the only reason I can see for it going to Available=False is that the ClusterOperator is deleted along with the Provisioning CR (as there is a controllerRef), and it is then recreated with the defaults, which include Available=False.

Some thoughts:
- Should we even set the controller ref?
- The check we do here (https://github.com/openshift/cluster-baremetal-operator/blob/release-4.8/controllers/clusteroperator.go#L142) is wrong, as ClusterVersion inserts a reference like:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: f5167872-0bc2-45a0-8d74-b5077c69280a

Comment 4 Angus Salkeld 2021-08-16 23:55:47 UTC
@awolff can you point out to us why you believe that you are "expecting the CBO to move to Available: False" when you delete the provisioning CR?

I agree with @sdasu that the clusterOperator conditions refer to the state of the operator, not what it is managing.
Cluster version operator watches these conditions and decides whether or not to upgrade the operator...

Comment 5 Angus Salkeld 2021-08-17 00:16:46 UTC
Note: the PR I posted is more of a cleanup/clarification of our current state.
Potentially we need a fix in 4.8 to set Available=True.

Comment 6 Adina Wolff 2021-08-17 06:55:49 UTC
Hi @asalkeld. I had taken the expected behavior from this test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-38155 (unfortunately the person who wrote it is no longer at Red Hat), and from the behavior I saw in version 4.8.
Is it documented anywhere else I could check?

Comment 7 Angus Salkeld 2021-09-15 04:15:01 UTC
Hi @awolff see this PR and the underlying docs https://github.com/openshift/api/pull/1000

Comment 9 Polina Rabinovich 2021-10-04 12:51:04 UTC
Version - Cluster version is 4.10.0-0.nightly-2021-10-02-095441
I checked in 4.10; as far as I can tell, the behavior is unchanged from 4.9. The question is whether this is what we expected, or still a bug.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-02-095441   True        False         3m56s   Cluster version is 4.10.0-0.nightly-2021-10-02-095441

[kni@provisionhost-0-0 ~]$ oc delete provisioning provisioning-configuration
provisioning.metal3.io "provisioning-configuration" deleted

[kni@provisionhost-0-0 ~]$ oc get clusteroperator|grep bare
baremetal                                  4.10.0-0.nightly-2021-10-02-095441   True        False         False      33m

Comment 10 Adina Wolff 2021-10-07 08:07:54 UTC
We see now that if we use '-o yaml', we get more info:

oc get clusteroperator baremetal -o yaml
  - lastTransitionTime: "2021-10-04T12:09:09Z"
    message: Provisioning CR not found
    reason: ProvisioningCRNotFound
    status: "True"
    type: Available


To me the 'reason' sounds like an error message and can therefore be misleading.
I'm wondering if it should be set to something in the spirit of:
"reason: DeployComplete. Waiting for provisioning CR"
Or perhaps the 'message' alone is enough to convey the fact that the Provisioning CR is missing?
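Pulling just those fields out of the operator status is a small exercise in walking the conditions list; a sketch (the JSON below is a hand-trimmed stand-in for real `oc get clusteroperator baremetal -o json` output, not captured from a cluster, and the helper function is hypothetical):

```python
import json

# Hand-written stand-in for `oc get clusteroperator baremetal -o json`
# output, trimmed to the fields used below.
raw = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Progressing", "status": "False",
       "reason": "ProvisioningCRNotFound"},
      {"type": "Available", "status": "True",
       "reason": "ProvisioningCRNotFound",
       "message": "Provisioning CR not found"}
    ]
  }
}
""")

def condition(co, cond_type):
    """Return the first condition of the given type, or None."""
    for c in co["status"]["conditions"]:
        if c["type"] == cond_type:
            return c
    return None

avail = condition(raw, "Available")
print(avail["status"], avail["reason"], avail["message"])
# -> True ProvisioningCRNotFound Provisioning CR not found
```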

Comment 11 sdasu 2021-11-08 18:04:50 UTC
We are not changing the reason, but we can change the message to say "Waiting for Provisioning CR".

Comment 14 Adina Wolff 2021-11-24 11:58:20 UTC
Tested on: 4.10.0-0.nightly-2021-11-24-030137

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-11-24T09:40:13Z"
  generation: 1
  name: baremetal
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d20bc9c2-a8a9-4deb-b74e-3075d8ccf20a
  resourceVersion: "55888"
  uid: 070eff11-8029-45f0-9880-833b06afe508
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    reason: WaitingForProvisioningCR
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-24T11:50:55Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-24T10:02:37Z"
    message: Waiting for Provisioning CR
    reason: WaitingForProvisioningCR
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-11-24T10:01:29Z"
    status: "False"
    type: Disabled
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-machine-api
    resource: namespaces
  - group: metal3.io
    name: ""
    namespace: openshift-machine-api
    resource: baremetalhosts
  - group: metal3.io
    name: ""
    resource: provisioning
  versions:
  - name: operator
    version: 4.10.0-0.nightl
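The verification above boils down to a handful of condition checks; they could be automated along these lines (condition values copied from the 4.10 output above; the `check` helper is hypothetical, not part of any OpenShift tooling):

```python
# Expected condition set after the fix, copied from the 4.10 output above.
conditions = [
    {"type": "Progressing", "status": "False",
     "reason": "WaitingForProvisioningCR"},
    {"type": "Degraded", "status": "False"},
    {"type": "Available", "status": "True",
     "reason": "WaitingForProvisioningCR",
     "message": "Waiting for Provisioning CR"},
    {"type": "Upgradeable", "status": "True"},
    {"type": "Disabled", "status": "False"},
]

def check(conds, cond_type, status, reason=None, message=None):
    """True if a condition of cond_type matches all expected fields."""
    for c in conds:
        if c["type"] != cond_type:
            continue
        return (c["status"] == status
                and (reason is None or c.get("reason") == reason)
                and (message is None or c.get("message") == message))
    return False

# With no Provisioning CR, the operator should stay Available but signal
# that it is waiting, rather than reporting an error-sounding reason.
assert check(conditions, "Available", "True",
             reason="WaitingForProvisioningCR",
             message="Waiting for Provisioning CR")
assert check(conditions, "Degraded", "False")
print("conditions match expected post-fix state")
```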

Comment 18 errata-xmlrpc 2022-03-12 04:37:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

