
Bug 1836237

Summary:	[telco] OpenShift 4.4.3 Bare Metal IPI: Failed to perform inspection
Product:	OpenShift Container Platform
Component:	Bare Metal Hardware Provisioning
Sub component:	ironic
Reporter:	Jean-Francois Saucier <jsaucier>
Assignee:	Julia Kreger <jkreger>
QA Contact:	Raviv Bar-Tal <rbartal>
Status:	CLOSED NOTABUG
Severity:	medium
Priority:	medium
CC:	athomas, beth.white, dtantsur, ealcaniz, fsimonce, hpokorny, jiazhang, jkreger, rlopez, stbenjam
Version:	4.4
Keywords:	Triaged
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-07-20 20:11:11 UTC
Type:	Bug

Description Jean-Francois Saucier 2020-05-15 13:02:32 UTC

Description of the problem:
On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is observed with ironic-python-agent on HP ProLiant hardware:

1. worker node is powered off via IPMI

2. worker reboots automatically into UEFI mode

3. in UEFI mode, all NICs are exhausted

4. worker reboots into legacy BIOS mode

5. worker machines are scaled as expected

6. workers are manually reconfigured for UEFI mode and rebooted

7. in the OpenShift API, worker machines never reach the "provisioned" state, as shown by "oc -n openshift-machine-api get baremetalhosts"

8. workers are manually rebooted

9. when the workers come up, they PXE boot, and the Ironic Python Agent receives the error "The following failures happened during running pre-processing hooks: Node not found hook failed: Port foo already exists, uuid: bar"

10. Workers eventually scale back down to 0, but they do not disappear from the output of "oc get bmh", nor from the Ironic MariaDB database
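
For reference, the check in step 7 can be scripted. The following is a minimal sketch, not part of the original report: it assumes the JSON output of "oc -n openshift-machine-api get baremetalhosts -o json" and assumes the provisioning state is exposed under status.provisioning.state of each BareMetalHost. The function names are hypothetical.

```python
import json
import subprocess
from typing import List, Tuple


def unprovisioned_hosts(bmh_list: dict) -> List[Tuple[str, str]]:
    """Return (name, state) for every host whose state is not 'provisioned'."""
    result = []
    for item in bmh_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # Assumption: the state lives at status.provisioning.state.
        state = item.get("status", {}).get("provisioning", {}).get("state", "")
        if state != "provisioned":
            result.append((name, state or "<unknown>"))
    return result


def fetch_bmh_list(namespace: str = "openshift-machine-api") -> dict:
    """Run the `oc` query from step 7 with JSON output and parse it."""
    out = subprocess.run(
        ["oc", "-n", namespace, "get", "baremetalhosts", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling unprovisioned_hosts(fetch_bmh_list()) on a machine with cluster access would list the hosts that are stuck before "provisioned", together with the state they are stuck in.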


Version-Release number of the following components:
OpenShift 4.4.3


How reproducible:
Every time

Comment 8 Julia Kreger 2020-05-20 13:59:44 UTC

Two questions: (In reply to Jean-Francois Saucier from comment #0)
> Description of the problem:
> On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is exposed in
> ironic-python-agent on hardware of type HP Proliant :
> 
> 1. worker node is powered off via IPMI
> 
> 2. worker reboots automatically into UEFI mode
> 
> 3. in UEFI mode, all NICS are exhausted
> 

Can the user confirm that the NICs, as shown by the UEFI-mode firmware, are enabled for PXE?

Also, what sort of interface is the node trying to network boot from? For example, is it an Intel X710 or an on-board chipset?

> 4. worker reboots into legacy BIOS mode
> 
> 5. worker machines are scaled as expected
> 
> 6. workers are manually reconfigured for UEFI mode and rebooted

Can we get the precise steps for how this was performed? Was the BareMetalHost entry changed? If it was done that way, this would basically be an unsupportable action.
> 
> 7. in the openshift API, worker machines never reach "provisioned" state as
> per "oc -n openshift-machine-api get baremetalhosts"

So, it seems the machines are being re-deployed, which means they are being completely re-walked through the workflow.

Is the desired end state UEFI + IPMI power control?

> 
> 8. workers are manually rebooted

What exact state are the machines observed in before they are manually rebooted?

> 
> 9. when workers come up, they PXE boot, and the Ironic python agent receives
> the error "The following failures happened during running pre-processing
> hooks: Node not found hook failed: Port foo already exists, uuid: bar"


This is because the entire install workflow has been requested to be worked through again, from what I can tell.
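
As an illustration only, the error in step 9 can be pictured with a toy model. This is not Ironic's or ironic-inspector's actual code, and the class and method names are hypothetical. The assumption, read from the error text itself, is that a host whose NICs are already on record is treated as an unknown node when it PXE boots again, so registering its MAC addresses collides with the existing ports.

```python
class PortExistsError(Exception):
    """Raised when a MAC address is already registered to another node."""


class PortRegistry:
    """Toy inventory: each MAC address may belong to one node only."""

    def __init__(self):
        self._ports = {}  # normalized MAC address -> owning node name

    def enroll(self, node, macs):
        """Register all MACs for a node, or fail without changing anything."""
        for mac in macs:
            owner = self._ports.get(mac.lower())
            if owner is not None and owner != node:
                raise PortExistsError(
                    f"Port {mac} already exists, owner: {owner}"
                )
        for mac in macs:
            self._ports[mac.lower()] = node
```

In this model, enrolling "worker-0" with a MAC address and then enrolling a differently named node with the same MAC address raises PortExistsError, which mirrors the shape of the reported failure.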


> 
> 10. Workers eventually scale back down to 0, but they do not disappear from
> the output of "oc get bmh", nor from the ironic mariaDB database
> 
> 

It seems like the expectation is that the BMH would automatically lose records of machines that are not in use. That doesn't seem like a supportable behavior because, in essence, the cluster still owns the machine: no external infrastructure management system is being leveraged. That is, it can't work like the cloud, where machines are freed, because something still has to track the available resource inventory.
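
To make the inventory point concrete, here is a small sketch under stated assumptions: it takes the parsed JSON of "oc get bmh -o json" as input and assumes that a host in use by a Machine carries a spec.consumerRef while an idle host does not. The function name is hypothetical.

```python
def split_by_consumer(bmh_list):
    """Split BareMetalHost names into (in_use, idle) lists.

    Assumption: a host consumed by a Machine has spec.consumerRef set;
    an idle host stays in the list without that field.
    """
    in_use, idle = [], []
    for item in bmh_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        consumer = (item.get("spec") or {}).get("consumerRef")
        (in_use if consumer else idle).append(name)
    return in_use, idle
```

After scaling down to 0, the expected picture under these assumptions is that every host lands in the idle list while still being present in the inventory.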


Comment 23 Stephen Benjamin 2020-06-30 16:22:45 UTC

> @Amit, the customer stated they will try to reproduce with 4.4.5 and report back. However, I did not get anything back yet. I will update as soon as I have feedback.

Has the customer tried again with 4.4.5?