Bug 1836237 - [telco] OpenShift 4.4.3 Bare Metal IPI: Failed to perform inspection
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Julia Kreger
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-15 13:02 UTC by Jean-Francois Saucier
Modified: 2020-07-20 20:11 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-20 20:11:11 UTC
Target Upstream Version:
Embargoed:


Description Jean-Francois Saucier 2020-05-15 13:02:32 UTC
Description of the problem:
On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is observed with ironic-python-agent on HP ProLiant hardware:

1. worker node is powered off via IPMI

2. worker reboots automatically into UEFI mode

3. in UEFI mode, all NICs are exhausted

4. worker reboots into legacy BIOS mode

5. worker machines are scaled as expected

6. workers are manually reconfigured for UEFI mode and rebooted

7. in the OpenShift API, worker machines never reach the "provisioned" state as per "oc -n openshift-machine-api get baremetalhosts"

8. workers are manually rebooted

9. when workers come up, they PXE boot, and the Ironic Python Agent receives the error "The following failures happened during running pre-processing hooks: Node not found hook failed: Port foo already exists, uuid: bar"

10. workers eventually scale back down to 0, but they do not disappear from the output of "oc get bmh", nor from the Ironic MariaDB database
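
For anyone reproducing steps 7 and 10, here is a minimal sketch of how the host states could be checked from saved "oc -n openshift-machine-api get baremetalhosts -o json" output. It assumes the BareMetalHost objects report their state under status.provisioning.state. The function name and the sample data are hypothetical and not taken from the affected cluster.

```python
import json


def hosts_not_provisioned(bmh_list):
    """Return (name, state) pairs for hosts whose state is not 'provisioned'.

    bmh_list is the parsed JSON of a BareMetalHost list. The field path
    status.provisioning.state is an assumption about the object layout.
    """
    result = []
    for item in bmh_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        provisioning = item.get("status", {}).get("provisioning", {})
        state = provisioning.get("state") or "<unknown>"
        if state != "provisioned":
            result.append((name, state))
    return result


# Hypothetical sample, shaped like the JSON list output (not real cluster data).
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-0"},
     "status": {"provisioning": {"state": "provisioned"}}},
    {"metadata": {"name": "worker-1"},
     "status": {"provisioning": {"state": "inspecting"}}},
    {"metadata": {"name": "worker-2"}, "status": {}}
  ]
}
""")

for host, state in hosts_not_provisioned(SAMPLE):
    print(f"{host}: {state}")
```

Running this against output captured before and after the manual reboots would show which hosts are stuck and in which state.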


Version-Release number of the following components:
OpenShift 4.4.3


How reproducible:
Every time

Comment 8 Julia Kreger 2020-05-20 13:59:44 UTC
Two questions: (In reply to Jean-Francois Saucier from comment #0)
> Description of the problem:
> On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is exposed in
> ironic-python-agent on hardware of type HP Proliant :
> 
> 1. worker node is powered off via IPMI
> 
> 2. worker reboots automatically into UEFI mode
> 
> 3. in UEFI mode, all NICS are exhausted
> 

Can the user confirm that the NICs presented by the UEFI-mode firmware are shown as enabled for PXE?

Also, what sort of interface is the node trying to network boot from? For example, is it an Intel X710 or an on-board chipset?

> 4. worker reboots into legacy BIOS mode
> 
> 5. worker machines are scaled as expected
> 
> 6. workers are manually reconfigured for UEFI mode and rebooted

Can we get the precise steps for how this was performed? Was the BareMetalHost entry changed? If it was done that way, this would basically be an unsupportable action.
> 
> 7. in the openshift API, worker machines never reach "provisioned" state as
> per "oc -n openshift-machine-api get baremetalhosts"

So it seems the machines are being redeployed, which means they are being walked through the entire workflow again.

Is the desired end state UEFI + IPMI power control?

> 
> 8. workers are manually rebooted

What exactly is the state the machines are being observed in before being manually rebooted?

> 
> 9. when workers come up, they PXE boot, and the Ironic python agent receives
> the error "The following failures happened during running pre-processing
> hooks: Node not found hook failed: Port foo already exists, uuid: bar"


From what I can tell, this is because the entire install workflow has been requested to run again.


> 
> 10. Workers eventually scale back down to 0, but they do not disappear from
> the output of "oc get bmh", nor from the ironic mariaDB database
> 
> 

It seems like the expectation is that the BMH inventory would automatically lose the records of machines that are not in use. That doesn't seem like a supportable behavior because, in essence, the cluster still owns the machine: no external infrastructure management system is being leveraged. That is, it can't work like the cloud, where machines are freed, because something still has to track the available resource inventory.
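
To illustrate the distinction between a host being in use and a host being in inventory, here is a minimal sketch. It assumes that a BareMetalHost claimed by a Machine carries a spec.consumerRef, and that a host without one is still tracked but unclaimed. The function name and the sample data are hypothetical.

```python
def split_inventory(bmh_items):
    """Split BareMetalHost items into (in_use, unclaimed) lists of names.

    A host counts as in use when spec.consumerRef is set (an assumption
    about how a Machine claims a host). Unclaimed hosts remain in the
    inventory; they are not deleted.
    """
    in_use, unclaimed = [], []
    for item in bmh_items:
        name = item["metadata"]["name"]
        if item.get("spec", {}).get("consumerRef"):
            in_use.append(name)
        else:
            unclaimed.append(name)
    return in_use, unclaimed


# Hypothetical inventory after scaling workers down from 2 to 1.
ITEMS = [
    {"metadata": {"name": "worker-0"},
     "spec": {"consumerRef": {"kind": "Machine", "name": "example-worker-0"}}},
    {"metadata": {"name": "worker-1"}, "spec": {}},
]

in_use, unclaimed = split_inventory(ITEMS)
print("in use:", in_use)
print("still in inventory, unclaimed:", unclaimed)
```

Under these assumptions, scaling down moves a host from the first list to the second, but it never leaves the output of "oc get bmh".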

> Version-Release number of the following components:
> OpenShift 4.4.3
> 
> 
> How reproducible:
> Every time

Comment 23 Stephen Benjamin 2020-06-30 16:22:45 UTC
> @Amit, the customer stated they will try to reproduce with 4.4.5 and report back. However, I did not get anything back yet. I will update as soon as I have feedback.

Has the customer tried again with 4.4.5?

