Description of the problem: On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is exposed in ironic-python-agent on hardware of type HP Proliant : 1. worker node is powered off via IPMI 2. worker reboots automatically into UEFI mode 3. in UEFI mode, all NICS are exhausted 4. worker reboots into legacy BIOS mode 5. worker machines are scaled as expected 6. workers are manually reconfigured for UEFI mode and rebooted 7. in the openshift API, worker machines never reach "provisioned" state as per "oc -n openshift-machine-api get baremetalhosts" 8. workers are manually rebooted 9. when workers come up, they PXE boot, and the Ironic python agent receives the error "The following failures happened during running pre-processing hooks: Node not found hook failed: Port foo already exists, uuid: bar" 10. Workers eventually scale back down to 0, but they do not disappear from the output of "oc get bmh", nor from the ironic mariaDB database Version-Release number of the following components: OpenShift 4.4.3 How reproducible: Every time
Two questions: (In reply to Jean-Francois Saucier from comment #0) > Description of the problem: > On OpenShift 4.4.3 with Bare Metal IPI, the following behavior is exposed in > ironic-python-agent on hardware of type HP Proliant : > > 1. worker node is powered off via IPMI > > 2. worker reboots automatically into UEFI mode > > 3. in UEFI mode, all NICS are exhausted > Can the user confirm that the NICS represented by the UEFI mode firmware indicate that they are enabled for PXE. Also, what sort of interface is the node trying to network boot from? i.e. is it an Intel x710, or an on-board chipset? > 4. worker reboots into legacy BIOS mode > > 5. worker machines are scaled as expected > > 6. workers are manually reconfigured for UEFI mode and rebooted Can we get precise steps how this performed? Was the BareMetalHost entry changed? This would basically be an unsupportable action if performed that way. > > 7. in the openshift API, worker machines never reach "provisioned" state as > per "oc -n openshift-machine-api get baremetalhosts" So, it seems the machines are being re-deployed, which means they are being completely re-walked through the workflow. Is the desired end state UEFI + IPMI power control? > > 8. workers are manually rebooted What exactly is the state the machines are being observed in before being manually rebooted? > > 9. when workers come up, they PXE boot, and the Ironic python agent receives > the error "The following failures happened during running pre-processing > hooks: Node not found hook failed: Port foo already exists, uuid: bar" This is because the entire install workflow has been requested to be worked through again, from what I can tell. > > 10. Workers eventually scale back down to 0, but they do not disappear from > the output of "oc get bmh", nor from the ironic mariaDB database > > It seems like the expectation is that the BMH would automatically loose records of machines that are not in use. That doesn't seem like a supportable behavior as, in essence, the cluster still owns the machine because no external infrastructure management system is being leveraged. I.e. it can't work like the cloud where the machines are freed because something still has to track resource inventory available. > Version-Release number of the following components: > OpenShift 4.4.3 > > > How reproducible: > Every time
> @Amit, the customer stated they will try to reproduce with 4.4.5 and report back. However, I did not get anything back yet. I will update as soon as I have feedback. Has customer tried again with 4.4.5?