Bug 2052699

Summary: action "preparing" failed: error preparing host: Have unexpected ironic node state deploying
Product: OpenShift Container Platform
Reporter: Nahian <npathan>
Component: Bare Metal Hardware Provisioning
Assignee: Steven Hardy <shardy>
Bare Metal Hardware Provisioning sub component: baremetal-operator
QA Contact: Amit Ugol <augol>
Status: CLOSED NOTABUG
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: bverschu, zbitter
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-02-11 16:12:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: logs from metal3-xxxx -c metal3-baremetal-operator -n openshift-machine-api
Flags: none

Description Nahian 2022-02-09 19:43:41 UTC
Created attachment 1860204 [details]
logs from metal3-xxxx -c metal3-baremetal-operator -n openshift-machine-api - grep `bos2kyoung01` for the errors

Description of problem:
Errors show up for ZT hardware in the BOS2 lab because the state is out of sync.

Version-Release number of selected component (if applicable):


How reproducible:
50%. 


Actual results:
Errors in the log.


Expected results:
The BMH is applied correctly.

Additional info:
The issue was showing up for other hardware too, but the attached logs are for `bos2kyoung01`. In the UI, under the Bare Metal Host tab, the server is stuck at "preparing Powering on".

Comment 1 Zane Bitter 2022-02-11 16:12:05 UTC
Looking at the BMO log, the host was stuck in "provisioning" for a long time for reasons that are unclear (might be clearer from the ironic log). At some point the host is deleted and goes into the "deprovisioning" state, where it also gets stuck because we must wait for the ironic Node to become "active" before deprovisioning (according to https://docs.openstack.org/ironic/latest/_images/states.svg). If it reached "wait call-back" then we would deprovision, but when using the live-iso boot method, as here in ZTP, ironic presumably doesn't use this state.

At this point the log abruptly stops and starts again with a brand new host. I suspect this is because the finalizer was manually removed, allowing the host to be deleted without deprovisioning or deleting the ironic node. The result is that when a new host is created the ironic Node already exists and is still in the "deploying" state.

The way to work around this would be to delete the metal3 pod to clear the ironic database after force-deleting the Host (by manually removing the finalizer).
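
A minimal sketch of that workaround, for illustration only. It assumes the `oc` CLI is logged in to the affected cluster, that the Host lives in the `openshift-machine-api` namespace (as in the attached logs), and that `metal3-xxxx` is replaced with the real metal3 pod name. The helper names `workaround_commands` and `run_workaround` are hypothetical. By default the script only prints the commands; pass `dry_run=False` to actually run them.

```python
import json
import subprocess

NAMESPACE = "openshift-machine-api"  # namespace from the attached log description


def workaround_commands(host_name, metal3_pod):
    """Return the two `oc` commands for the workaround, in the order to run them."""
    # A merge patch that sets finalizers to null removes them all. If the Host
    # is already marked for deletion, the deletion then completes without
    # deprovisioning (this is the "force-delete" step).
    patch = json.dumps({"metadata": {"finalizers": None}})
    force_delete_host = [
        "oc", "patch", "baremetalhost", host_name,
        "-n", NAMESPACE, "--type", "merge", "-p", patch,
    ]
    # Deleting the metal3 pod is the step that clears the ironic database,
    # so the stale ironic Node in "deploying" goes away with it.
    delete_metal3_pod = ["oc", "delete", "pod", metal3_pod, "-n", NAMESPACE]
    return [force_delete_host, delete_metal3_pod]


def run_workaround(host_name, metal3_pod, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in workaround_commands(host_name, metal3_pod):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "metal3-xxxx" is the placeholder pod name used in this report.
    run_workaround("bos2kyoung01", "metal3-xxxx")
```

The order matters: removing the finalizer first and deleting the pod second means a re-created Host does not find the leftover ironic Node still in the "deploying" state.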