Description of problem:
After simulating the failure of an openshift master on a 4.5.6 baremetal IPI cluster, both the machine and baremetal host objects are stuck in the "deleting" state:
$ oc get machine cnv-master-1
NAME           PHASE      TYPE   REGION   ZONE   AGE
cnv-master-1   Deleting                          20d
$ oc get baremetalhost os-mgr-1
NAME       STATUS   PROVISIONING STATUS   CONSUMER       BMC                HARDWARE PROFILE   ONLINE   ERROR
os-mgr-1   OK       deleting              cnv-master-1   ipmi://10.0.3.13                      false
Version-Release number of selected component (if applicable):
openshift 4.5.6 baremetal IPI
Created attachment 1713356
output of 'oc get bmh os-mgr-1'
Created attachment 1713357
output of 'oc get machine cnv-master-1 -o yaml'
Created attachment 1713358
output of 'oc logs metal3-6c54fcf657-4nj67 -c metal3-baremetal-operator'
Thanks for the report. This looks like the same issue as in https://github.com/metal3-io/baremetal-operator/issues/482
This has been fixed upstream, but the fix is not in OpenShift 4.5.
The host has this provisioning ID:
but this node doesn't exist in Ironic (though there does appear to be a node with the correct name), which is exactly the circumstances you'd expect to trigger the above bug.
The likely cause for this is that the metal3 pod has been restarted and the Ironic database rebuilt. The Host was externally provisioned, so this suggests that we are failing to update the provisioning ID when it changes, at least for an externally provisioned host.
Currently the ID is only updated in the 'registering' state. An error in Ironic will force the Host into the registering state, but in this case there is no such error. If the node cannot be found by ID in Ironic, we look it up by name. This works fine, and therefore nothing forces the Host into registering and the ID never gets updated.
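The behaviour described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the baremetal-operator code: the names `Node`, `Host`, `FakeIronic`, `find_node` and `reconcile`, and the IDs "old-id" and "new-id", are hypothetical and exist only to show why a successful lookup by name leaves a stale provisioning ID in place.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an Ironic node."""
    uuid: str
    name: str


@dataclass
class Host:
    """Hypothetical stand-in for a BareMetalHost."""
    name: str
    provisioning_id: str
    state: str = "externally provisioned"


class FakeIronic:
    """In-memory model of Ironic, indexed by UUID and by name."""

    def __init__(self, nodes: List[Node]) -> None:
        self.by_uuid: Dict[str, Node] = {n.uuid: n for n in nodes}
        self.by_name: Dict[str, Node] = {n.name: n for n in nodes}


def find_node(ironic: FakeIronic, host: Host) -> Optional[Node]:
    """Look the node up by provisioning ID, then fall back to the host name."""
    node = ironic.by_uuid.get(host.provisioning_id)
    if node is None:
        node = ironic.by_name.get(host.name)
    return node


def reconcile(ironic: FakeIronic, host: Host) -> Optional[Node]:
    """One simplified reconcile pass.

    Assumption: only a failed lookup counts as an error that forces the
    host into 'registering', and the ID is only written in that state.
    """
    node = find_node(ironic, host)
    if node is None:
        host.state = "registering"
        return None
    if host.state == "registering":
        host.provisioning_id = node.uuid
    return node


# Ironic database rebuilt: same node name, new UUID.
ironic = FakeIronic([Node(uuid="new-id", name="os-mgr-1")])
host = Host(name="os-mgr-1", provisioning_id="old-id")

reconcile(ironic, host)
print(host.provisioning_id)  # still the stale ID, because the name lookup succeeded
```

In this model the stale ID is only replaced if something puts the host into the registering state, which matches the observation that nothing currently does so when the lookup by name succeeds.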
(In reply to Zane Bitter from comment #6)
> Currently the ID is only updated in the 'registering' state. An error in
> Ironic will force the Host into the registering state, but in this case
> there is no such error. If the node cannot be found by ID in Ironic, we look
> it up by name. This works fine, and therefore nothing forces the Host into
> registering and the ID never gets updated.
This explains why the incorrect ID is able to persist for a long time, but not why it isn't set when we first create the replacement node in Ironic.
(In reply to Lars Kellogg-Stedman from comment #0)
> Description of problem:
> After simulating the failure of an openshift master on a 4.5.6 baremetal IPI
> cluster, both the machine and baremetal host objects are stuck in the
> "deleting" state:
Could you please provide the steps for reproducing the problem?
I don't have a specific reproducer at this time. I'll see if I can try the same procedure a second time, but it may be a week or so before I'm able to schedule that on the cluster.
> I didn't expect the exact reproducer, but would really appreciate it if you could
> explain what you meant by "After simulating the failure of an openshift master".
> Did you destroy a master? Did you simply delete the bmh of the master?
We destroyed the master (wipedisk -fa, power off), then attempted to delete the corresponding node, bmh, and machine objects. It was at this point that the process became stuck.
Verified on a virtual emulation of IPI bare metal, for both Redfish and IPMI.
Steps to reproduce:
1. Deploy a cluster
2. Destroy the virtual machine for a master and wait until the corresponding node becomes NotReady
3. Delete the node
$ oc delete node master-0
4. Delete the corresponding bmh
$ oc delete bmh -n openshift-machine-api openshift-master-0
5. Wait till bmh is deleted (can take a few minutes)
6. Verify the machine is deleted as well
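Steps 5 and 6 involve waiting, so they could be scripted with a small polling helper. The following is a minimal sketch, not part of the original verification: the function names `resource_exists` and `wait_until_deleted`, and the timeout and interval defaults, are assumptions. It relies only on `oc get ... -o name --ignore-not-found`, which prints nothing when the object is gone.

```python
import subprocess
import time
from typing import Callable, Optional


def resource_exists(kind: str, name: str, namespace: Optional[str] = None,
                    run: Callable = subprocess.run) -> bool:
    """Return True if 'oc get' still reports the object."""
    cmd = ["oc", "get", kind, name, "-o", "name", "--ignore-not-found"]
    if namespace:
        cmd += ["-n", namespace]
    result = run(cmd, capture_output=True, text=True, check=True)
    # With --ignore-not-found, a missing object yields empty output.
    return bool(result.stdout.strip())


def wait_until_deleted(kind: str, name: str, namespace: Optional[str] = None,
                       timeout: float = 600.0, interval: float = 10.0,
                       run: Callable = subprocess.run,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the object disappears; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if not resource_exists(kind, name, namespace, run):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

For step 5 this could be called as wait_until_deleted("bmh", "openshift-master-0", "openshift-machine-api"), followed by a similar call for the corresponding machine object in step 6. A return value of False would indicate that the object is still present after the timeout, which is the stuck "deleting" symptom reported in this bug.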
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.5.17 bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.