Bug 1874599

Summary: Unable to delete machine and baremetalhost objects (stuck in "deleting")
Product: OpenShift Container Platform Reporter: Lars Kellogg-Stedman <lars>
Component: Bare Metal Hardware Provisioning Assignee: Doug Hellmann <dhellmann>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: lshilin, sdasu, zbitter
Version: 4.5 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-11-05 12:46:54 UTC Type: Bug
Bug Depends On: 1875558    
Bug Blocks:    
Attachments:
Description                                                                   Flags
output of 'oc get bmh os-mgr-1'                                               none
output of 'oc get machine cnv-master-1 -o yaml'                               none
output of 'oc logs metal3-6c54fcf657-4nj67 -c metal3-baremetal-operator'      none

Description Lars Kellogg-Stedman 2020-09-01 17:39:29 UTC
Description of problem:

After simulating the failure of an openshift master on a 4.5.6 baremetal IPI cluster, both the machine and baremetal host objects are stuck in the "deleting" state:

$ oc get machine cnv-master-1
NAME           PHASE      TYPE   REGION   ZONE   AGE
cnv-master-1   Deleting                          20d
$ oc get baremetalhost os-mgr-1
NAME       STATUS   PROVISIONING STATUS   CONSUMER       BMC                HARDWARE PROFILE   ONLINE   ERROR
os-mgr-1   OK       deleting              cnv-master-1   ipmi://10.0.3.13                      false    

Version-Release number of selected component (if applicable):

openshift 4.5.6 baremetal IPI

Comment 1 Lars Kellogg-Stedman 2020-09-01 17:40:50 UTC
Created attachment 1713356
output of 'oc get bmh os-mgr-1'

Comment 2 Lars Kellogg-Stedman 2020-09-01 17:41:08 UTC
Created attachment 1713357
output of 'oc get machine cnv-master-1 -o yaml'

Comment 3 Lars Kellogg-Stedman 2020-09-01 17:41:43 UTC
Created attachment 1713358
output of 'oc logs metal3-6c54fcf657-4nj67 -c metal3-baremetal-operator'

Comment 4 Zane Bitter 2020-09-01 17:55:50 UTC
Thanks for the report. This looks like the same issue as in https://github.com/metal3-io/baremetal-operator/issues/482

This has been fixed upstream, but the fix is not in OpenShift 4.5.

Comment 5 Zane Bitter 2020-09-01 18:07:29 UTC
The host has this provisioning ID:

  provisioning:
    ID: 7b4c39c7-d161-4274-a180-1c6db77b7dcc

but no node with this ID exists in Ironic (though there does appear to be a node with the correct name). These are exactly the circumstances you'd expect to trigger the above bug.

The likely cause for this is that the metal3 pod has been restarted and the Ironic database rebuilt. The Host was externally provisioned, so this suggests that we are failing to update the provisioning ID when it changes, at least for an externally provisioned host.

Comment 6 Zane Bitter 2020-09-01 18:26:15 UTC
Currently the ID is only updated in the 'registering' state. An error in Ironic will force the Host into the registering state, but in this case there is no such error. If the node cannot be found by ID in Ironic, we look it up by name. This works fine, and therefore nothing forces the Host into registering and the ID never gets updated.
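
A minimal sketch of the behavior described above, written as a toy Python model rather than the operator's actual Go code. The names Node, Host, FakeIronic and reconcile are all hypothetical and exist only for illustration; the real lookup and state machine live in baremetal-operator and may differ in detail.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    uuid: str
    name: str


@dataclass
class Host:
    name: str
    provisioning_id: str
    state: str  # e.g. "externally provisioned", "registering", "deleting"


class FakeIronic:
    """Hypothetical stand-in for the Ironic API, keyed by node UUID."""

    def __init__(self, nodes: Iterable[Node]) -> None:
        self.by_uuid = {n.uuid: n for n in nodes}

    def find(self, host: Host) -> Optional[Node]:
        # Look the node up by the stored provisioning ID first ...
        node = self.by_uuid.get(host.provisioning_id)
        if node is not None:
            return node
        # ... then fall back to the name. This succeeds without any error,
        # so nothing signals that the stored ID is stale.
        for candidate in self.by_uuid.values():
            if candidate.name == host.name:
                return candidate
        return None


def reconcile(host: Host, ironic: FakeIronic) -> Optional[Node]:
    node = ironic.find(host)
    # Assumption taken from the comment above: only the 'registering'
    # state writes the node's ID back to the Host.
    if host.state == "registering" and node is not None:
        host.provisioning_id = node.uuid
    return node


# Ironic database rebuilt: same node name, new UUID (made-up value).
ironic = FakeIronic([Node(uuid="new-uuid", name="os-mgr-1")])
host = Host(
    name="os-mgr-1",
    provisioning_id="7b4c39c7-d161-4274-a180-1c6db77b7dcc",
    state="externally provisioned",
)
found = reconcile(host, ironic)
# The node is found by name, but host.provisioning_id still holds the old ID.
```

In this model the stale ID survives every reconcile until something moves the Host into 'registering', which matches the persistence described in the comment.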

Comment 7 Zane Bitter 2020-09-01 21:07:04 UTC
(In reply to Zane Bitter from comment #6)
> Currently the ID is only updated in the 'registering' state. An error in
> Ironic will force the Host into the registering state, but in this case
> there is no such error. If the node cannot be found by ID in Ironic, we look
> it up by name. This works fine, and therefore nothing forces the Host into
> registering and the ID never gets updated.

This explains why the incorrect ID can persist for a long time, but not why the ID isn't set when we first create the replacement Node in Ironic.

Comment 11 Lubov 2020-10-27 10:24:13 UTC
(In reply to Lars Kellogg-Stedman from comment #0)
> Description of problem:
> 
> After simulating the failure of an openshift master on a 4.5.6 baremetal IPI
> cluster, both the machine and baremetal host objects are stuck in the
> "deleting" state:

Could you please provide the steps to reproduce the problem?

Comment 12 Lars Kellogg-Stedman 2020-10-27 12:11:57 UTC
I don't have a specific reproducer at this time. I'll see if I can try the same procedure a second time, but it may be a week or so before I'm able to schedule that on the cluster.

Comment 13 Lars Kellogg-Stedman 2020-10-27 16:30:08 UTC
> I didn't expect the exact reproducer, but would really appreciate if U could explain what
> did U mean by "After simulating the failure of an openshift master" 
> Did U destroy a master? Did U simply delete bmh of master?

We destroyed the master (wipedisk -fa, power off), then attempted to delete the corresponding node, bmh, and machine objects. It was at this point that the process became stuck.

Comment 14 Lubov 2020-10-28 15:45:00 UTC
Verified on a virtual emulation of IPI bare metal, for both redfish and ipmi

4.5.0-0.nightly-2020-10-25-174204

Steps to reproduce:
1. Deploy a cluster
2. Destroy the virtual machine for a master and wait until the corresponding node becomes NotReady
3. Delete the node
$ oc delete node master-0
4. Delete the corresponding bmh
$ oc delete bmh -n openshift-machine-api openshift-master-0
5. Wait until the bmh is deleted (this can take a few minutes)
6. Verify the machine is deleted as well
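
The waiting in steps 5 and 6 could be scripted roughly as follows. This is a hedged sketch, not part of the verification that was actually run: the helper names (bmh_get_cmd, run_oc, wait_until_gone) and the 600 second timeout and 10 second poll interval are assumptions chosen for illustration. It relies on `oc get ... --ignore-not-found -o name` printing nothing once the object is gone.

```python
import subprocess
import time
from typing import Callable, List


def bmh_get_cmd(name: str, namespace: str = "openshift-machine-api") -> List[str]:
    # --ignore-not-found makes `oc get` print nothing and exit 0
    # once the object no longer exists.
    return ["oc", "get", "bmh", name, "-n", namespace,
            "--ignore-not-found", "-o", "name"]


def run_oc(cmd: List[str]) -> str:
    # Runs the command and returns its trimmed stdout; raises on non-zero exit.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def wait_until_gone(
    cmd: List[str],
    runner: Callable[[List[str]], str] = run_oc,
    timeout: float = 600.0,   # assumed upper bound for "a few minutes"
    interval: float = 10.0,   # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the command prints nothing; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if runner(cmd) == "":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_gone(bmh_get_cmd("openshift-master-0")) after step 4 would cover step 5; the same function can be given an equivalent `oc get machine` command for step 6. The runner, sleep and clock parameters exist only so the loop can be exercised without a cluster.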

Comment 16 errata-xmlrpc 2020-11-05 12:46:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.17 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4325