Bug 1874599 - Unable to delete machine and baremetalhost objects (stuck in "deleting")
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Doug Hellmann
QA Contact: Lubov
URL:
Whiteboard:
Depends On: 1875558
Blocks:
Reported: 2020-09-01 17:39 UTC by Lars Kellogg-Stedman
Modified: 2020-11-17 14:26 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-05 12:46:54 UTC
Target Upstream Version:


Attachments
output of 'oc get bmh os-mgr-1' (9.09 KB, text/plain)
2020-09-01 17:40 UTC, Lars Kellogg-Stedman
output of 'oc get machine cnv-master-1 -o yaml' (3.18 KB, text/plain)
2020-09-01 17:41 UTC, Lars Kellogg-Stedman
output of 'oc logs metal3-6c54fcf657-4nj67 -c metal3-baremetal-operator' (16.85 MB, text/plain)
2020-09-01 17:41 UTC, Lars Kellogg-Stedman


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | metal3-io baremetal-operator pull 614 | 0 | None | closed | Always delete found host from Ironic on deletion | 2021-01-11 09:38:53 UTC
Github | openshift baremetal-operator pull 97 | 0 | None | closed | Bug 1874599: Always delete found host from Ironic on deletion | 2021-01-11 09:38:53 UTC
Red Hat Product Errata | RHBA-2020:4325 | 0 | None | None | None | 2020-11-05 12:47:18 UTC

Description Lars Kellogg-Stedman 2020-09-01 17:39:29 UTC
Description of problem:

After simulating the failure of an openshift master on a 4.5.6 baremetal IPI cluster, both the machine and baremetal host objects are stuck in the "deleting" state:

$ oc get machine cnv-master-1
NAME           PHASE      TYPE   REGION   ZONE   AGE
cnv-master-1   Deleting                          20d
$ oc get baremetalhost os-mgr-1
NAME       STATUS   PROVISIONING STATUS   CONSUMER       BMC                HARDWARE PROFILE   ONLINE   ERROR
os-mgr-1   OK       deleting              cnv-master-1   ipmi://10.0.3.13                      false    

Version-Release number of selected component (if applicable):

openshift 4.5.6 baremetal IPI

Comment 1 Lars Kellogg-Stedman 2020-09-01 17:40:50 UTC
Created attachment 1713356
output of 'oc get bmh os-mgr-1'

Comment 2 Lars Kellogg-Stedman 2020-09-01 17:41:08 UTC
Created attachment 1713357
output of 'oc get machine cnv-master-1 -o yaml'

Comment 3 Lars Kellogg-Stedman 2020-09-01 17:41:43 UTC
Created attachment 1713358
output of 'oc logs metal3-6c54fcf657-4nj67 -c metal3-baremetal-operator'

Comment 4 Zane Bitter 2020-09-01 17:55:50 UTC
Thanks for the report. This looks like the same issue as in https://github.com/metal3-io/baremetal-operator/issues/482

This has been fixed upstream, but the fix is not in OpenShift 4.5.

Comment 5 Zane Bitter 2020-09-01 18:07:29 UTC
The host has this provisioning ID:

  provisioning:
    ID: 7b4c39c7-d161-4274-a180-1c6db77b7dcc

but this node doesn't exist in Ironic (though there does appear to be a node with the correct name), which is exactly the circumstances you'd expect to trigger the above bug.

The likely cause for this is that the metal3 pod has been restarted and the Ironic database rebuilt. The Host was externally provisioned, so this suggests that we are failing to update the provisioning ID when it changes, at least for an externally provisioned host.

Comment 6 Zane Bitter 2020-09-01 18:26:15 UTC
Currently the ID is only updated in the 'registering' state. An error in Ironic will force the Host into the registering state, but in this case there is no such error. If the node cannot be found by ID in Ironic, we look it up by name. This works fine, and therefore nothing forces the Host into registering and the ID never gets updated.
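The behaviour described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the operator's actual code (which is written in Go): `FakeIronic`, `Host`, `Node`, `reconcile`, the `"new-uuid"` value, and the state strings are all hypothetical names invented for illustration. It shows how a successful fallback lookup by name leaves the stale ID in place when the Host is not in the registering state.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    uuid: str
    name: str


@dataclass
class Host:
    name: str
    state: str            # hypothetical state label, e.g. "registering"
    provisioning_id: str  # the ID recorded in the Host's provisioning status


class FakeIronic:
    """In-memory stand-in for Ironic; not the real client API."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.uuid] = node

    def find(self, host: Host) -> Optional[Node]:
        # Look up by the recorded ID first, then fall back to the name.
        node = self.nodes.get(host.provisioning_id)
        if node is not None:
            return node
        return next((n for n in self.nodes.values() if n.name == host.name), None)


def reconcile(host: Host, ironic: FakeIronic) -> Optional[Node]:
    node = ironic.find(host)
    # The recorded ID is refreshed only while the Host is registering.
    if node is not None and host.state == "registering":
        host.provisioning_id = node.uuid
    return node


# Scenario from comment 5: the Ironic database was rebuilt, so a node with
# the right name exists but under a different ID ("new-uuid" is made up).
ironic = FakeIronic()
ironic.add(Node(uuid="new-uuid", name="os-mgr-1"))
host = Host(name="os-mgr-1", state="externally provisioned",
            provisioning_id="7b4c39c7-d161-4274-a180-1c6db77b7dcc")

found = reconcile(host, ironic)
stale = found is not None and found.uuid != host.provisioning_id
print(f"found by name: {found is not None}, recorded ID stale: {stale}")
```

In this model the lookup succeeds on every pass, so no error is raised and nothing moves the Host back to registering; the recorded ID only changes if the state is set to "registering" and `reconcile` runs again.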

Comment 7 Zane Bitter 2020-09-01 21:07:04 UTC
(In reply to Zane Bitter from comment #6)
> Currently the ID is only updated in the 'registering' state. An error in
> Ironic will force the Host into the registering state, but in this case
> there is no such error. If the node cannot be found by ID in Ironic, we look
> it up by name. This works fine, and therefore nothing forces the Host into
> registering and the ID never gets updated.

This explains why the incorrect ID is able to persist for a long time, but not why it isn't being set when we first create the replacement node in Ironic.

Comment 11 Lubov 2020-10-27 10:24:13 UTC
(In reply to Lars Kellogg-Stedman from comment #0)
> Description of problem:
> 
> After simulating the failure of an openshift master on a 4.5.6 baremetal IPI
> cluster, both the machine and baremetal host objects are stuck in the
> "deleting" state:

Could you please provide the steps to reproduce the problem?

Comment 12 Lars Kellogg-Stedman 2020-10-27 12:11:57 UTC
I don't have a specific reproducer at this time. I'll see if I can try the same procedure a second time, but it may be a week or so before I'm able to schedule that on the cluster.

Comment 13 Lars Kellogg-Stedman 2020-10-27 16:30:08 UTC
> I didn't expect the exact reproducer, but would really appreciate if U could explain what
> did U mean by "After simulating the failure of an openshift master" 
> Did U destroy a master? Did U simply delete bmh of master?

We destroyed the master (wipedisk -fa, power off), then attempted to delete the corresponding node, bmh, and machine objects. It was at this point that the process became stuck.

Comment 14 Lubov 2020-10-28 15:45:00 UTC
Verified on a virtual emulation of bare-metal IPI, for both Redfish and IPMI.

4.5.0-0.nightly-2020-10-25-174204

Steps to reproduce:
1. Deploy a cluster
2. Destroy the virtual machine for a master and wait until the corresponding node becomes NotReady
3. Delete the node
$ oc delete node master-0
4. Delete the corresponding bmh
$ oc delete bmh -n openshift-machine-api openshift-master-0
5. Wait until the bmh is deleted (can take a few minutes)
6. Verify that the machine is deleted as well
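The wait in step 5 could be scripted rather than checked by hand. The following is a minimal sketch, not part of the verification that was actually run: `bmh_exists` and `wait_until_deleted` are hypothetical helper names, the 600-second timeout and 10-second interval are arbitrary choices for "a few minutes", and treating a "NotFound" string in the stderr of `oc get` as proof of deletion is an assumption about the client's error message.

```python
import subprocess
import time
from typing import Callable


def bmh_exists(name: str, namespace: str = "openshift-machine-api") -> bool:
    """Return True while `oc get bmh` still finds the host."""
    result = subprocess.run(
        ["oc", "get", "bmh", "-n", namespace, name],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return True
    # Assumption: a missing object is reported with "NotFound" on stderr.
    if "NotFound" in result.stderr:
        return False
    # Any other failure (login, network, ...) is not evidence of deletion.
    raise RuntimeError(result.stderr.strip())


def wait_until_deleted(exists: Callable[[], bool], timeout: float = 600.0,
                       interval: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until exists() is False; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if not exists():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call against a live cluster (not executed here):
#   wait_until_deleted(lambda: bmh_exists("openshift-master-0"))
```

A return value of False would correspond to the stuck "deleting" state from the original report.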

Comment 16 errata-xmlrpc 2020-11-05 12:46:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.17 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4325

