Description of problem:
While trying to delete a bmh via:
  oc delete bmh openshift-worker-0-0 -n openshift-machine-api
the session hangs.

Version-Release number of selected component (if applicable):
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-04-013308    True        False         15h     Cluster version is 4.7.0-0.nightly-2020-12-04-013308

How reproducible:
Constantly

Steps to Reproduce:
1. Delete any one of the provisioned bmh's:
   oc delete bmh/openshift-worker-0-0 -n openshift-machine-api
2.
3.

Actual results:
bmh is not deleted

Expected results:
bmh should be deleted

Additional info:
must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1905826
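While the delete hangs, the BareMetalHost status usually shows where it is stuck. A minimal check, assuming the host name and namespace from the report above (nothing cluster-specific beyond that):

# Provisioning state and any error recorded by the baremetal-operator
oc get bmh openshift-worker-0-0 -n openshift-machine-api \
  -o jsonpath='{.status.provisioning.state}{"\n"}{.status.errorMessage}{"\n"}'

# The finalizer that keeps the object around until deprovisioning completes
oc get bmh openshift-worker-0-0 -n openshift-machine-api \
  -o jsonpath='{.metadata.finalizers}{"\n"}'

If the state stays in "deprovisioning" or an error is reported, the baremetal-operator logs in the metal3 pod are the next place to look.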
Daniel: There were a number of fixes related to this that went into 4.7; I think they landed in releases after 12/4. Can you try again with something newer? @zbitter Is this what was recently fixed? It looks similar to what you demoed.
No, this had nothing to do with CAPBM. Looking at the baremetal-operator log, I see:

action "deprovisioning" failed: failed to deprovision: failed to change provisioning state to "deleted": Bad request with: [PUT http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states/provision], error message: {"error_message": "{\"faultcode\": \"Client\", \"faultstring\": \"The requested action \\\"deleted\\\" can not be performed on node \\\"795350cd-26cd-4286-8189-8941e4f00319\\\" while it is in state \\\"clean failed\\\".\", \"debuginfo\": null}"}

So cleaning failed for some reason and we didn't retry. The retry is fixed in https://github.com/metal3-io/baremetal-operator/pull/737 (merged downstream in https://github.com/openshift/baremetal-operator/pull/114).

I can't say why cleaning initially failed, though. The original error message from Ironic is simply: "Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}."
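For reference, one way to confirm the provision state Ironic is reporting (a rough sketch, not a supported procedure: the node UUID comes from the log above, the pod/container names are placeholders for the running metal3 pod, and auth/TLS settings can differ by release):

# Query the node's provision state from inside the metal3 pod, where the
# Ironic API is reachable on localhost:6385 (as in the PUT URL above)
oc exec -n openshift-machine-api <metal3-pod> -c <ironic-api-container> -- \
  curl -s http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states

# A host stuck like this one would report provision_state "clean failed" and
# the cleaning timeout in last_error.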
Thanks Zane! @Daniel: If you still happen to have this cluster around, or can reproduce it with that release from the 4th, grabbing a screenshot of the host's console might give some clues as to what went wrong.
It looks like the "clean failed" issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1901040, which is due to a PXE failure with dnsmasq when the IPv6 link-local address is removed on the master node that is running dnsmasq.

The evidence is in quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0605b2f28717288cd0ddd81f15536e0053508a6d3239925cc767738c5ed8f2dd/namespaces/openshift-machine-api/pods/metal3-779c7bc555-fbvlg/metal3-dnsmasq/metal3-dnsmasq/logs/current_log. When the link-local address is removed, "failed to join DHCPv6 multicast group" is logged and the subsequent DHCPv6 queries fail:

2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe44:cf4f%enp4s0
2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
2020-12-08T16:35:26.789099344Z dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-08T16:35:28.233649351Z dnsmasq: listening on enp4s0(#2): fd00:1101::3

We are still investigating this. A quick way to check for this condition is sketched below. I'll mark this as a duplicate of bug 1901040.

*** This bug has been marked as a duplicate of bug 1901040 ***
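For anyone hitting the same symptom, a rough check (the interface and pod names are taken from the log above and will differ in other environments):

# On the master node running dnsmasq, confirm the IPv6 link-local address is
# still present on the provisioning interface
ip -6 addr show dev enp4s0 scope link

# Watch the dnsmasq container for the "failed to join DHCPv6 multicast group" message
oc logs -n openshift-machine-api metal3-779c7bc555-fbvlg -c metal3-dnsmasq | grep -i dhcpv6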
Could not replicate this behavior since