Bug 1905826
Summary: | When deleting a provisioned bmh it gets stuck on "deprovisioning" | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Daniel <dmaizel>
Component: | Bare Metal Hardware Provisioning | Assignee: | Tomas Sedovic <tsedovic>
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | bfournie, lshilin, stbenjam, zbitter
Version: | 4.7 | |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-12-14 02:47:48 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Daniel 2020-12-09 07:44:39 UTC
Daniel: There were a number of fixes related to this that went into 4.7; I think they landed in releases after 12/4. Can you try again with something newer?

@zbitter Is this what was recently fixed? It looks similar to what you demoed.

No, this had nothing to do with CAPBM. Looking at the baremetal-operator log, I see:

action "deprovisioning" failed: failed to deprovision: failed to change provisioning state to "deleted": Bad request with: [PUT http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states/provision], error message: {"error_message": "{\"faultcode\": \"Client\", \"faultstring\": \"The requested action \\\"deleted\\\" can not be performed on node \\\"795350cd-26cd-4286-8189-8941e4f00319\\\" while it is in state \\\"clean failed\\\".\", \"debuginfo\": null}"}

So cleaning failed for some reason and we didn't retry. The retry is fixed in https://github.com/metal3-io/baremetal-operator/pull/737 (merged downstream in https://github.com/openshift/baremetal-operator/pull/114). I can't say why cleaning initially failed, though. The original error message from ironic is simply: "Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}."

Thanks Zane! @Daniel: If you still happen to have this cluster around, or can reproduce it with that release from the 4th, grabbing a screenshot of the host's console might give some clues as to what went wrong.

It looks like the "clean failed" issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1901040, which is due to a PXE failure with dnsmasq when the link-local IPv6 address is removed on the master node that is running dnsmasq. The evidence is in quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0605b2f28717288cd0ddd81f15536e0053508a6d3239925cc767738c5ed8f2dd/namespaces/openshift-machine-api/pods/metal3-779c7bc555-fbvlg/metal3-dnsmasq/metal3-dnsmasq/logs/current_log. When the link-local address is removed, "failed to join DHCPv6 multicast group" is logged and the subsequent DHCPv6 queries fail:

2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe44:cf4f%enp4s0
2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
2020-12-08T16:35:26.789099344Z dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-08T16:35:28.233649351Z dnsmasq: listening on enp4s0(#2): fd00:1101::3

We are still investigating this. I'll mark this as a duplicate of 1901040.

*** This bug has been marked as a duplicate of bug 1901040 ***

Could not replicate this behavior since
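For reference when debugging a host stuck this way, below is a minimal sketch of inspecting the ironic API referenced in the error message above by hand. It assumes the in-cluster ironic endpoint at http://localhost:6385 is reachable without authentication (which may not hold on a real deployment) and reuses the node UUID from the log purely as an example; the "manage"/"provide" recovery path shown is the generic ironic workflow, not the retry logic that landed in the baremetal-operator PRs.

```python
# Sketch: query the node's provision_state and, if it is stuck in
# "clean failed", move it back to "manageable" and re-run cleaning.
# Assumes ironic at localhost:6385 with no auth (an assumption).
import time
import requests

IRONIC = "http://localhost:6385"                    # endpoint from the log above
NODE = "795350cd-26cd-4286-8189-8941e4f00319"       # node UUID from the log above
HEADERS = {"X-OpenStack-Ironic-API-Version": "1.56"}


def provision_state():
    """Return the node's current provision_state."""
    return requests.get(f"{IRONIC}/v1/nodes/{NODE}",
                        headers=HEADERS).json().get("provision_state")


def set_target(target):
    """Request a provision-state change (the same PUT the operator issues)."""
    r = requests.put(f"{IRONIC}/v1/nodes/{NODE}/states/provision",
                     json={"target": target}, headers=HEADERS)
    r.raise_for_status()


print("current provision_state:", provision_state())

if provision_state() == "clean failed":
    # Ironic rejects "deleted" while the node is in "clean failed"; it first
    # has to be moved back to "manageable", then cleaning can be re-run.
    set_target("manage")
    for _ in range(60):                 # state transitions are asynchronous
        if provision_state() == "manageable":
            break
        time.sleep(5)
    set_target("provide")               # manageable -> cleaning -> available
```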
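The dnsmasq failure mode described above can also be checked directly on the master node running the metal3 pod. A small sketch, assuming a Linux host with iproute2 and using the interface name from the log (enp4s0) only as an example, that reports whether the interface still has the fe80:: link-local address dnsmasq needs for DHCPv6:

```python
# Diagnostic sketch: check whether the dnsmasq interface still has an IPv6
# link-local (fe80::/10) address; without one, dnsmasq cannot join the DHCPv6
# multicast group, matching the log lines quoted above.
import subprocess

IFACE = "enp4s0"  # interface name from the log; adjust for the actual host

out = subprocess.run(
    ["ip", "-6", "addr", "show", "dev", IFACE],
    capture_output=True, text=True, check=True,
).stdout

link_local = [
    line.split()[1]
    for line in out.splitlines()
    if line.strip().startswith("inet6 fe80:")
]

if link_local:
    print(f"{IFACE} has link-local address(es): {', '.join(link_local)}")
else:
    print(f"{IFACE} has no fe80:: link-local address -- "
          "dnsmasq cannot serve DHCPv6 on this interface")
```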