Description of problem:
While trying to delete a bmh via:
  oc delete bmh openshift-worker-0-0 -n openshift-machine-api
the session hangs.

Version-Release number of selected component (if applicable):
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-04-013308    True        False         15h     Cluster version is 4.7.0-0.nightly-2020-12-04-013308

How reproducible:
Constantly

Steps to Reproduce:
1. Delete any one of the provisioned bmh's:
   oc delete bmh/openshift-worker-0-0 -n openshift-machine-api
2.
3.

Actual results:
bmh is not deleted

Expected results:
bmh should be deleted

Additional info:
must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather-bz1905826
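While the delete hangs, the BareMetalHost status usually shows where it is stuck. A minimal check, assuming the host name and namespace from the report above (nothing cluster-specific beyond that):

# Provisioning state and any error recorded by the baremetal-operator
oc get bmh openshift-worker-0-0 -n openshift-machine-api \
  -o jsonpath='{.status.provisioning.state}{"\n"}{.status.errorMessage}{"\n"}'

# The finalizer that keeps the object around until deprovisioning completes
oc get bmh openshift-worker-0-0 -n openshift-machine-api \
  -o jsonpath='{.metadata.finalizers}{"\n"}'

If the state stays in "deprovisioning" or an error is reported, the baremetal-operator logs in the metal3 pod are the next place to look.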
Daniel: There were a number of fixes related to this that went into 4.7; I think they landed in releases after 12/4. Can you try again with something newer? @zbitter Is this what was recently fixed? It looks similar to what you demoed.
No, this had nothing to do with CAPBM. Looking at the baremetal-operator log, I see:

action "deprovisioning" failed: failed to deprovision: failed to change provisioning state to "deleted": Bad request with: [PUT http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states/provision], error message: {"error_message": "{\"faultcode\": \"Client\", \"faultstring\": \"The requested action \\\"deleted\\\" can not be performed on node \\\"795350cd-26cd-4286-8189-8941e4f00319\\\" while it is in state \\\"clean failed\\\".\", \"debuginfo\": null}"}

So cleaning failed for some reason and we didn't retry. The retry is fixed in https://github.com/metal3-io/baremetal-operator/pull/737 (merged downstream in https://github.com/openshift/baremetal-operator/pull/114).

I can't say why cleaning initially failed, though. The original error message from Ironic is simply: "Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}."
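For reference, one way to confirm the provision state Ironic is reporting (a rough sketch, not a supported procedure: the node UUID comes from the log above, the pod/container names are placeholders for the running metal3 pod, and auth/TLS settings can differ by release):

# Query the node's provision state from inside the metal3 pod, where the
# Ironic API is reachable on localhost:6385 (as in the PUT URL above)
oc exec -n openshift-machine-api <metal3-pod> -c <ironic-api-container> -- \
  curl -s http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states

# A host stuck like this one would report provision_state "clean failed" and
# the cleaning timeout in last_error.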
Thanks Zane! @Daniel: If you still happen to have this cluster around, or can reproduce it with that release from the 4th, grabbing a screenshot of the host's console might give some clues as to what went wrong.
It looks like the "clean failed" issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1901040, which is due to a PXE failure with dnsmasq when the IPv6 link-local address is removed on the master node that is running dnsmasq.

The evidence is in quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0605b2f28717288cd0ddd81f15536e0053508a6d3239925cc767738c5ed8f2dd/namespaces/openshift-machine-api/pods/metal3-779c7bc555-fbvlg/metal3-dnsmasq/metal3-dnsmasq/logs/current_log. When the link-local address is removed, "failed to join DHCPv6 multicast group" is logged and the subsequent DHCPv6 queries fail:

2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe44:cf4f%enp4s0
2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
2020-12-08T16:35:26.789099344Z dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-08T16:35:28.233649351Z dnsmasq: listening on enp4s0(#2): fd00:1101::3

We are still investigating this. A quick way to check for this condition is sketched below. I'll mark this as a duplicate of bug 1901040.

*** This bug has been marked as a duplicate of bug 1901040 ***
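For anyone hitting the same symptom, a rough check (the interface and pod names are taken from the log above and will differ in other environments):

# On the master node running dnsmasq, confirm the IPv6 link-local address is
# still present on the provisioning interface
ip -6 addr show dev enp4s0 scope link

# Watch the dnsmasq container for the "failed to join DHCPv6 multicast group" message
oc logs -n openshift-machine-api metal3-779c7bc555-fbvlg -c metal3-dnsmasq | grep -i dhcpv6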
Could not replicate this behavior since