Bug 1905826
Summary: | When deleting a provisioned bmh it gets stuck on "deprovisioning" | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Daniel <dmaizel>
Component: | Bare Metal Hardware Provisioning | Assignee: | Tomas Sedovic <tsedovic>
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | bfournie, lshilin, stbenjam, zbitter
Version: | 4.7 | |
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-12-14 02:47:48 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Daniel 2020-12-09 07:44:39 UTC
Daniel: There were a number of fixes related to this that went into 4.7; I think they landed in releases after 12/4. Can you try again with something newer?

@zbitter Is this what was recently fixed? It looks similar to what you demoed.

No, this had nothing to do with CAPBM. Looking at the baremetal-operator log, I see:

action "deprovisioning" failed: failed to deprovision: failed to change provisioning state to "deleted": Bad request with: [PUT http://localhost:6385/v1/nodes/795350cd-26cd-4286-8189-8941e4f00319/states/provision], error message: {"error_message": "{\"faultcode\": \"Client\", \"faultstring\": \"The requested action \\\"deleted\\\" can not be performed on node \\\"795350cd-26cd-4286-8189-8941e4f00319\\\" while it is in state \\\"clean failed\\\".\", \"debuginfo\": null}"}

So cleaning failed for some reason and we didn't retry. The retry is fixed in https://github.com/metal3-io/baremetal-operator/pull/737 (merged downstream in https://github.com/openshift/baremetal-operator/pull/114). I can't say why cleaning initially failed, though. The original error message from ironic is simply: "Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}."

Thanks Zane! @Daniel: If you still happen to have this cluster around, or can reproduce it with that release from the 4th, grabbing a screenshot of the host's console might give some clues as to what went wrong.

It looks like the "clean failed" issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1901040, which is due to a PXE failure with dnsmasq when the link-local IPv6 address is removed on the master node that is running dnsmasq. The evidence is in quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0605b2f28717288cd0ddd81f15536e0053508a6d3239925cc767738c5ed8f2dd/namespaces/openshift-machine-api/pods/metal3-779c7bc555-fbvlg/metal3-dnsmasq/metal3-dnsmasq/logs/current_log. When the link-local address is removed, "failed to join DHCPv6 multicast group" is logged and the subsequent DHCPv6 queries fail:

2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe44:cf4f%enp4s0
2020-12-08T16:35:22.298627184Z dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
2020-12-08T16:35:26.789099344Z dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-08T16:35:28.233649351Z dnsmasq: listening on enp4s0(#2): fd00:1101::3

We are still investigating this. I'll mark this as a duplicate of 1901040.

*** This bug has been marked as a duplicate of bug 1901040 ***

Could not replicate this behavior since
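For reference when debugging a host stuck this way, below is a minimal sketch of inspecting the ironic API referenced in the error message above by hand. It assumes the in-cluster ironic endpoint at http://localhost:6385 is reachable without authentication (which may not hold on a real deployment) and reuses the node UUID from the log purely as an example; the "manage"/"provide" recovery path shown is the generic ironic workflow, not the retry logic that landed in the baremetal-operator PRs.

```python
# Sketch: query the node's provision_state and, if it is stuck in
# "clean failed", move it back to "manageable" and re-run cleaning.
# Assumes ironic at localhost:6385 with no auth (an assumption).
import time
import requests

IRONIC = "http://localhost:6385"                    # endpoint from the log above
NODE = "795350cd-26cd-4286-8189-8941e4f00319"       # node UUID from the log above
HEADERS = {"X-OpenStack-Ironic-API-Version": "1.56"}


def provision_state():
    """Return the node's current provision_state."""
    return requests.get(f"{IRONIC}/v1/nodes/{NODE}",
                        headers=HEADERS).json().get("provision_state")


def set_target(target):
    """Request a provision-state change (the same PUT the operator issues)."""
    r = requests.put(f"{IRONIC}/v1/nodes/{NODE}/states/provision",
                     json={"target": target}, headers=HEADERS)
    r.raise_for_status()


print("current provision_state:", provision_state())

if provision_state() == "clean failed":
    # Ironic rejects "deleted" while the node is in "clean failed"; it first
    # has to be moved back to "manageable", then cleaning can be re-run.
    set_target("manage")
    for _ in range(60):                 # state transitions are asynchronous
        if provision_state() == "manageable":
            break
        time.sleep(5)
    set_target("provide")               # manageable -> cleaning -> available
```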
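The dnsmasq failure mode described above can also be checked directly on the master node running the metal3 pod. A small sketch, assuming a Linux host with iproute2 and using the interface name from the log (enp4s0) only as an example, that reports whether the interface still has the fe80:: link-local address dnsmasq needs for DHCPv6:

```python
# Diagnostic sketch: check whether the dnsmasq interface still has an IPv6
# link-local (fe80::/10) address; without one, dnsmasq cannot join the DHCPv6
# multicast group, matching the log lines quoted above.
import subprocess

IFACE = "enp4s0"  # interface name from the log; adjust for the actual host

out = subprocess.run(
    ["ip", "-6", "addr", "show", "dev", IFACE],
    capture_output=True, text=True, check=True,
).stdout

link_local = [
    line.split()[1]
    for line in out.splitlines()
    if line.strip().startswith("inet6 fe80:")
]

if link_local:
    print(f"{IFACE} has link-local address(es): {', '.join(link_local)}")
else:
    print(f"{IFACE} has no fe80:: link-local address -- "
          "dnsmasq cannot serve DHCPv6 on this interface")
```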