Bug 2065727
| Summary: | Scaling down a HyperShift cluster ends with the BMH shut down and in maintenance mode | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Javi Polo <jpolo> |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Dmitry Tantsur <dtantsur> |
| Sub component: | baremetal-operator | QA Contact: | Pedro Amoedo <pamoedom> |
| Status: | CLOSED ERRATA | Type: | Bug |
| Severity: | medium | Priority: | medium |
| Version: | 4.9 | Target Release: | 4.12.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Keywords: | Triaged | CC: | rpattath |
| Last Closed: | 2023-01-17 19:47:48 UTC | | |
Description (Javi Polo, 2022-03-18 15:12:53 UTC)
Created attachment 1866624 [details]: metal3-baremetal-operator.log
Created attachment 1866625 [details]: cluster-api-agent-provider.log
Created attachment 1866626 [details]: assisted-service.log
Created attachment 1866627 [details]: BareMetalHost.yaml CustomResource
Findings so far: when scaling down, BMO first goes down the Ironic node deletion path. For that, it sets the maintenance flag on the Ironic node and reschedules the reconciliation. On the next iteration, BMO goes down the deprovisioning path (as expected). Since the node is still in maintenance, provisioning actions are not allowed on it, and the process gets stuck.

1) I'm not sure why BMO tries the node deletion (but never completes it). My guess is that it has something to do with the removal of the detached annotation.
2) We need to enforce the maintenance mode, or the lack of it, on Ironic nodes in BMO.

Created attachment 1866654 [details]: full metal3-baremetal-operator.log
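To illustrate point 2, here is a minimal sketch of what enforcing the maintenance state could look like, assuming gophercloud's baremetal v1 `nodes` package (the client BMO uses to talk to Ironic). The helper names are hypothetical and this is not taken from any actual BMO patch:

```go
// Hypothetical sketch (not actual baremetal-operator code): clear a stale
// maintenance flag before requesting a provision-state change, so that a
// node left in maintenance by an abandoned deletion attempt cannot block
// deprovisioning.
package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"
)

// ensureNotInMaintenance is a hypothetical helper: if Ironic reports the
// node in maintenance, clear the flag before proceeding.
func ensureNotInMaintenance(client *gophercloud.ServiceClient, nodeID string) error {
	node, err := nodes.Get(client, nodeID).Extract()
	if err != nil {
		return fmt.Errorf("looking up node %s: %w", nodeID, err)
	}
	if node.Maintenance {
		// Ironic rejects provision-state changes while a node is in
		// maintenance, which is exactly how the BMH got stuck here.
		if err := nodes.UnsetMaintenance(client, nodeID).ExtractErr(); err != nil {
			return fmt.Errorf("clearing maintenance on node %s: %w", nodeID, err)
		}
	}
	return nil
}

// deprovision shows where the check would slot into the deprovisioning path.
func deprovision(client *gophercloud.ServiceClient, nodeID string) error {
	if err := ensureNotInMaintenance(client, nodeID); err != nil {
		return err
	}
	return nodes.ChangeProvisionState(client, nodeID, nodes.ProvisionStateOpts{
		Target: nodes.TargetDeleted,
	}).ExtractErr()
}
```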
So far I've been unable to artificially reproduce this problem by deploying a live ISO on a BMH, detaching it, and then deprovisioning. Maybe it's something that is only present in 4.9 (I'm on master, i.e. 4.11), or my testing is missing some critical step. Or it's timing-sensitive. Anyway, https://github.com/metal3-io/baremetal-operator/pull/1101 should at least make the maintenance mode less of a problem. I still hope to get to the root cause eventually, but I'm also leaving for PTO soon.

I'll try to build a custom bmo image with your changes and spawn a new cluster to see if the usability improves :) Anyway, since I can easily reproduce this behavior, if you want, I can give you access to an affected cluster when you're back from PTO so you can debug it better.

After reproducing the behaviour, I replaced the baremetal-operator image with one carrying Dmitry's patch, and now the nodes don't get stuck in maintenance mode :) @dtantsur do you want to keep this open to get to the root issue, or is the workaround you wrote enough to close this? As a user I can no longer find the wrong behaviour.

The workaround that did the trick for this BZ is already present in PR #239 [1] via commit d46e118fe5a055d1d84a9b62850e7d1456acee8a [2], so far so good; moving to VERIFIED.

[1] https://github.com/openshift/baremetal-operator/pull/239
[2] https://github.com/openshift/baremetal-operator/pull/239/commits/d46e118fe5a055d1d84a9b62850e7d1456acee8a

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399