Bug 1972426
Summary: Adopt failure can trigger deprovisioning
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Reporter: Zane Bitter <zbitter>
Assignee: Zane Bitter <zbitter>
QA Contact: Ori Michaeli <omichael>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: augol, beth.white, omichael, shardy
Version: 4.8
Keywords: Triaged, UpcomingSprint
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: An error (other than a registration error) in the provisioned state would cause the Host to be deprovisioned.
Consequence: After a restart of the metal3 pod (such as during an upgrade), if the image provisioned to a BareMetalHost was no longer accessible, the Host would enter the deprovisioning state and be stuck there until the image became available again (at which point it would be deprovisioned).
Fix: An error in the provisioned state is now reported without triggering deprovisioning.
Result: If the image becomes unavailable, the error is reported but deprovisioning is not initiated.
Clone Of: 1972374
Clones: 1972430
Last Closed: 2021-07-27 23:12:54 UTC
Bug Depends On: 1972374
Bug Blocks: 1972430
Description (Zane Bitter, 2021-06-15 20:47:13 UTC):

It appears that bug 1972572 will trigger this one on every upgrade from 4.7 to 4.8 (even if disk space is not exhausted) by deleting the cached image and not recreating it. The good news is that it is perma-broken, so deprovisioning cannot actually proceed; the bad news is that the hosts will have moved to the deprovisioning state (so the Machine will become failed, potentially causing Node draining if MachineHealthCheck is enabled), and when a fix replaces the image, deprovisioning of all workers will proceed. So I agree that this is definitely a blocker for 4.8. There is no such mechanism known for 4.7, so no reason to panic there, although obviously this should be fixed.

Verified with 4.8.0-0.nightly-2021-06-19-005119:

    [kni@provisionhost-0-0 ~]$ oc get bmh -A
    NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
    openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-mtqwf-master-0         true     provisioned registration error
    openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-mtqwf-master-1         true     provisioned registration error
    openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-mtqwf-master-2         true     provisioned registration error
    openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-mtqwf-worker-0-rrz7k   true     provisioned registration error
    openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-mtqwf-worker-0-bgqm2   true     provisioned registration error

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438