Bug 1972426
Summary: Adopt failure can trigger deprovisioning
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Reporter: Zane Bitter <zbitter>
Assignee: Zane Bitter <zbitter>
QA Contact: Ori Michaeli <omichael>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: augol, beth.white, omichael, shardy
Version: 4.8
Keywords: Triaged, UpcomingSprint
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: An error (other than a registration error) in the provisioned state would cause the Host to be deprovisioned.
Consequence: After a restart of the metal3 pod (such as during an upgrade), if the image provisioned to a BareMetalHost was no longer accessible, the Host would enter the deprovisioning state and be stuck there until the image became available again (at which point it would be deprovisioned).
Fix: An error in the provisioned state is now reported without triggering deprovisioning.
Result: If the image becomes unavailable, the error is reported but deprovisioning is not initiated.
Clone Of: 1972374
Clones: 1972430
Last Closed: 2021-07-27 23:12:54 UTC
Bug Depends On: 1972374
Bug Blocks: 1972430
Description (Zane Bitter, 2021-06-15 20:47:13 UTC):

It appears that bug 1972572 will trigger this one on every upgrade from 4.7 to 4.8 (even if disk space is not exhausted) by deleting the cached image and not recreating it. The good news is that it is perma-broken, so deprovisioning cannot actually proceed; the bad news is that the hosts will have moved to the deprovisioning state (so the Machine will become failed, potentially causing Node draining if MachineHealthCheck is enabled), and when a fix replaces the image, deprovisioning of all workers will proceed. So I agree that this is definitely a blocker for 4.8. There is no such mechanism known for 4.7, so no reason to panic there, although obviously this should be fixed.

Verified with 4.8.0-0.nightly-2021-06-19-005119:

    [kni@provisionhost-0-0 ~]$ oc get bmh -A
    NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
    openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-mtqwf-master-0         true     provisioned registration error
    openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-mtqwf-master-1         true     provisioned registration error
    openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-mtqwf-master-2         true     provisioned registration error
    openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-mtqwf-worker-0-rrz7k   true     provisioned registration error
    openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-mtqwf-worker-0-bgqm2   true     provisioned registration error

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438