Bug 1972430 - Adopt failure can trigger deprovisioning
Summary: Adopt failure can trigger deprovisioning
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Steven Hardy
QA Contact: Ori Michaeli
URL:
Whiteboard:
Depends On: 1972426 1976924
Blocks:
 
Reported: 2021-06-15 20:53 UTC by Zane Bitter
Modified: 2021-09-08 13:18 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1972426
Environment:
Last Closed: 2021-09-08 13:17:53 UTC
Target Upstream Version:
Embargoed:




Links:
GitHub openshift/baremetal-operator pull 159 (open): [release-4.7] Bug 1972430: Don't deprovision provisioned host due to error (last updated 2021-06-21 13:51:57 UTC)
Red Hat Product Errata RHSA-2021:3303 (last updated 2021-09-08 13:18:17 UTC)

Description Zane Bitter 2021-06-15 20:53:58 UTC
+++ This bug was initially created as a clone of Bug #1972426 +++

+++ This bug was initially created as a clone of Bug #1972374 +++

Description of problem:

https://bugzilla.redhat.com/show_bug.cgi?id=1971602 prompted us to look at some failed-upgrade cases that resulted in a broken image cache; this in turn causes adoption of the BMH resources to fail in Ironic.

This should be a transient error (fixing the image cache should restore the BMH resources to fully managed status without any interruption to existing provisioned hosts).

However, instead we see the BMH resources move to deprovisioning, which is not expected.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Disable the image cache, e.g. by ssh'ing to each master and moving the cached image/checksum files to a temporary different filename
2. Restart the metal3 pod
3. Wait some time, and observe the worker BMH resources move to deprovisioning
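The steps above can be sketched as a shell session. This is a minimal sketch under stated assumptions: the image-cache path and the metal3 pod label are guesses for illustration, not confirmed values from this bug, so check your deployment before running anything like it.

```shell
# Sketch of the reproduce steps above; path and pod label are assumptions.
CACHE_DIR=/var/lib/metal3/images   # assumed location of the cached images

# 1. Disable the image cache on each master by renaming the cached files
for master in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  node=${master#node/}
  ssh "core@${node}" "sudo mv ${CACHE_DIR} ${CACHE_DIR}.bak"
done

# 2. Restart the metal3 pod so Ironic has to re-adopt the hosts
oc -n openshift-machine-api delete pod -l k8s-app=metal3   # label is an assumption

# 3. Watch the worker BMH resources; with the bug present they drift
#    from "provisioned" to "deprovisioning"
oc -n openshift-machine-api get bmh -w
```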

Actual results:

$ oc get bmh
NAME              STATE                    CONSUMER                      ONLINE   ERROR
ostest-master-0   externally provisioned   ostest-g6bq8-master-0         true     provisioned registration error
ostest-master-1   externally provisioned   ostest-g6bq8-master-1         true     provisioned registration error
ostest-master-2   externally provisioned   ostest-g6bq8-master-2         true     provisioned registration error
ostest-worker-0   deprovisioning           ostest-g6bq8-worker-0-xh2wn   true     provisioned registration error
ostest-worker-1   deprovisioning           ostest-g6bq8-worker-0-c9mtq   true     provisioned registration error

Expected results:

The BMH resources should not change state from provisioned, only the error should be reflected.

Additional info:

In CI we encountered this scenario when a master host ran out of disk space; the reproduce steps above emulate such a scenario by making the images referenced by the BMH temporarily unavailable via the second-level cache.

--- Additional comment from Steven Hardy on 2021-06-15 14:25:06 EDT ---

https://github.com/metal3-io/baremetal-operator/issues/915

https://github.com/metal3-io/baremetal-operator/pull/916

--- Additional comment from Zane Bitter on 2021-06-15 14:36:31 EDT ---

Adoption is triggered by losing the Ironic DB, i.e. by a rescheduling of the metal3 Pod (common during an upgrade). So a combination of that and the image becoming unavailable would trigger the bug. When it is triggered, unless the failure is very brief, there is a decent chance that all worker nodes will be deprovisioned.

Comment 1 Zane Bitter 2021-06-21 13:51:31 UTC
The bugzilla bot hates it when you depend on all of the previous bugs in the chain instead of just the one for the following release...

Comment 5 Zane Bitter 2021-06-28 15:04:07 UTC
I opened bug 1976924 for the reporting problem, which was due to the CRD in BMO getting out of sync with CBO so we can't save a "provisioned registration error".

We won't be able to verify this bug until that one is fixed.
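Once that reporting bug is fixed, verification amounts to checking that the workers hold their provisioned state while the error is surfaced separately. A sketch of such a check, assuming the standard metal3 BareMetalHost status fields (`.status.provisioning.state` and `.status.errorType`):

```shell
# Expect STATE to stay "provisioned" for the workers while ERROR reports
# "provisioned registration error", rather than STATE moving to
# "deprovisioning" as in the bug.
oc -n openshift-machine-api get bmh \
  -o custom-columns='NAME:.metadata.name,STATE:.status.provisioning.state,ERROR:.status.errorType'
```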

Comment 10 errata-xmlrpc 2021-09-08 13:17:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.29 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3303

