1868104 – Baremetal actuator should not delete Machine objects

Summary: Baremetal actuator should not delete Machine objects

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Zane Bitter
QA Contact:	Daniel
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1840581
Depends On:	1901040 1909682
Blocks:

Reported:	2020-08-11 17:09 UTC by Zane Bitter
Modified:	2021-02-24 15:15 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The baremetal actuator in the CAPBM was written before error handling code was available in the Cluster API. It therefore handles deletion of the underlying Host by deleting the Machine object as well.
Consequence: Machine objects are deleted automatically, although the intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and for the user to ultimately delete.
Fix:
1. Set "InsufficientResourcesMachineError" on Machines that are searching (unsuccessfully) for an available host. This ensures that such Machines are the first victims on scale down.
2. Move Machines into the "Failed" phase if the Host is deprovisioned.
3. Don't delete failed Machines; leave this task to the MachineHealthCheck (see openshift/machine-api-operator#688).
Result: Machine objects are no longer deleted automatically; failed Machines are handled by the process described under Fix, as intended.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:15:27 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-baremetal pull 113	0	None	closed	Bug 1868104: Make use of errors and Failed phase to handle failed machines	2021-02-15 19:30:36 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:15:56 UTC

Description Zane Bitter 2020-08-11 17:09:53 UTC

Description of problem:
The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and the user to ultimately delete.

However, the baremetal actuator in the CAPBM was written before that error handling code was available in the Cluster API. Therefore it handles a situation where the underlying Host is deleted by deleting the Machine object as well.

It should stop doing that.
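
For illustration only, here is a minimal sketch of the requested behaviour: when the Host backing a Machine has gone away, the failure is recorded on the Machine rather than the Machine being deleted. The Machine struct, the handleMissingHost function and the "HostDeleted" reason string are hypothetical simplifications, not the real Machine API types or the actual CAPBM code.

```go
package main

import "fmt"

// Machine is a hypothetical, heavily simplified stand-in for the
// Machine API object; the real type has many more fields.
type Machine struct {
	Name         string
	Phase        string
	ErrorReason  string
	ErrorMessage string
	Deleted      bool // only here to show that the sketch never deletes
}

// phaseFailed mirrors the "Failed" phase named in this bug report.
const phaseFailed = "Failed"

// handleMissingHost records the failure on the Machine and leaves the
// object in place, so that remediation or the user can decide what to
// do with it. The reason string is an assumption made for this sketch.
func handleMissingHost(m *Machine, hostName string) {
	m.Phase = phaseFailed
	m.ErrorReason = "HostDeleted"
	m.ErrorMessage = fmt.Sprintf("host %q no longer exists", hostName)
}

func main() {
	m := &Machine{Name: "worker-0", Phase: "Running"}
	handleMissingHost(m, "host-0")
	fmt.Println(m.Phase, m.ErrorReason, m.Deleted)
}
```

The point of the sketch is only that the failure path mutates status and returns; nothing in it removes the Machine object.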

Potential complications:
* On baremetal, the Machine Remediation controller will attempt to remediate by rebooting, which obviously is not the appropriate way to handle the case where the Host has been deleted.
* Currently we only keep track of the Host assigned to the Machine by name, so if the Host gets deleted and recreated we might try to pick it up and provision it again. We'll need to record the UID somehow.
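
To make the second complication concrete, the following sketch shows one way a recorded UID could distinguish the original Host from a recreated Host with the same name. The Host and HostRef types and the sameHost function are hypothetical names introduced for this example; how the UID would actually be stored is left open in this report.

```go
package main

import "fmt"

// Host is a hypothetical stand-in for the Host object: a name plus the
// UID that the API server assigns when the object is created.
type Host struct {
	Name string
	UID  string
}

// HostRef is what a Machine might record about its assigned Host.
// Recording only Name is the current behaviour described above; the
// UID field is the proposed addition.
type HostRef struct {
	Name string
	UID  string
}

// sameHost reports whether h is the very object that ref was recorded
// from. A Host deleted and recreated under the same name gets a new
// UID, so it does not match.
func sameHost(ref HostRef, h *Host) bool {
	return h != nil && h.Name == ref.Name && h.UID == ref.UID
}

func main() {
	ref := HostRef{Name: "host-0", UID: "uid-a"}
	original := &Host{Name: "host-0", UID: "uid-a"}
	recreated := &Host{Name: "host-0", UID: "uid-b"}
	fmt.Println(sameHost(ref, original), sameHost(ref, recreated))
}
```

With a name-only comparison both Hosts in main would match; comparing the UID as well keeps the recreated Host from being picked up and provisioned again.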

Comment 1 Andrew Beekhof 2020-08-12 03:10:14 UTC

(In reply to Zane Bitter from comment #0)

> Potential complications:
> * On baremetal, the Machine Remediation controller will attempt to remediate
> by rebooting, which obviously is not the appropriate way to handle the case
> where the Host has been deleted.

We're hoping to have an escalation path from reboot to deletion in 4.7.

Comment 2 Beth White 2020-09-03 12:42:01 UTC

*** Bug 1840581 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2021-02-24 15:15:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0
security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
