Bug 1868104

Summary: Baremetal actuator should not delete Machine objects
Product: OpenShift Container Platform Reporter: Zane Bitter <zbitter>
Component: Cloud Compute Assignee: Zane Bitter <zbitter>
Cloud Compute sub component: BareMetal Provider QA Contact: Daniel <dmaizel>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: medium CC: abeekhof, beth.white, dhellmann, mgugino, msluiter, nyehia, sdasu, shardy, stbenjam, vsibirsk
Version: 4.6 Keywords: TestBlockerForLayeredProduct, Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The baremetal actuator in the CAPBM was written before the Machine error handling code was available in the Cluster API. Therefore it handles a situation where the underlying Host is deleted by deleting the Machine object as well.
Consequence: The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and the user to ultimately delete.
Fix:
1. Set "InsufficientResourcesMachineError" on Machines that are searching (unsuccessfully) for an available host. This ensures that such Machines are the first victims on scale down.
2. Move Machines into the "Failed" phase if the Host is deprovisioned.
3. Don't delete failed Machines; leave this task to the MachineHealthCheck (see openshift/machine-api-operator#688).
Result: The Machine object is no longer deleted automatically. Failed Machines now follow the process described under Fix, as intended.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:15:27 UTC Type: Bug
Bug Depends On: 1901040, 1909682    
Bug Blocks:    

Description Zane Bitter 2020-08-11 17:09:53 UTC
Description of problem:
The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and the user to ultimately delete.

However, the baremetal actuator in the CAPBM was written before that error handling code was available in the Cluster API. Therefore it handles a situation where the underlying Host is deleted by deleting the Machine object as well.

It should stop doing that.
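
As a rough illustration of the behaviour requested here, and of the three steps listed in the Doc Text, the following is a minimal sketch and not the CAPBM code. The types Machine, Host and Phase, the Consumer field, and the reconcile function are all hypothetical simplifications; only the reason string "InsufficientResourcesMachineError" and the "Failed" phase name come from this report.

```go
package main

// Phase is a simplified stand-in for the Machine phase (hypothetical type).
type Phase string

const (
	PhaseProvisioning Phase = "Provisioning"
	PhaseRunning      Phase = "Running"
	PhaseFailed       Phase = "Failed"
)

// Reason string named in this bug's Doc Text.
const InsufficientResourcesMachineError = "InsufficientResourcesMachineError"

// Host is a hypothetical, reduced view of a bare metal Host.
type Host struct {
	Name        string
	Consumer    string // name of the Machine using this Host, empty if free
	Provisioned bool
}

// Machine is a hypothetical, reduced view of a Machine object.
type Machine struct {
	Name        string
	HostName    string // empty while the Machine is still searching for a Host
	Phase       Phase
	ErrorReason string
}

// reconcile returns the updated Machine. Note that there is no code path
// that deletes the Machine: deletion is left to the user or to a
// remediation controller.
func reconcile(m Machine, hosts map[string]Host) Machine {
	// Step 3: a failed Machine is left as-is for something else to clean up.
	if m.Phase == PhaseFailed {
		return m
	}

	// Step 1: a Machine without a Host looks for a free one and records an
	// error reason if none is available.
	if m.HostName == "" {
		for name, h := range hosts {
			if h.Consumer == "" {
				m.HostName = name
				m.ErrorReason = ""
				return m
			}
		}
		m.ErrorReason = InsufficientResourcesMachineError
		return m
	}

	// Step 2: a Machine whose Host is gone, or whose Host was deprovisioned
	// after the Machine was running, moves to Failed instead of being deleted.
	h, found := hosts[m.HostName]
	switch {
	case !found, m.Phase == PhaseRunning && !h.Provisioned:
		m.Phase = PhaseFailed
	case h.Provisioned:
		m.Phase = PhaseRunning
	}
	return m
}
```

In this sketch a searching Machine stays in its current phase and only gains the error reason, so a scale-down policy could prefer it as a victim, while a Machine that has lost its Host ends up in "Failed" and stays there on later reconciles.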

Potential complications:
* On baremetal, the Machine Remediation controller will attempt to remediate by rebooting, which obviously is not the appropriate way to handle the case where the Host has been deleted.
* Currently we only keep track of the Host assigned to the Machine by name, so if the Host gets deleted and recreated we might try to pick it up and provision it again. We'll need to record the UID somehow.
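
The second complication can be sketched as follows, assuming the Machine stores both the name and the UID of its Host. Kubernetes assigns every object a metadata UID that is not reused when an object is deleted and recreated under the same name, so comparing UIDs distinguishes a recreated Host from the original. The HostRef and Host types and the sameHost function are hypothetical; how the actuator actually records the UID is not stated in this report.

```go
package main

// HostRef is what a Machine would record about its Host (hypothetical type).
type HostRef struct {
	Name string
	UID  string
}

// Host is a reduced view of a Host object; UID stands in for metadata.uid.
type Host struct {
	Name string
	UID  string
}

// sameHost reports whether the Host currently found under the recorded name
// is the same object the Machine was originally bound to. A Host that was
// deleted and recreated keeps its name but gets a new UID, so it must not be
// picked up and provisioned again on behalf of the old Machine.
func sameHost(ref HostRef, hosts map[string]Host) bool {
	h, ok := hosts[ref.Name]
	return ok && h.UID == ref.UID
}
```

With a name-only reference, the recreated Host would be indistinguishable from the original; with the UID check, the Machine can be treated as having lost its Host.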

Comment 1 Andrew Beekhof 2020-08-12 03:10:14 UTC
(In reply to Zane Bitter from comment #0)

> Potential complications:
> * On baremetal, the Machine Remediation controller will attempt to remediate
> by rebooting, which obviously is not the appropriate way to handle the case
> where the Host has been deleted.

We're hoping to have an escalation path from reboot to deletion in 4.7

Comment 2 Beth White 2020-09-03 12:42:01 UTC
*** Bug 1840581 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2021-02-24 15:15:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633