Bug 1868104 - Baremetal actuator should not delete Machine objects
Summary: Baremetal actuator should not delete Machine objects
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Zane Bitter
QA Contact: Daniel
URL:
Whiteboard:
Duplicates: 1840581
Depends On: 1901040 1909682
Blocks:
Reported: 2020-08-11 17:09 UTC by Zane Bitter
Modified: 2021-02-24 15:15 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The baremetal actuator in the CAPBM was written before the error handling code was available in the Cluster API. As a result, when the underlying Host is deleted, the actuator responds by deleting the Machine object as well.
Consequence: The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and for the user to ultimately delete.
Fix:
1. Set "InsufficientResourcesMachineError" on Machines that are searching (unsuccessfully) for an available host. This ensures that such Machines are the first victims on scale down.
2. Move Machines into the "Failed" phase if the Host is deprovisioned.
3. Don't delete failed Machines; leave this task to the MachineHealthCheck (see openshift/machine-api-operator#688).
Result: Machine objects are no longer deleted automatically. Failed Machines now follow the process described above, as intended.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:15:27 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-baremetal pull 113 | 0 | None | closed | Bug 1868104: Make use of errors and Failed phase to handle failed machines | 2021-02-15 19:30:36 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:15:56 UTC

Description Zane Bitter 2020-08-11 17:09:53 UTC
Description of problem:
The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and the user to ultimately delete.

However, the baremetal actuator in the CAPBM was written before that error handling code was available in the Cluster API. As a result, when the underlying Host is deleted, the actuator responds by deleting the Machine object as well.

It should stop doing that.
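
A minimal sketch of the intended behaviour, following the three fix steps listed in the Doc Text. This is an illustrative Python model, not the CAPBM code (the actuator itself is not shown in this report): the `Machine` class, the `reconcile` function, the simplified phase names, and the host-state strings are all assumptions made for the example. Only the "InsufficientResourcesMachineError" reason and the "Failed" phase come from this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Phase(Enum):
    # Simplified, hypothetical set of phases for this sketch.
    PROVISIONING = "Provisioning"
    RUNNING = "Running"
    FAILED = "Failed"


@dataclass
class Machine:
    name: str
    phase: Phase = Phase.PROVISIONING
    error_reason: Optional[str] = None
    host_name: Optional[str] = None  # name of the assigned Host, if any


def reconcile(machine: Machine, hosts: Dict[str, str]) -> Machine:
    """Reconcile one Machine against known Hosts.

    `hosts` maps a Host name to a hypothetical state string:
    "available", "provisioned" or "deprovisioned".
    The Machine is always returned; it is never deleted here.
    """
    # Step 3: a failed Machine is left alone for the health check
    # (or the user) to deal with.
    if machine.phase is Phase.FAILED:
        return machine

    if machine.host_name is None:
        available = sorted(n for n, s in hosts.items() if s == "available")
        if available:
            machine.host_name = available[0]
            machine.error_reason = None
        else:
            # Step 1: flag Machines that cannot find a host, so they
            # are the first candidates on scale down.
            machine.error_reason = "InsufficientResourcesMachineError"
        return machine

    state = hosts.get(machine.host_name)
    if state is None or state == "deprovisioned":
        # Step 2: the Host is gone or deprovisioned, so the Machine
        # moves to Failed instead of being deleted.
        machine.phase = Phase.FAILED
    elif state == "provisioned":
        machine.phase = Phase.RUNNING
    return machine
```

The point of the sketch is that every branch returns the Machine object: losing the Host changes its phase, and deletion is left to something outside the actuator.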

Potential complications:
* On baremetal, the Machine Remediation controller will attempt to remediate by rebooting, which obviously is not the appropriate way to handle the case where the Host has been deleted.
* Currently we only keep track of the Host assigned to the Machine by name, so if the Host gets deleted and recreated we might try to pick it up and provision it again. We'll need to record the UID somehow.
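
To illustrate the second complication, here is a rough sketch of matching a Host by both name and UID, so that a Host deleted and recreated under the same name is not picked up again. It is written in Python for brevity; `Host`, `HostRef`, and `resolve_host` are hypothetical names for this example, and how the UID would actually be recorded on the Machine is still an open question in this report.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Host:
    name: str
    uid: str  # unique per object; changes if the Host is recreated


@dataclass(frozen=True)
class HostRef:
    """What a Machine would record about its assigned Host (assumed shape)."""
    name: str
    uid: str


def resolve_host(ref: HostRef, hosts_by_name: Dict[str, Host]) -> Optional[Host]:
    """Return the Host only if both its name and UID match the reference."""
    host = hosts_by_name.get(ref.name)
    if host is None or host.uid != ref.uid:
        # Missing, or a different object that reuses the old name.
        return None
    return host
```

With a name-only lookup, the recreated Host would be indistinguishable from the original; comparing the UID as well makes the lookup return nothing in that case, which the actuator could then treat as a lost Host.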

Comment 1 Andrew Beekhof 2020-08-12 03:10:14 UTC
(In reply to Zane Bitter from comment #0)

> Potential complications:
> * On baremetal, the Machine Remediation controller will attempt to remediate
> by rebooting, which obviously is not the appropriate way to handle the case
> where the Host has been deleted.

We're hoping to have an escalation path from reboot to deletion in 4.7

Comment 2 Beth White 2020-09-03 12:42:01 UTC
*** Bug 1840581 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2021-02-24 15:15:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

