Description of problem:

The intended operation of the Machine controller is that a Machine object is never deleted except by the user. If a Machine fails, it is put into a failed state and left for the machine remediation controller to try to recover, and for the user to ultimately delete.

However, the baremetal actuator in CAPBM was written before that error-handling code was available in the Cluster API. It therefore handles the case where the underlying Host is deleted by deleting the Machine object as well. It should stop doing that.

Potential complications:

* On baremetal, the Machine Remediation controller will attempt to remediate by rebooting, which is not an appropriate way to handle the case where the Host has been deleted.
* Currently we only keep track of the Host assigned to the Machine by name, so if the Host gets deleted and recreated we might try to pick it up and provision it again. We'll need to record the UID somehow.
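To make the proposed behaviour concrete, here is a minimal sketch in Go. It is not the CAPBM code: the `Host`, `HostRef`, and `Machine` types, their fields, and the `checkHost` and `reconcile` functions are all hypothetical stand-ins. It assumes the Machine records both the name and the UID of its Host, and that a missing or recreated Host should put the Machine into a failed state rather than delete it.

```go
package main

import "fmt"

// Host is a hypothetical stand-in for the baremetal Host object.
type Host struct {
	Name string
	UID  string
}

// HostRef is a hypothetical record of the Host assigned to a Machine.
// Storing the UID next to the name is the assumption being illustrated.
type HostRef struct {
	Name string
	UID  string
}

// Machine is a hypothetical stand-in for the Machine object.
type Machine struct {
	Name         string
	HostRef      HostRef
	Failed       bool
	ErrorMessage string
}

// HostStatus describes how the recorded Host compares with what exists now.
type HostStatus int

const (
	HostPresent  HostStatus = iota // same name and same UID
	HostMissing                    // no Host with that name
	HostReplaced                   // same name, different UID
)

// checkHost looks up the recorded Host by name and compares UIDs.
func checkHost(ref HostRef, hosts map[string]Host) HostStatus {
	h, ok := hosts[ref.Name]
	if !ok {
		return HostMissing
	}
	if h.UID != ref.UID {
		return HostReplaced
	}
	return HostPresent
}

// reconcile marks the Machine as failed when its Host is gone or has been
// recreated. It never deletes the Machine; that is left to the user.
func reconcile(m *Machine, hosts map[string]Host) {
	switch checkHost(m.HostRef, hosts) {
	case HostMissing:
		m.Failed = true
		m.ErrorMessage = fmt.Sprintf("host %q no longer exists", m.HostRef.Name)
	case HostReplaced:
		m.Failed = true
		m.ErrorMessage = fmt.Sprintf("host %q was deleted and recreated", m.HostRef.Name)
	}
}

func main() {
	// The Host was recreated under the same name, so its UID differs.
	hosts := map[string]Host{"worker-0": {Name: "worker-0", UID: "uid-b"}}
	m := &Machine{Name: "machine-0", HostRef: HostRef{Name: "worker-0", UID: "uid-a"}}
	reconcile(m, hosts)
	fmt.Println(m.Failed, m.ErrorMessage)
}
```

With a name-only reference, the recreated `worker-0` in `main` would look like the original Host and could be provisioned again. Comparing UIDs is what lets the sketch tell the two apart.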
(In reply to Zane Bitter from comment #0)
> Potential complications:
> * On baremetal, the Machine Remediation controller will attempt to remediate
> by rebooting, which obviously is not the appropriate way to handle the case
> where the Host has been deleted.

We're hoping to have an escalation path from reboot to deletion in 4.7.
*** Bug 1840581 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633