Description of problem:
In baremetal IPI environments, node remediation is handled externally by the metal3 pod. When a BMH object is annotated by MHC with reboot.metal3.io: '{"mode":"hard"}', the node is fenced and rebooted by the metal3 pod. However, if metal3 is not running when the BMH objects are annotated by MHC, external remediation does not happen and the BMH objects stay annotated even after their nodes become Ready.

On baremetal deployments in particular, node reboots can take much longer than on cloud infrastructures. During cluster upgrades, nodes are rebooted by the machine-config operator one after another in a rolling fashion, and if a reboot takes too long for some reason, MHC annotates those nodes for remediation. If the metal3 pod is not running, the nodes eventually finish rebooting and become Ready, but their corresponding BMH objects stay annotated. Later, when the cluster is fully up and running, if the metal3 pod starts, it blindly fences all the annotated baremetal nodes, causing a full outage of the whole cluster.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Stop the metal3 pod and keep it stopped.
2. Reboot a node and introduce enough delay that MHC annotates it for remediation.
3. Repeat step 2 for other nodes after the rebooted node becomes Ready.
4. In the OCP web console events, observe that some nodes are reported as powering off although all nodes are up and Ready.
5. Start the metal3 pod and watch all the annotated nodes get fenced and rebooted at the same time.

Actual results:
MHC does not consider nodes that still carry the annotation when evaluating unhealthy nodes, and it keeps annotating BMH objects even though external remediation is not happening.

Expected results:
MHC should remove the external remediation annotation from BMH objects whose nodes become Ready. It should also consider nodes that still carry the annotation when evaluating unhealthy nodes, and stop annotating more BMH objects if external remediation is not happening.

Additional info:
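For illustration only, a minimal Go sketch of the kind of cleanup described under Expected results: drop the stale reboot.metal3.io annotation from any BMH whose node is already Ready, so that a later metal3 start does not fence it. This is not the actual MHC or metal3 code; the namespace (openshift-machine-api), the assumption that a BMH shares its node's name, and the helper names (cleanupStaleRebootAnnotations, nodeIsReady) are assumptions made for the sketch.

package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Annotation key as reported above; set by MHC, acted on by metal3.
const rebootAnnotation = "reboot.metal3.io"

// Assumption: BMH objects live in the openshift-machine-api namespace.
const bmhNamespace = "openshift-machine-api"

var bmhGVR = schema.GroupVersionResource{
	Group: "metal3.io", Version: "v1alpha1", Resource: "baremetalhosts",
}

// nodeIsReady reports whether the Node has a Ready condition set to True.
func nodeIsReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// cleanupStaleRebootAnnotations lists BMH objects and removes the reboot
// annotation from each one whose node is already Ready, leaving genuinely
// unhealthy hosts untouched.
func cleanupStaleRebootAnnotations(ctx context.Context, cfg *rest.Config) error {
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	kube, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	bmhs, err := dyn.Resource(bmhGVR).Namespace(bmhNamespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}

	for _, bmh := range bmhs.Items {
		anns := bmh.GetAnnotations()
		if _, found := anns[rebootAnnotation]; !found {
			continue // nothing to clean up on this host
		}
		// Assumption: the BMH name matches the node name, as in typical
		// baremetal IPI deployments.
		node, err := kube.CoreV1().Nodes().Get(ctx, bmh.GetName(), metav1.GetOptions{})
		if err != nil || !nodeIsReady(node) {
			continue // keep the annotation if the node is missing or still unhealthy
		}
		delete(anns, rebootAnnotation)
		bmh.SetAnnotations(anns)
		if _, err := dyn.Resource(bmhGVR).Namespace(bmhNamespace).Update(ctx, &bmh, metav1.UpdateOptions{}); err != nil {
			return err
		}
		fmt.Printf("removed stale %s annotation from BMH %s\n", rebootAnnotation, bmh.GetName())
	}
	return nil
}

The sketch uses the dynamic client with unstructured BMH objects so it does not depend on the baremetal-operator Go types; a real fix inside MHC would use its existing typed clients instead.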
Why would the metal3 Pod not be running?
Hi Amit, I don't have a baremetal cluster, could you help to test this? Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759