Bug 1975296 - machinehealthcheck controller does not consider nodes that still have the external remediation annotation
Summary: machinehealthcheck controller does not consider nodes that still have the ext...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Marc Sluiter
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 2000038
Reported: 2021-06-23 12:04 UTC by kseremet
Modified: 2021-10-18 17:36 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:36:21 UTC
Target Upstream Version:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-baremetal pull 157 | 0 | None | open | Bug 1975296: Handle the case that external remediation annotation is removed from a recovered machine | 2021-08-30 11:22:19 UTC
Github | openshift machine-api-operator pull 891 | 0 | None | open | Bug 1975296: Remove external remediation annotation for healthy machines | 2021-08-30 11:22:23 UTC
Github | openshift machine-api-operator pull 898 | 0 | None | None | None | 2021-08-30 11:22:24 UTC
Github | openshift machine-api-operator pull 902 | 0 | None | None | None | 2021-08-30 11:22:25 UTC
Github | openshift openshift-docs pull 33739 | 0 | None | open | MHC: Improve explanation of maxUnhealthy | 2021-08-30 11:22:26 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:36:40 UTC

Description kseremet 2021-06-23 12:04:12 UTC
Description of problem:

In Baremetal IPI environments, node remediation is handled externally by the metal3 pod. When MHC annotates a BMH object (reboot.metal3.io: '{"mode":"hard"}'), the metal3 pod fences and reboots the node.

However, if metal3 is not running when MHC annotates the BMH objects as above, external remediation does not happen and the BMH objects stay annotated even after their nodes become Ready.

On baremetal deployments in particular, node reboots can take much longer than on cloud infrastructure. During cluster upgrades, the machine-config operator reboots nodes one after another in a rolling fashion, and if a reboot takes too long for some reason, MHC annotates that node for remediation. If the metal3 pod is not running, the node eventually reboots successfully and becomes Ready, but its corresponding BMH object stays annotated.
Later, when the cluster is fully up and running and the metal3 pod starts again, it blindly fences all the annotated baremetal nodes, causing a full outage of the whole cluster.
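
The sequence described above can be illustrated with a small simulation. This is only a sketch: the Host class and the mhc_annotate, metal3_process and simulate_rolling_upgrade functions are hypothetical stand-ins invented for illustration, not the actual controller code. Only the annotation key and value are taken from this report.

```python
from dataclasses import dataclass, field

REBOOT_ANNOTATION = "reboot.metal3.io"  # annotation key quoted in this report


@dataclass
class Host:
    """Hypothetical, simplified stand-in for a BMH object and its node."""
    name: str
    ready: bool = True
    annotations: dict = field(default_factory=dict)


def mhc_annotate(host):
    """Model MHC requesting external remediation for a host that is not Ready."""
    if not host.ready:
        host.annotations[REBOOT_ANNOTATION] = '{"mode":"hard"}'


def metal3_process(hosts, running):
    """Model metal3: when running, fence every annotated host, Ready or not."""
    if not running:
        return []
    fenced = [h.name for h in hosts if REBOOT_ANNOTATION in h.annotations]
    for h in hosts:
        h.annotations.pop(REBOOT_ANNOTATION, None)
    return fenced


def simulate_rolling_upgrade(names):
    """Reboot hosts one at a time while metal3 is down, then start metal3."""
    hosts = [Host(n) for n in names]
    for h in hosts:
        h.ready = False                        # slow reboot during the upgrade
        mhc_annotate(h)                        # MHC flags the host
        metal3_process(hosts, running=False)   # metal3 is down: nothing happens
        h.ready = True                         # node comes back on its own
    # metal3 starts later and finds a stale annotation on every host
    return metal3_process(hosts, running=True)
```

Calling simulate_rolling_upgrade(["master-0", "master-1", "worker-0"]) returns all three names, even though every host is Ready by then, which corresponds to the simultaneous fencing reported here.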


Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Stop the metal3 pod and keep it stopped.
2. Reboot a node and introduce some delay so that MHC annotates it for remediation.
3. Repeat step 2 for other nodes after the rebooted node becomes Ready.
4. In the OCP Web UI events, observe that some nodes are reported as powering off although all the nodes are up and Ready.
5. Start the metal3 pod and observe that all the nodes are fenced and rebooted at the same time.

Actual results:

MHC does not consider nodes that still have the annotation when evaluating unhealthy nodes, and it continues to annotate BMH objects even though external remediation is not happening.

Expected results:

MHC should remove the external remediation annotation from BMH objects whose nodes become Ready. It must also consider nodes that still have the annotation when evaluating unhealthy nodes, and stop annotating more BMH objects if external remediation is not happening.
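
One possible reading of the expected behaviour is sketched below. It is not the fix that was merged: the Host class, the reconcile function and its max_unhealthy parameter are hypothetical names used for illustration, and the sketch assumes that a host which is not Ready counts as unhealthy whether or not it already carries the annotation.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical, simplified view of a BMH object and its node."""
    name: str
    ready: bool
    annotated: bool  # external remediation annotation present on the BMH


def reconcile(hosts, max_unhealthy):
    """Return (names to un-annotate, names to annotate) for one pass."""
    # 1. Hosts that are Ready again no longer need remediation,
    #    so any annotation still on them is stale and should be removed.
    to_clear = [h.name for h in hosts if h.ready and h.annotated]

    # 2. Hosts that are not Ready count as unhealthy, including those
    #    already annotated and still waiting for external remediation.
    unhealthy = [h for h in hosts if not h.ready]

    # 3. Too many unhealthy hosts: request no further remediation.
    if len(unhealthy) > max_unhealthy:
        return to_clear, []

    to_annotate = [h.name for h in unhealthy if not h.annotated]
    return to_clear, to_annotate
```

With three hosts, one Ready but still annotated, one not Ready and annotated, and one not Ready and not annotated, a limit of 1 clears the stale annotation and requests no new remediation, while a limit of 2 additionally annotates the third host.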

Additional info:

Comment 3 Zane Bitter 2021-08-05 13:33:34 UTC
Why would the metal3 Pod not be running?

Comment 9 sunzhaohua 2021-08-19 06:52:19 UTC
Hi Amit, I don't have a bm cluster. Could you help test this? Thanks!

Comment 15 errata-xmlrpc 2021-10-18 17:36:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
