1975296 – machinehealthcheck controller does not consider nodes that still have the external remediation annotation


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Marc Sluiter
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2000038

Reported:	2021-06-23 12:04 UTC by kseremet
Modified:	2024-10-01 18:45 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:36:21 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-baremetal pull 157	None	open	Bug 1975296: Handle the case that external remediation annotation is removed from a recovered machine	2021-08-30 11:22:19 UTC
Github	openshift machine-api-operator pull 891	None	open	Bug 1975296: Remove external remediation annotation for healthy machines	2021-08-30 11:22:23 UTC
Github	openshift machine-api-operator pull 898	None	None	None	2021-08-30 11:22:24 UTC
Github	openshift machine-api-operator pull 902	None	None	None	2021-08-30 11:22:25 UTC
Github	openshift openshift-docs pull 33739	None	open	MHC: Improve explanation of maxUnhealthy	2021-08-30 11:22:26 UTC
Red Hat Product Errata	RHSA-2021:3759	None	None	None	2021-10-18 17:36:40 UTC

Description kseremet 2021-06-23 12:04:12 UTC

Description of problem:

In bare-metal IPI environments, node remediation is handled externally by the metal3 pod. When MHC annotates a BMH object (reboot.metal3.io: '{"mode":"hard"}'), the metal3 pod fences and reboots the node.

However, if metal3 is not running when MHC annotates the BMH objects as above, external remediation does not happen, and the BMH objects stay annotated even after their nodes become Ready.

On bare-metal deployments in particular, node reboots can take much longer than on cloud infrastructure. During cluster upgrades, the machine-config operator reboots nodes one after another in a rolling fashion, and if a reboot takes longer than expected for any reason, MHC annotates that node for remediation. If the metal3 pod is not running, the node eventually reboots successfully and becomes Ready, but its corresponding BMH object stays annotated.
Later, when the cluster is fully up and running, if the metal3 pod starts, it blindly fences all the annotated bare-metal nodes, causing a full outage of the whole cluster.
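
The sequence described above can be modelled with a small toy simulation. This is only a sketch under stated assumptions: the `node` type and the `mhcPass` and `metal3Pass` functions are hypothetical stand-ins, not code from machine-api-operator or metal3, and the `annotated` flag stands in for the reboot.metal3.io annotation on the BMH object.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a node and its BMH object.
type node struct {
	name      string
	ready     bool
	annotated bool // stands in for the reboot.metal3.io annotation
}

// mhcPass mimics the behaviour reported in this bug: nodes that are not
// Ready get annotated, and the annotation is never removed on recovery.
func mhcPass(nodes []*node) {
	for _, n := range nodes {
		if !n.ready {
			n.annotated = true
		}
	}
}

// metal3Pass models metal3 acting on every annotated host regardless of the
// node's current readiness. Fencing is reduced to marking the node not Ready.
func metal3Pass(nodes []*node) []string {
	var fenced []string
	for _, n := range nodes {
		if n.annotated {
			n.ready = false
			fenced = append(fenced, n.name)
		}
	}
	return fenced
}

func main() {
	nodes := []*node{
		{name: "master-0", ready: true},
		{name: "master-1", ready: true},
		{name: "master-2", ready: true},
	}

	// Rolling reboot while metal3 is down: each node is slow to come back,
	// gets annotated, then recovers on its own.
	for _, n := range nodes {
		n.ready = false
		mhcPass(nodes)
		n.ready = true
	}

	// metal3 starts later and fences every node at once.
	fmt.Println("fenced:", metal3Pass(nodes))
}
```

Although only one node is ever down at a time during the rolling reboot, all three end up fenced together once `metal3Pass` runs, which is the full-outage scenario reported here.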

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

1. Stop the metal3 pod and keep it stopped.
2. Reboot a node and introduce enough delay that MHC annotates it for remediation.
3. Repeat step 2 for the other nodes, each time after the previously rebooted node becomes Ready.
4. In the events in the OCP web UI, observe that some nodes are reported as powering off although all nodes are up and Ready.
5. Start the metal3 pod and observe that all the nodes are fenced and rebooted at the same time.

Actual results:

When evaluating unhealthy nodes, MHC does not consider nodes that still have the annotation, and it continues to annotate BMH objects even when external remediation is not happening.

Expected results:

MHC should remove the external remediation annotation from BMH objects whose nodes become Ready. It must also consider nodes that still have the annotation when evaluating unhealthy nodes, and stop annotating more BMH objects if external remediation is not happening.
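
A minimal sketch of the expected behaviour follows. It is illustrative only and is not the fix that was merged: the `Host` type, the `reconcile` function, and the `maxUnhealthy` parameter used as a simple integer threshold are all assumptions made for this example. Only the annotation key and value are taken from this report.

```go
package main

import "fmt"

// rebootAnnotation is the annotation key quoted in this bug report.
const rebootAnnotation = "reboot.metal3.io"

// Host is a hypothetical, simplified stand-in for a BMH object and its node.
type Host struct {
	Name        string
	Ready       bool
	Annotations map[string]string
}

// reconcile sketches the expected behaviour:
//  1. remove a stale remediation annotation from hosts whose node is Ready;
//  2. count every host that is not Ready as unhealthy, including hosts that
//     are already annotated and still waiting for remediation;
//  3. annotate further hosts only if that count is within maxUnhealthy.
func reconcile(hosts []*Host, maxUnhealthy int) (cleared, annotated []string) {
	var unhealthy []*Host
	for _, h := range hosts {
		if h.Ready {
			if _, stale := h.Annotations[rebootAnnotation]; stale {
				delete(h.Annotations, rebootAnnotation)
				cleared = append(cleared, h.Name)
			}
			continue
		}
		unhealthy = append(unhealthy, h)
	}

	// Too many unhealthy hosts: do not request any further remediation.
	if len(unhealthy) > maxUnhealthy {
		return cleared, nil
	}

	for _, h := range unhealthy {
		if _, pending := h.Annotations[rebootAnnotation]; pending {
			continue // remediation already requested, do not annotate again
		}
		if h.Annotations == nil {
			h.Annotations = map[string]string{}
		}
		h.Annotations[rebootAnnotation] = `{"mode":"hard"}`
		annotated = append(annotated, h.Name)
	}
	return cleared, annotated
}

func main() {
	hosts := []*Host{
		{Name: "worker-0", Ready: true, Annotations: map[string]string{rebootAnnotation: `{"mode":"hard"}`}},
		{Name: "worker-1", Ready: false},
		{Name: "worker-2", Ready: true},
	}
	cleared, annotated := reconcile(hosts, 1)
	fmt.Println("cleared:", cleared, "annotated:", annotated)
}
```

In this example, worker-0 has recovered and loses its stale annotation, so a metal3 pod that starts later has nothing to act on for that host, while worker-1 is annotated because only one host is unhealthy.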

Additional info:

Comment 3 Zane Bitter 2021-08-05 13:33:34 UTC

Why would the metal3 Pod not be running?

Comment 9 sunzhaohua 2021-08-19 06:52:19 UTC

Hi Amit, I don't have a bare-metal cluster. Could you help test this? Thanks!

Comment 15 errata-xmlrpc 2021-10-18 17:36:21 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
