
Bug 1805874

Summary: Node in MachineSet failing EC2 health check goes NotReady and doesn't automatically recover
Product: OpenShift Container Platform
Reporter: Naveen Malik <nmalik>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Status: CLOSED DUPLICATE
QA Contact: Jianwei Hou <jhou>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.3.0
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-24 08:54:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Naveen Malik 2020-02-21 16:51:31 UTC
Description of problem:
On an OSD 4.3.0 cluster, an infra node went NotReady. The underlying problem was a failed health check on the EC2 instance. Ultimately I had to force stop the EC2 instance to get things to recover.

I tried oc delete machine on the offending Machine. This created a new Machine and Node, but the old Machine and Node did not terminate, so workloads did not move.

I had to go into EC2 and force stop / terminate the instance, at which point the cluster was able to delete the Machine and Node.

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:
Node stuck in NotReady required manual intervention.

Expected results:
Node in a MachineSet with underlying EC2 status check failures is automatically replaced by the platform.


Additional info:

Comment 2 Alberto 2020-02-24 08:54:52 UTC
> Node in a MachineSet with underlying EC2 status check failures is automatically replaced by the platform.

For automatic node recovery, a MachineHealthCheck resource is needed: https://docs.openshift.com/container-platform/4.3/machine_management/deploying-machine-health-checks.html
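
As a rough illustration (not taken from this bug), a minimal MachineHealthCheck along the lines of the linked 4.3 docs might look like the following; the name, target MachineSet label value, timeouts, and maxUnhealthy value are placeholders to adjust for the cluster:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: infra-mhc                      # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # select the Machines owned by the affected MachineSet (placeholder value)
      machine.openshift.io/cluster-api-machineset: <infra-machineset-name>
  unhealthyConditions:
  # remediate when the Node stays NotReady/Unknown longer than the timeout
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  # stop remediating if too many Machines in the pool are unhealthy at once
  maxUnhealthy: "40%"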

In any case, whether the machine is deleted automatically or manually, draining gets stuck because the node to be deleted is unreachable and stateful pods can't signal deletion appropriately: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/.
You can force a machine to skip draining by setting the "machine.openshift.io/exclude-node-draining" annotation on it.
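
For illustration only, a sketch of how that annotation sits on a Machine object; the machine name is a placeholder, and as far as I know the controller only checks for the key's presence, so the value is left empty:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: <failing-machine-name>          # placeholder
  namespace: openshift-machine-api
  annotations:
    # tells the machine controller to skip draining the unreachable node
    machine.openshift.io/exclude-node-draining: ""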

This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1803762, which already has a PR up.

*** This bug has been marked as a duplicate of bug 1803762 ***