Description of problem:
On an OSD 4.3.0 cluster, an infra node went NotReady. The underlying problem was a failed status check on the EC2 instance. Ultimately I had to force stop the EC2 instance to get things to recover.
I tried oc delete machine on the offending instance. This created a new Machine and Node, but the old Machine and Node did not terminate, which means workloads did not move.
I had to go into EC2 and force stop / terminate the instance, at which point the cluster was able to delete the Machine and Node.
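The manual recovery sequence described above can be sketched as follows. This is illustrative only; the machine name and instance ID are placeholders for the values from this incident.

```shell
# Identify the Machine backing the NotReady node:
oc get machines -n openshift-machine-api -o wide

# Delete the Machine. In this incident, this created a replacement
# but left the old Machine and Node behind while the drain hung:
oc delete machine <machine-name> -n openshift-machine-api

# What ultimately unblocked things: force-stopping the instance on the
# AWS side, after which the cluster could finish deleting the Machine
# and Node. <instance-id> is a placeholder:
aws ec2 stop-instances --instance-ids <instance-id> --force
```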
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Actual results:
Node stuck in NotReady required manual intervention.

Expected results:
Node in a MachineSet with underlying EC2 status check failures is automatically replaced by the platform.
> Node in a MachineSet with underlying EC2 status check failures is automatically replaced by the platform.
For automatic node recovery, a MachineHealthCheck resource is needed: https://docs.openshift.com/container-platform/4.3/machine_management/deploying-machine-health-checks.html
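As a sketch of what such a resource might look like (the name, namespace label values, and thresholds below are placeholders, not values from this cluster):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: infra-health-check            # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # placeholder: match the labels on the infra MachineSet's machines
      machine.openshift.io/cluster-api-machine-role: infra
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
  maxUnhealthy: 1
```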
In either case (deleting the machine automatically or draining manually), the drain gets stuck because the node to be deleted is unreachable and stateful pods can't signal deletion appropriately: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/.
You can force a machine to skip draining by setting the "machine.openshift.io/exclude-node-draining" annotation on it.
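For example (the machine name below is a placeholder; annotating the Machine tells the controller to skip draining the unreachable node):

```shell
oc annotate machine <machine-name> -n openshift-machine-api \
  machine.openshift.io/exclude-node-draining=""
```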
This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1803762, which already has a PR up.
*** This bug has been marked as a duplicate of bug 1803762 ***