Description of problem:
Have a machine enter an unhealthy state (by stopping hyperkube); the machine health check triggers remediation, but the node cannot be successfully deleted.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Enable TechPreviewNoUpgrade feature gate
2. Create MHC
3. Create a privileged pod to kill the hyperkube process on the node that the MHC is associated with.
4. Monitor the machine-healthcheck-controller log
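For reference, the MHC from step 2 might look like the minimal sketch below. The name, label selector, timeout, and maxUnhealthy values are assumptions for illustration, not taken from this report:

```yaml
# Minimal MachineHealthCheck sketch (values are placeholders)
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc                    # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # assumed label; match your worker MachineSet's labels
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
  maxUnhealthy: 40%
```

Killing hyperkube on a matching node should drive its Ready condition to Unknown/False past the timeout, which triggers remediation as in step 4.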
After step 3:
After the node became 'NotReady', a new node was created and added to the cluster, but the unhealthy node could not be deleted. The machine-healthcheck-controller logged that it was 'deleting' the machine, but it never got deleted.
I0826 05:58:58.031790 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.032205 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
I0826 05:58:58.041371 1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal
I0826 05:58:58.041422 1 machinehealthcheck_controller.go:113] Node jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal is annotated with machine openshift-machine-api/jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042342 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042749 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
I thought it might be caused by bug 1733474, but there isn't any message in the log about the node being drained. However, after annotating the node with machine.openshift.io/exclude-node-draining="", the node got deleted.
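The workaround described above can be sketched as follows. The object name is taken from the logs in this report; note that, depending on the release, the machine controller may expect this annotation on the Machine object rather than the Node:

```shell
# Skip node draining so the stuck machine can be removed (workaround sketch).
# The annotation key is machine.openshift.io/exclude-node-draining; an empty
# value is sufficient. Swap "node" for "machine -n openshift-machine-api" if
# your release reads the annotation from the Machine object.
oc annotate node jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal \
  machine.openshift.io/exclude-node-draining=""
```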
Expected results:
The unhealthy node is deleted.
Jianwei Hou, can you share the machine controller logs? It looks like the node is not being drained properly, or draining is just taking too long.
> I thought it might be cause by 1733474, but there isn't any message from the log about the node being drained.
machinehealthcheck_controller.go runs independently of the machine controller, so you will see no messages about node draining in its log.
Jianwei Hou, how many worker nodes were available in your cluster before the machine was requested for deletion?
Based on our understanding, this is a draining issue and almost certainly a duplicate of bug 1733474.
I'm marking it as such. If you disagree, let us know.
*** This bug has been marked as a duplicate of bug 1733474 ***