Description of problem:
When a machine enters an unhealthy state (hyperkube stopped), the machine health check triggers remediation, but the node cannot be successfully deleted.

Version-Release number of selected component (if applicable):
openshift-machine-api/jhou-blmrh-w-b-drppv

How reproducible:
Always

Steps to Reproduce:
1. Enable the TechPreviewNoUpgrade feature gate:
```
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
```
2. Create an MHC:
```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: mhc
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou-blmrh
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: jhou-blmrh-w-a
```
3. Create a privileged pod to kill hyperkube on the node the MHC is associated with (a sketch of such a pod is included under Additional info).
4. Monitor the machine-healthcheck-controller log.

Actual results:
After step 3, the node becomes 'NotReady' and a new node is created and added to the cluster, but the unhealthy node cannot be deleted. The machine-healthcheck-controller logged that it was 'deleting' the node, but it never got deleted.
```
I0826 05:58:58.031790 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.032205 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
I0826 05:58:58.041371 1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal
I0826 05:58:58.041422 1 machinehealthcheck_controller.go:113] Node jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal is annotated with machine openshift-machine-api/jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042342 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042749 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
```
I thought it might be caused by bug 1733474, but there isn't any message from the log about the node being drained. However, after annotating the node with machine.openshift.io/exclude-node-draining="", the node was deleted.

Expected results:
The unhealthy node is deleted.

Additional info:
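For step 3, a minimal sketch of one way to stop hyperkube from a privileged pod. This assumes the kubelet runs as the hyperkube binary on the host; the pod name, namespace, image, and the pkill approach are illustrative only, not the exact manifest used in this report:
```
# Illustrative only: a privileged pod pinned to the unhealthy node that uses
# the host PID namespace to stop the hyperkube (kubelet) process.
apiVersion: v1
kind: Pod
metadata:
  name: kill-hyperkube        # hypothetical name
  namespace: default          # assumed namespace with the privileged SCC available
spec:
  nodeName: jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal  # node from the logs above
  hostPID: true
  restartPolicy: Never
  containers:
  - name: killer
    image: registry.access.redhat.com/ubi8/ubi   # placeholder; any image providing pkill works
    securityContext:
      privileged: true
    # SIGSTOP suspends the kubelet so it is not immediately restarted,
    # leaving the node NotReady long enough for the MHC to react.
    command: ["/bin/sh", "-c", "pkill -STOP hyperkube && sleep 3600"]
```
Once the node has been NotReady for longer than the MHC's unhealthy timeout, the remediation shown in the logs above is triggered.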
Jianwei Hou, can you share the machine controller logs? It looks like the node is not being drained properly, or draining just takes too long.

> I thought it might be caused by bug 1733474, but there isn't any message from the log about the node being drained.

machinehealthcheck_controller.go runs independently of the machine controller, so you will not see any message about node draining there.
Jianwei Hou, how many worker nodes were available in your cluster before the machine was requested to be deleted?
Based on our understanding, this is a draining issue and is almost certainly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1733474, so I'm marking it as such. If you disagree, let us know.

*** This bug has been marked as a duplicate of bug 1733474 ***