Description of problem:
When a machine enters an unhealthy state (hyperkube stopped), the machine health check triggers remediation, but the node cannot be successfully deleted.

Version-Release number of selected component (if applicable):
openshift-machine-api/jhou-blmrh-w-b-drppv

How reproducible:
Always

Steps to Reproduce:
1. Enable the TechPreviewNoUpgrade feature gate:
```
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
```
2. Create an MHC:
```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: mhc
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou-blmrh
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: jhou-blmrh-w-a
```
3. Create a privileged pod to kill hyperkube on the node the MHC is associated with (a sketch of such a pod is included under Additional info).
4. Monitor the machine-healthcheck-controller log.

Actual results:
After step 3, the node becomes 'NotReady' and a new node is created and added to the cluster, but the unhealthy node cannot be deleted. The machine-healthcheck-controller logged that it was 'deleting' the node, but it never got deleted.
```
I0826 05:58:58.031790 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.032205 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
I0826 05:58:58.041371 1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal
I0826 05:58:58.041422 1 machinehealthcheck_controller.go:113] Node jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal is annotated with machine openshift-machine-api/jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042342 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine jhou-blmrh-w-a-l6mjs
I0826 05:58:58.042749 1 machinehealthcheck_controller.go:301] Machine jhou-blmrh-w-a-l6mjs has been unhealthy for too long, deleting
```
I thought it might be caused by bug 1733474, but there isn't any message from the log about the node being drained. However, after annotating the node with machine.openshift.io/exclude-node-draining="", the node was deleted.

Expected results:
The unhealthy node is deleted.

Additional info:
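For step 3, a minimal sketch of one way to stop hyperkube from a privileged pod. This assumes the kubelet runs as the hyperkube binary on the host; the pod name, namespace, image, and the pkill approach are illustrative only, not the exact manifest used in this report:
```
# Illustrative only: a privileged pod pinned to the unhealthy node that uses
# the host PID namespace to stop the hyperkube (kubelet) process.
apiVersion: v1
kind: Pod
metadata:
  name: kill-hyperkube        # hypothetical name
  namespace: default          # assumed namespace with the privileged SCC available
spec:
  nodeName: jhou-blmrh-w-a-l6mjs.c.openshift-gce-devel.internal  # node from the logs above
  hostPID: true
  restartPolicy: Never
  containers:
  - name: killer
    image: registry.access.redhat.com/ubi8/ubi   # placeholder; any image providing pkill works
    securityContext:
      privileged: true
    # SIGSTOP suspends the kubelet so it is not immediately restarted,
    # leaving the node NotReady long enough for the MHC to react.
    command: ["/bin/sh", "-c", "pkill -STOP hyperkube && sleep 3600"]
```
Once the node has been NotReady for longer than the MHC's unhealthy timeout, the remediation shown in the logs above is triggered.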
Jianwei Hou, can you share the machine controller logs? It looks like the node is not being drained properly, or draining just takes too long.

> I thought it might be caused by bug 1733474, but there isn't any message from the log about the node being drained.

machinehealthcheck_controller.go runs independently of the machine controller, so you will not see any message about node draining there.
Jianwei Hou, how many worker nodes were available in your cluster before the machine was requested to be deleted?
Based on our understanding, this is a draining issue and is almost certainly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1733474, so I'm marking it as such. If you disagree, let us know.

*** This bug has been marked as a duplicate of bug 1733474 ***