Bug 1704587
| Summary: | MachineHealthCheck cannot be associated with target machine if it is created after the node becomes unhealthy | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jianwei Hou <jhou> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | agarcial, pthomas, zhsun |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:28:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
```
I0430 05:36:07.920152 1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282 1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Is this shown before you create the MHC? Is anything shown after you create the MHC and before you restart the kubelet? Could you share logs and clarify the timeline? Thanks!

(In reply to Alberto from comment #1)
> Is this shown before you create the MHC?

Yes, this was shown before I created the MHC. My apologies for not making the timeline clearer. I created the MHC after I stopped the kubelet:

```
# oc get machinehealthcheck -o json | jq -r '.items[0].metadata.creationTimestamp'
2019-08-07T03:18:01Z

# oc logs machine-api-controllers-b4cf6b5bb-n27r8 -c machine-healthcheck-controller -f | grep jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643540 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643587 1 machinehealthcheck_controller.go:141] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has no MachineHealthCheck associated
```

> Is anything shown after you create the MHC and before you restart the kubelet?

Nothing was shown in this period before the kubelet was restarted. According to the timestamps above, the machine-healthcheck-controller no longer printed logs about the stopped node (machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk).

> Could you share logs and clarify the timeline? Thanks!

After restarting the kubelet, logs about the machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk are printed again:

```
I0807 03:41:12.775642 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.775745 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.876714 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 876.704953ms. Requeuing...
I0807 03:41:12.876945 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.877040 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.877364 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 877.339354ms. Requeuing...
I0807 03:41:15.741970 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.743833 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.744309 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.744299689s. Requeuing...
I0807 03:41:15.754335 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.754451 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.754946 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.754936238s. Requeuing...
I0807 03:41:22.807297 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809102 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809544 1 machinehealthcheck_controller.go:253] No remediaton action was taken. Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk with node ip-10-0-129-181.ap-northeast-1.compute.internal is healthy
```
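For reference, the node status update that brought the machine back under the controller's watch came from restarting the kubelet. A minimal sketch of triggering that by hand from the cluster side, assuming `oc debug` access to the node (the node name below is the one from this report):

```
# Restart the kubelet on the affected node. The resulting node status
# update is what triggers the MachineHealthCheck controller to
# reconcile the node (and its machine) again.
oc debug node/ip-10-0-129-181.ap-northeast-1.compute.internal -- \
  chroot /host systemctl restart kubelet
```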
Verified.

```
$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-09-034758   True        False         39m     Cluster version is 4.2.0-0.ci-2019-08-09-034758
```
1. Enable MachineHealthCheck.
2. Stop the kubelet on a node.
3. Create a MachineHealthCheck:

```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsun3-fw4vn
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: zhsun3-fw4vn-worker-us-east-2a
```
4. Monitor the controller log (commands to watch the remediation end to end are sketched after step 5):

```
I0809 08:09:58.109921 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:09:58.210600 1 machinehealthcheck_controller.go:376] ConfigMap node-unhealthy-conditions not found under the namespace openshift-machine-api, fallback to default values: items:
- name: Ready
  status: Unknown
  timeout: 300s
- name: Ready
  status: "False"
  timeout: 300s
W0809 08:09:58.210764 1 machinehealthcheck_controller.go:311] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has unhealthy node ip-10-0-133-109.us-east-2.compute.internal with the condition Ready and the timeout 300s for 17.210754152s. Requeuing...
..
I0809 08:14:42.000974 1 machinehealthcheck_controller.go:301] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has been unhealthy for too long, deleting
I0809 08:14:42.013713 1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.013798 1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.013945 1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.014022 1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033644 1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /ip-10-0-133-109.us-east-2.compute.internal
I0809 08:14:42.033703 1 machinehealthcheck_controller.go:113] Node ip-10-0-133-109.us-east-2.compute.internal is annotated with machine openshift-machine-api/zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033785 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
```
5. Check the nodes:

```
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-14.us-east-2.compute.internal    Ready    worker   18m   v1.14.0+187359a55
ip-10-0-138-209.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-145-242.us-east-2.compute.internal   Ready    worker   71m   v1.14.0+187359a55
ip-10-0-158-148.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-163-203.us-east-2.compute.internal   Ready    worker   27m   v1.14.0+187359a55
ip-10-0-164-49.us-east-2.compute.internal    Ready    master   76m   v1.14.0+187359a55
```
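As referenced in step 4, the remediation can be watched end to end with the commands below. This is a sketch: the controller pod name varies per cluster (the one here is from this report), and the namespace assumes the default machine-api installation.

```
# Watch machines: the unhealthy machine is deleted and the owning
# MachineSet scales a replacement back up.
oc get machine -n openshift-machine-api -w

# Follow the health-check controller log from the machine-api-controllers
# pod (look the pod name up with `oc get pods -n openshift-machine-api`).
oc logs -n openshift-machine-api machine-api-controllers-b4cf6b5bb-n27r8 \
  -c machine-healthcheck-controller -f
```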
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
Description of problem:

A MachineHealthCheck cannot be associated with a target machine if it is created AFTER the node becomes unhealthy (in this case, the kubelet is stopped). In this situation, if the kubelet is restarted, the MachineHealthCheck is immediately associated with the machine. Most of the time, this is reproducible against machines that were created at cluster installation time.

Version-Release number of selected component (if applicable):

4.1.0-0.nightly-2019-04-28-064010

How reproducible:

90%

Steps to Reproduce:

1. Create a feature gate to enable the MachineHealthCheck controller:

```
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: machine-api
spec:
  featureSet: TechPreviewNoUpgrade
```

2. Stop the kubelet on a node (one way to do this is sketched under Additional info below).

3. Create a MachineHealthCheck:

```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: workers-checka
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou1-5kkht
      machine.openshift.io/cluster-api-machineset: jhou1-5kkht-worker-ap-northeast-1a
```

4. Monitor the controller log.

Actual results:

The controller keeps showing the following message for a very long time, until the node is recovered by starting the kubelet:

```
I0430 05:36:07.920152 1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282 1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Expected results:

Should show that the MachineHealthCheck is initiated for the machine.

Additional info:
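For step 2 of the reproduction, one way to stop the kubelet without SSH access is via `oc debug`. This is a sketch, assuming the node name from this report; any worker node covered by the MachineHealthCheck selector works.

```
# Stop the kubelet so the node eventually transitions to NotReady.
oc debug node/ip-10-0-136-110.ap-northeast-1.compute.internal -- \
  chroot /host systemctl stop kubelet

# Confirm the node goes NotReady (takes roughly the node-monitor
# grace period, ~40s by default).
oc get node ip-10-0-136-110.ap-northeast-1.compute.internal -w
```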