Bug 1704587 - MachineHealthCheck cannot be associated with target machine if it is created after the node becomes unhealthy
Summary: MachineHealthCheck cannot be associated with target machine if it is created after the node becomes unhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-30 07:14 UTC by Jianwei Hou
Modified: 2019-10-16 06:28 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:21 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- GitHub: openshift/machine-api-operator pull 371 (closed): "Bug 1704587: Reconcile nodes on MHC changes" (last updated 2020-08-21 07:20:05 UTC)
- Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:28:33 UTC)

Description Jianwei Hou 2019-04-30 07:14:37 UTC
Description of problem:
MachineHealthCheck cannot be associated with a target machine if it is created AFTER the node becomes unhealthy (in this case, the kubelet is stopped). In this situation, if the kubelet is restarted, I could see that the MachineHealthCheck is immediately associated with the machine.
Most of the time, this is reproducible against the machines that were created at cluster installation time.


Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-04-28-064010

How reproducible:
90%

Steps to Reproduce:
1. Create a FeatureGate to enable the MachineHealthCheck controller:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: machine-api
spec:
  featureSet: TechPreviewNoUpgrade
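A hedged way to apply the FeatureGate above (the file name is illustrative, not from the report):

```shell
# Assumption: the FeatureGate manifest above is saved as featuregate.yaml
# (name is illustrative). TechPreviewNoUpgrade enables the Tech Preview
# feature set, which includes the MachineHealthCheck controller.
oc apply -f featuregate.yaml
```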

2. Stop the kubelet on a node
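One way to do this without SSH access, using `oc debug` (the node name is a placeholder for whichever node is chosen):

```shell
# Placeholder node name; substitute the node whose kubelet you want to stop.
# `oc debug node/...` starts a debug pod on the node, and `chroot /host`
# runs the command against the host's systemd.
oc debug node/<node-name> -- chroot /host systemctl stop kubelet
```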

3. Create a machinehealthcheck

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: workers-checka
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou1-5kkht
      machine.openshift.io/cluster-api-machineset: jhou1-5kkht-worker-ap-northeast-1a

4. Monitor the controller log

Actual results:
The controller keeps showing the following message for a very long time, until the node is recovered by restarting the kubelet.

```
I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```


Expected results:
The controller log should show that MachineHealthCheck remediation is initiated for the machine.

Additional info:
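The behaviour is consistent with the controller reconciling only on node events. A toy model of the suspected root cause (an assumption for illustration, not the actual machine-api-operator code):

```python
# Toy model. Assumption: the v1alpha1 controller only reconciles a machine
# when a *node* event arrives, and looks up matching MachineHealthChecks at
# that moment. This is illustrative, not the real controller code.

events = []        # queue of pending node events
mhcs = []          # selectors of existing MachineHealthChecks
remediating = []   # machines the controller started remediating

def node_event(machine):
    events.append(machine)

def create_mhc(selector):
    # Bug: creating an MHC enqueues no node/machine events, so machines
    # already in a bad state are never re-examined.
    mhcs.append(selector)

def reconcile():
    while events:
        machine = events.pop(0)
        if any(matches(machine) for matches in mhcs):
            remediating.append(machine)
        # else: "Machine ... has no MachineHealthCheck associated"

# Timeline from the report: the node goes unhealthy (its last event fires)
# BEFORE the MHC exists.
node_event("worker-a")
reconcile()                            # no MHC matches yet
create_mhc(lambda m: m == "worker-a")
reconcile()                            # queue is empty: worker-a never re-checked
print(remediating)                     # prints [] until a new node event arrives
```

The fix merged for this bug (PR 371, "Reconcile nodes on MHC changes") follows the same idea: MHC create/update events are mapped back to node reconcile requests, so already-unhealthy machines are picked up.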

Comment 1 Alberto 2019-08-06 17:50:34 UTC
```
I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Is this shown before you create the MHC?
Is anything shown after you create the MHC and before restarting the kubelet?
Could you share logs and clarify the timeline? Thanks!

Comment 2 Jianwei Hou 2019-08-07 04:05:12 UTC
(In reply to Alberto from comment #1)
> ```
> I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node
> ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine
> openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
> I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine
> jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
> ```
> 
> Is this shown before you create the MHC?
Yes, this was shown before I created the MHC. My apologies that I didn't make the timeline clearer: I created the MHC after I stopped the kubelet.

# oc get machinehealthcheck -o json|jq -r '.items[0].metadata.creationTimestamp'
2019-08-07T03:18:01Z 

# oc logs machine-api-controllers-b4cf6b5bb-n27r8 -c machine-healthcheck-controller -f | grep jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643540       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643587       1 machinehealthcheck_controller.go:141] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has no MachineHealthCheck associated


> Is anything shown after you create the MHC and before restarting the kubelet?
Nothing was shown during this period, before the kubelet was restarted.
According to the timestamps above, the machine-healthcheck-controller no longer printed logs about the stopped node (machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk).

> Could you share logs and clarify the timeline? Thanks!
After restarting the kubelet, logs about the machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk are printed again:

```
I0807 03:41:12.775642       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.775745       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.876714       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 876.704953ms. Requeuing...
I0807 03:41:12.876945       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.877040       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.877364       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 877.339354ms. Requeuing...
I0807 03:41:15.741970       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.743833       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.744309       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.744299689s. Requeuing...
I0807 03:41:15.754335       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.754451       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.754946       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.754936238s. Requeuing...
I0807 03:41:22.807297       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809102       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809544       1 machinehealthcheck_controller.go:253] No remediaton action was taken. Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk with node ip-10-0-129-181.ap-northeast-1.compute.internal is healthy
```

Comment 4 sunzhaohua 2019-08-09 08:40:13 UTC
Verified.

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-09-034758   True        False         39m     Cluster version is 4.2.0-0.ci-2019-08-09-034758

1. Enable MachineHealthCheck 
2. Stop the kubelet on a node
3. Create a machinehealthcheck

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsun3-fw4vn
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: zhsun3-fw4vn-worker-us-east-2a

4. Monitor the controller log

I0809 08:09:58.109921       1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:09:58.210600       1 machinehealthcheck_controller.go:376] ConfigMap node-unhealthy-conditions not found under the namespace openshift-machine-api, fallback to default values: items:
- name: Ready
  status: Unknown
  timeout: 300s
- name: Ready
  status: "False"
  timeout: 300s
W0809 08:09:58.210764       1 machinehealthcheck_controller.go:311] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has unhealthy node ip-10-0-133-109.us-east-2.compute.internal with the condition Ready and the timeout 300s for 17.210754152s. Requeuing...
..

I0809 08:14:42.000974       1 machinehealthcheck_controller.go:301] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has been unhealthy for too long, deleting
I0809 08:14:42.013713       1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.013798       1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.013945       1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.014022       1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033644       1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /ip-10-0-133-109.us-east-2.compute.internal
I0809 08:14:42.033703       1 machinehealthcheck_controller.go:113] Node ip-10-0-133-109.us-east-2.compute.internal is annotated with machine openshift-machine-api/zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033785       1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns


5. Check node

$ oc get node
NAME                                         STATUS                        ROLES    AGE   VERSION
ip-10-0-136-14.us-east-2.compute.internal    Ready                         worker   18m   v1.14.0+187359a55
ip-10-0-138-209.us-east-2.compute.internal   Ready                         master   76m   v1.14.0+187359a55
ip-10-0-145-242.us-east-2.compute.internal   Ready                         worker   71m   v1.14.0+187359a55
ip-10-0-158-148.us-east-2.compute.internal   Ready                         master   76m   v1.14.0+187359a55
ip-10-0-163-203.us-east-2.compute.internal   Ready                         worker   27m   v1.14.0+187359a55
ip-10-0-164-49.us-east-2.compute.internal    Ready                         master   76m   v1.14.0+187359a55

Comment 5 errata-xmlrpc 2019-10-16 06:28:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

