Description of problem:
MachineHealthCheck cannot be associated with a target machine if it is created AFTER the node becomes unhealthy (in this case, the kubelet is stopped). In this situation, if the kubelet is restarted, the MachineHealthCheck is immediately associated with the machine. Most of the time this is reproducible against machines that were created at cluster installation time.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-04-28-064010

How reproducible:
90%

Steps to Reproduce:
1. Create a FeatureGate to enable the MachineHealthCheck controller:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: machine-api
spec:
  featureSet: TechPreviewNoUpgrade

2. Stop the kubelet on a node.
3. Create a MachineHealthCheck:

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: workers-checka
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou1-5kkht
      machine.openshift.io/cluster-api-machineset: jhou1-5kkht-worker-ap-northeast-1a

4. Monitor the controller log.

Actual results:
The controller keeps showing the following message for a very long time, until the node is recovered by starting the kubelet:

```
I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Expected results:
The controller should associate the MachineHealthCheck with the machine and initiate remediation.

Additional info:
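The association that fails here is, conceptually, a label-selector match: the MachineHealthCheck's spec.selector.matchLabels must match the target machine's labels. The sketch below illustrates that matching rule only; it is a minimal Python illustration, not the actual controller code (which is Go, in machinehealthcheck_controller.go), and `selector_matches` is a hypothetical helper name.

```python
# Sketch of matchLabels selection: every key/value pair in the MHC
# selector must be present, with the same value, in the machine's labels
# for the MHC to be associated with that machine.

def selector_matches(match_labels, machine_labels):
    """Return True if every selector label is present on the machine."""
    return all(machine_labels.get(k) == v for k, v in match_labels.items())

# Selector from the MachineHealthCheck in the reproduction steps.
mhc_selector = {
    "machine.openshift.io/cluster-api-cluster": "jhou1-5kkht",
    "machine.openshift.io/cluster-api-machineset": "jhou1-5kkht-worker-ap-northeast-1a",
}

# Labels a worker machine in that machineset would carry (superset is fine).
machine_labels = {
    "machine.openshift.io/cluster-api-cluster": "jhou1-5kkht",
    "machine.openshift.io/cluster-api-machine-role": "worker",
    "machine.openshift.io/cluster-api-machineset": "jhou1-5kkht-worker-ap-northeast-1a",
}

print(selector_matches(mhc_selector, machine_labels))   # True
print(selector_matches(
    {"machine.openshift.io/cluster-api-cluster": "other"},
    machine_labels))                                    # False
```

The bug is that this association is only (re-)evaluated on certain events, so an MHC created after the node already went unhealthy is not picked up until the node changes state again.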
```
I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Is this shown before you create the MHC? Is anything shown after you create the MHC and before restarting the kubelet? Could you share logs and clarify the timeline? Thanks!
(In reply to Alberto from comment #1)
> ```
> I0430 05:36:07.920152       1 machinehealthcheck_controller.go:112] Node
> ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine
> openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
> I0430 05:36:07.920282       1 machinehealthcheck_controller.go:151] Machine
> jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
> ```
>
> Is this shown before you create the MHC?

Yes, this was shown before I created the MHC. My apologies for not making the timeline clearer. I created the MHC after I stopped the kubelet.

# oc get machinehealthcheck -o json | jq -r '.items[0].metadata.creationTimestamp'
2019-08-07T03:18:01Z

# oc logs machine-api-controllers-b4cf6b5bb-n27r8 -c machine-healthcheck-controller -f | grep jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643540       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643587       1 machinehealthcheck_controller.go:141] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has no MachineHealthCheck associated

> Is anything shown after you create the MHC and before restart kubelet?

Nothing was shown in this period before restarting the kubelet. According to the timestamps above, the machine-healthcheck-controller no longer printed logs about the stopped node (machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk).

> could you share logs and clarify time line? thanks!
After restarting the kubelet, logs about the machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk are printed again:

```
I0807 03:41:12.775642       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.775745       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.876714       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 876.704953ms. Requeuing...
I0807 03:41:12.876945       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.877040       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.877364       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 877.339354ms. Requeuing...
I0807 03:41:15.741970       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.743833       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.744309       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.744299689s. Requeuing...
I0807 03:41:15.754335       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.754451       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.754946       1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.754936238s. Requeuing...
I0807 03:41:22.807297       1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809102       1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809544       1 machinehealthcheck_controller.go:253] No remediaton action was taken. Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk with node ip-10-0-129-181.ap-northeast-1.compute.internal is healthy
```
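The "timeout 300s for Xs. Requeuing..." log lines reflect a timing decision: remediation only starts once the unhealthy condition has persisted longer than the configured timeout; until then the reconcile is requeued. A minimal Python sketch of that decision follows; this is assumed logic under stated assumptions, not the controller's actual Go source, and `remediation_due` is a hypothetical helper name.

```python
# Sketch of the remediation timing check: compare how long the node's
# condition has been unhealthy (now - lastTransitionTime) against the
# configured timeout (300s by default for the Ready condition).

from datetime import datetime, timedelta

TIMEOUT = timedelta(seconds=300)  # default Ready-condition timeout

def remediation_due(last_transition, now):
    """Return (due, unhealthy_for): whether the timeout has elapsed."""
    unhealthy_for = now - last_transition
    return unhealthy_for >= TIMEOUT, unhealthy_for

t0 = datetime(2019, 8, 7, 3, 41, 12)  # condition went unhealthy

# A few seconds in: not due yet, so the controller would requeue.
due, d = remediation_due(t0, t0 + timedelta(seconds=4))
print(due, d.total_seconds())   # False 4.0

# Past the 300s timeout: remediation (machine deletion) would start.
due, d = remediation_due(t0, t0 + timedelta(seconds=301))
print(due, d.total_seconds())   # True 301.0
```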
Verified.

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-09-034758   True        False         39m     Cluster version is 4.2.0-0.ci-2019-08-09-034758

1. Enable MachineHealthCheck.
2. Stop the kubelet on a node.
3. Create a MachineHealthCheck:

apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsun3-fw4vn
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: zhsun3-fw4vn-worker-us-east-2a

4. Monitor the controller log:

```
I0809 08:09:58.109921       1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:09:58.210600       1 machinehealthcheck_controller.go:376] ConfigMap node-unhealthy-conditions not found under the namespace openshift-machine-api, fallback to default values: items:
- name: Ready
  status: Unknown
  timeout: 300s
- name: Ready
  status: "False"
  timeout: 300s
W0809 08:09:58.210764       1 machinehealthcheck_controller.go:311] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has unhealthy node ip-10-0-133-109.us-east-2.compute.internal with the condition Ready and the timeout 300s for 17.210754152s. Requeuing...
...
I0809 08:14:42.000974       1 machinehealthcheck_controller.go:301] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has been unhealthy for too long, deleting
I0809 08:14:42.013713       1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.013798       1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.013945       1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.014022       1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033644       1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /ip-10-0-133-109.us-east-2.compute.internal
I0809 08:14:42.033703       1 machinehealthcheck_controller.go:113] Node ip-10-0-133-109.us-east-2.compute.internal is annotated with machine openshift-machine-api/zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033785       1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
```

5. Check the nodes; the unhealthy node ip-10-0-133-109.us-east-2.compute.internal is no longer present:

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-14.us-east-2.compute.internal    Ready    worker   18m   v1.14.0+187359a55
ip-10-0-138-209.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-145-242.us-east-2.compute.internal   Ready    worker   71m   v1.14.0+187359a55
ip-10-0-158-148.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-163-203.us-east-2.compute.internal   Ready    worker   27m   v1.14.0+187359a55
ip-10-0-164-49.us-east-2.compute.internal    Ready    master   76m   v1.14.0+187359a55
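The verification log shows the controller falling back to default unhealthy-condition values when the node-unhealthy-conditions ConfigMap is absent (Ready/Unknown and Ready/False, each with a 300s timeout). The sketch below illustrates how a node's current conditions would be checked against that list; it is a hedged Python illustration of the assumed logic, not the controller's Go source, and `matching_unhealthy_conditions` is a hypothetical helper name.

```python
# Default fallback values, mirroring the controller log above.
DEFAULT_UNHEALTHY = [
    {"name": "Ready", "status": "Unknown", "timeout": 300},
    {"name": "Ready", "status": "False", "timeout": 300},
]

def matching_unhealthy_conditions(node_conditions):
    """Return the default entries matching the node's current conditions.

    node_conditions maps condition name -> current status string.
    """
    return [
        c for c in DEFAULT_UNHEALTHY
        if node_conditions.get(c["name"]) == c["status"]
    ]

# Node with the kubelet stopped: Ready eventually flips to Unknown,
# which matches one default entry, starting its 300s timeout.
print(matching_unhealthy_conditions({"Ready": "Unknown"}))

# Healthy node: no entry matches, so nothing to remediate.
print(matching_unhealthy_conditions({"Ready": "True"}))   # []
```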
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922