Bug 1704587
| Summary: | MachineHealthCheck cannot be associated with target machine if it is created after the node becomes unhealthy | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jianwei Hou <jhou> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | agarcial, pthomas, zhsun |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:28:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
```
I0430 05:36:07.920152 1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282 1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Is this shown before you create the MHC? Is anything shown after you create the MHC and before you restart the kubelet? Could you share logs and clarify the timeline? Thanks!

(In reply to Alberto from comment #1)
> Is this shown before you create the MHC?

Yes, this was shown before I created the MHC. My apologies for not making the timeline clearer. I created the MHC after I stopped the kubelet:

```
# oc get machinehealthcheck -o json | jq -r '.items[0].metadata.creationTimestamp'
2019-08-07T03:18:01Z

# oc logs machine-api-controllers-b4cf6b5bb-n27r8 -c machine-healthcheck-controller -f | grep jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643540 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:17:45.643587 1 machinehealthcheck_controller.go:141] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has no MachineHealthCheck associated
```

> Is anything shown after you create the MHC and before you restart the kubelet?

Nothing was shown in this period before the kubelet was restarted. According to the timestamps above, the machine-healthcheck-controller no longer printed logs about the stopped node (machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk).

> Could you share logs and clarify the timeline? Thanks!

After restarting the kubelet, logs about the machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk are printed again:

```
I0807 03:41:12.775642 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.775745 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.876714 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 876.704953ms. Requeuing...
I0807 03:41:12.876945 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:12.877040 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:12.877364 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 877.339354ms. Requeuing...
I0807 03:41:15.741970 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.743833 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.744309 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.744299689s. Requeuing...
I0807 03:41:15.754335 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:15.754451 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
W0807 03:41:15.754946 1 machinehealthcheck_controller.go:228] Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk has unhealthy node ip-10-0-129-181.ap-northeast-1.compute.internal with the condition Ready and the timeout 300s for 3.754936238s. Requeuing...
I0807 03:41:22.807297 1 machinehealthcheck_controller.go:102] Node ip-10-0-129-181.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809102 1 machinehealthcheck_controller.go:159] Initialising remediation logic for machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk
I0807 03:41:22.809544 1 machinehealthcheck_controller.go:253] No remediaton action was taken. Machine jhou1-2vhkz-worker-ap-northeast-1a-4r7pk with node ip-10-0-129-181.ap-northeast-1.compute.internal is healthy
```
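For reference, the node status update that brought the machine back under the controller's watch came from restarting the kubelet. A minimal sketch of triggering that by hand from the cluster side, assuming `oc debug` access to the node (the node name below is the one from this report):

```
# Restart the kubelet on the affected node. The resulting node status
# update is what triggers the MachineHealthCheck controller to
# reconcile the node (and its machine) again.
oc debug node/ip-10-0-129-181.ap-northeast-1.compute.internal -- \
  chroot /host systemctl restart kubelet
```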
Verified.

```
$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-09-034758   True        False         39m     Cluster version is 4.2.0-0.ci-2019-08-09-034758
```
1. Enable MachineHealthCheck.
2. Stop the kubelet on a node.
3. Create a MachineHealthCheck:

```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsun3-fw4vn
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: zhsun3-fw4vn-worker-us-east-2a
```
4. Monitor the controller log (commands to watch the remediation end to end are sketched after step 5):

```
I0809 08:09:58.109921 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:09:58.210600 1 machinehealthcheck_controller.go:376] ConfigMap node-unhealthy-conditions not found under the namespace openshift-machine-api, fallback to default values: items:
- name: Ready
  status: Unknown
  timeout: 300s
- name: Ready
  status: "False"
  timeout: 300s
W0809 08:09:58.210764 1 machinehealthcheck_controller.go:311] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has unhealthy node ip-10-0-133-109.us-east-2.compute.internal with the condition Ready and the timeout 300s for 17.210754152s. Requeuing...
..
I0809 08:14:42.000974 1 machinehealthcheck_controller.go:301] Machine zhsun3-fw4vn-worker-us-east-2a-zcgns has been unhealthy for too long, deleting
I0809 08:14:42.013713 1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.013798 1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.013945 1 disruption.go:446] Probably machine zhsun3-fw4vn-worker-us-east-2a-zcgns was deleted, uses a dummy machine to get MDB object
E0809 08:14:42.014022 1 disruption.go:454] Unable to find MachineDisruptionBudget for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033644 1 machinehealthcheck_controller.go:90] Reconciling MachineHealthCheck triggered by /ip-10-0-133-109.us-east-2.compute.internal
I0809 08:14:42.033703 1 machinehealthcheck_controller.go:113] Node ip-10-0-133-109.us-east-2.compute.internal is annotated with machine openshift-machine-api/zhsun3-fw4vn-worker-us-east-2a-zcgns
I0809 08:14:42.033785 1 machinehealthcheck_controller.go:242] Initialising remediation logic for machine zhsun3-fw4vn-worker-us-east-2a-zcgns
```
5. Check the nodes:

```
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-14.us-east-2.compute.internal    Ready    worker   18m   v1.14.0+187359a55
ip-10-0-138-209.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-145-242.us-east-2.compute.internal   Ready    worker   71m   v1.14.0+187359a55
ip-10-0-158-148.us-east-2.compute.internal   Ready    master   76m   v1.14.0+187359a55
ip-10-0-163-203.us-east-2.compute.internal   Ready    worker   27m   v1.14.0+187359a55
ip-10-0-164-49.us-east-2.compute.internal    Ready    master   76m   v1.14.0+187359a55
```
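As referenced in step 4, the remediation can be watched end to end with the commands below. This is a sketch: the controller pod name varies per cluster (the one here is from this report), and the namespace assumes the default machine-api installation.

```
# Watch machines: the unhealthy machine is deleted and the owning
# MachineSet scales a replacement back up.
oc get machine -n openshift-machine-api -w

# Follow the health-check controller log from the machine-api-controllers
# pod (look the pod name up with `oc get pods -n openshift-machine-api`).
oc logs -n openshift-machine-api machine-api-controllers-b4cf6b5bb-n27r8 \
  -c machine-healthcheck-controller -f
```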
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
Description of problem:

A MachineHealthCheck cannot be associated with a target machine if it is created AFTER the node becomes unhealthy (in this case, the kubelet is stopped). In this situation, if the kubelet is restarted, the MachineHealthCheck is immediately associated with the machine. Most of the time, this is reproducible against machines that were created at cluster installation time.

Version-Release number of selected component (if applicable):

4.1.0-0.nightly-2019-04-28-064010

How reproducible:

90%

Steps to Reproduce:

1. Create a feature gate to enable the MachineHealthCheck controller:

```
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: machine-api
spec:
  featureSet: TechPreviewNoUpgrade
```

2. Stop the kubelet on a node (one way to do this is sketched under Additional info below).

3. Create a MachineHealthCheck:

```
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: workers-checka
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: jhou1-5kkht
      machine.openshift.io/cluster-api-machineset: jhou1-5kkht-worker-ap-northeast-1a
```

4. Monitor the controller log.

Actual results:

The controller keeps showing the following message for a very long time, until the node is recovered by starting the kubelet:

```
I0430 05:36:07.920152 1 machinehealthcheck_controller.go:112] Node ip-10-0-136-110.ap-northeast-1.compute.internal is annotated with machine openshift-machine-api/jhou1-5kkht-worker-ap-northeast-1a-wxxfh
I0430 05:36:07.920282 1 machinehealthcheck_controller.go:151] Machine jhou1-5kkht-worker-ap-northeast-1a-wxxfh has no MachineHealthCheck associated
```

Expected results:

Should show that the MachineHealthCheck is initiated for the machine.

Additional info:
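For step 2 of the reproduction, one way to stop the kubelet without SSH access is via `oc debug`. This is a sketch, assuming the node name from this report; any worker node covered by the MachineHealthCheck selector works.

```
# Stop the kubelet so the node eventually transitions to NotReady.
oc debug node/ip-10-0-136-110.ap-northeast-1.compute.internal -- \
  chroot /host systemctl stop kubelet

# Confirm the node goes NotReady (takes roughly the node-monitor
# grace period, ~40s by default).
oc get node ip-10-0-136-110.ap-northeast-1.compute.internal -w
```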