Description of problem:
The KubeletHealthState alert keeps firing. The alert fired but never resolved after the underlying issue was gone.

Version-Release number of selected component (if applicable):
4.3.12

Steps to Reproduce:
1. Trigger the Prometheus KubeletHealthState alert
2. Resolve the cause of KubeletHealthState
3.

Actual results:
The KubeletHealthState alert keeps firing.

Expected results:
No Prometheus alerts are firing once the cause is resolved.

Additional info:
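A rough sketch of one way to check whether the alert is still firing, querying the in-cluster Prometheus from the prometheus-k8s-0 pod (this assumes the prometheus-k8s serviceaccount token has access to the monitoring stack; on newer oc versions `oc create token prometheus-k8s` replaces `sa get-token`):

$ token=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -H "Authorization: Bearer $token" \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "KubeletHealthState")'

An empty result means the alert is no longer active; in this case it kept showing up even after the cause was fixed.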
Please attach a must gather for the cluster
Is it possible that this[1] has something to do with it?

[1] - https://github.com/openshift/machine-config-operator/blob/release-4.3/pkg/daemon/daemon.go#L628
@Pablo @shishika Please provide a must gather for the affected clusters. If it is too large you can use gdrive and link to this BZ. Thanks.
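For reference, the usual collection looks roughly like this (the destination directory name is just an example):

$ oc adm must-gather --dest-dir=./must-gather-bz
$ tar czf must-gather-bz.tar.gz ./must-gather-bz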
Ryan is on leave
> Thanks, Junqi. But the alert was resolved by recreating the MCD pod. Why did the alert keep firing?

I'm unclear whether the issue is the alert firing or the alert not going away once the underlying issue is resolved?
The issue is the alert not going away when the issue is resolved.
I was trying to verify this fix but https://bugzilla.redhat.com/show_bug.cgi?id=1871795 is blocking it. I can't log into the node to start/stop the kubelet, and if I stop the kubelet from an `oc debug node` context, I burn the bridge I'm standing on. bz1871795 should be resolved in <24h according to the bz owner.
QE note: I was able to work around bz1871795 with the following procedure. Choose a worker for testing the fix, then:

  oc debug node/<worker>
  chroot /host
  cd /var/home/core/.ssh
  cp authorized_keys.d/ignition authorized_keys
  chown core:core authorized_keys

Now you should be able to ssh into the node and stop the kubelet, wait for the alert to trigger, start the kubelet, and wait for the alert to clear.
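Roughly what that verification loop looks like once ssh works again (node address and wait times are placeholders):

$ ssh core@<worker-address>
$ sudo systemctl stop kubelet
  # wait until KubeletHealthState shows up as firing, then:
$ sudo systemctl start kubelet
  # the alert should clear on its own a few evaluation cycles later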
Created attachment 1712575 [details] fix-confirmation.png
I was just checking it. Verified on 4.6.0-0.nightly-2020-09-09-003430. Stopped the kubelet service on one node, saw the alert firing, and it cleared once the kubelet was started again.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-003430   True        False         10h     Cluster version is 4.6.0-0.nightly-2020-09-09-003430

While the kubelet was stopped:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   334  100   334    0     0   6359      0 --:--:-- --:--:-- --:--:--  6423
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcd_kubelet_state",
          "endpoint": "metrics",
          "instance": "10.0.174.178:9001",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "pod": "machine-config-daemon-kxqvp",
          "service": "machine-config-daemon"
        },
        "value": [
          1599668105.29,
          "7"
        ]
      }
    ]
  }
}

After starting the kubelet again:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    63  100    63    0     0   1803      0 --:--:-- --:--:-- --:--:--  1852
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}
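For completeness, the alert itself (rather than the underlying mcd_kubelet_state series) could be checked the same way; a query sketch against Prometheus's built-in ALERTS series, reusing the same $token:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -G -H "Authorization: Bearer $token" \
    --data-urlencode 'query=ALERTS{alertname="KubeletHealthState",alertstate="firing"}' \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'

This should return [] once the alert has cleared.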
Created attachment 1714310 [details] metrics-fix-confirmation
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Can this be backported to 4.5? It is causing a lot of confusion for customers.
This has been backported to 4.5.z in 4.5.11: https://bugzilla.redhat.com/show_bug.cgi?id=1872337