Bug 1854009

Summary: KubeletHealthState alert keeps firing
Product: OpenShift Container Platform
Reporter: shishika
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.3.z
CC: alegrand, anpicker, aos-bugs, ehashman, erooth, jokerman, kakkoyun, kgarriso, lcosic, mjudeiki, mloibl, palonsor, pkrupa, rrackow, schoudha, surbania, wking
Target Milestone: ---
Keywords: ServiceDeliveryImpact
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:12:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1872337
Attachments:
  fix-confirmation.png (flags: none)
  metrics-fix-confirmation (flags: none)

Description shishika 2020-07-06 06:08:37 UTC
Description of problem:
The KubeletHealthState alert keeps firing: it fired when kubelet was unhealthy, but it never resolved after the underlying issue was gone.

Version-Release number of selected component (if applicable):
4.3.12

Steps to Reproduce:
1. Trigger the Prometheus KubeletHealthState alert (e.g. by making kubelet unhealthy).
2. Resolve the cause of KubeletHealthState.
3. Wait for the alert to clear.

Actual results:
The Prometheus alert keeps firing.

Expected results:
No Prometheus alerts are firing.

Additional info:

Comment 6 Kirsten Garrison 2020-07-09 16:33:02 UTC
Please attach a must-gather for the cluster.

Comment 10 Pablo Alonso Rodriguez 2020-07-14 07:32:28 UTC
Is it possible that this[1] has something to do with it?

[1] - https://github.com/openshift/machine-config-operator/blob/release-4.3/pkg/daemon/daemon.go#L628
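For illustration only (this is not the actual MCO code; the `> 2` threshold comes from the alert query in comment 31, and the reset-on-recovery behavior is an assumption about how the fix works): a metric that only ever increments on failed health checks will keep such an alert firing forever after recovery, whereas a metric that resets on a healthy check lets the alert clear.

```shell
#!/bin/bash
# Illustrative simulation, not MCO code: compare a total-failure counter
# (never decreases) with a consecutive-failure count (resets on recovery)
# against an alert condition like mcd_kubelet_state > 2.
counter=0          # accumulates every failed health check ever seen
consecutive=0      # counts only consecutive failures; reset on success

observe() {        # $1 = kubelet health check result: ok|bad
  if [ "$1" = bad ]; then
    counter=$((counter + 1))
    consecutive=$((consecutive + 1))
  else
    consecutive=0  # recovery resets the consecutive count
  fi
}

fires() { [ "$1" -gt 2 ] && echo firing || echo resolved; }

# kubelet unhealthy for 5 checks, then healthy again
for s in bad bad bad bad bad ok ok; do observe "$s"; done
echo "total-counter alert:       $(fires "$counter")"      # stays firing
echo "consecutive-failure alert: $(fires "$consecutive")"  # resolves
```

This matches the symptom reported here: with an ever-growing value, the alert condition stays true even after kubelet is healthy again.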

Comment 11 Kirsten Garrison 2020-07-15 17:05:07 UTC
@Pablo @shishika Please provide a must-gather for the affected clusters. If it is too large, you can upload it to Google Drive and link it in this BZ. Thanks.

Comment 20 Seth Jennings 2020-08-10 16:06:30 UTC
Ryan is on leave

Comment 21 Seth Jennings 2020-08-17 17:11:40 UTC
> Thanks, Junqi. But the alert was resolved by recreating mcd pod. Why had the alert kept firing?

I'm unclear whether the issue is the alert firing at all, or the alert not going away once the underlying issue is resolved.

Comment 22 shishika 2020-08-20 01:09:37 UTC
The issue is the alert not going away when the issue is resolved.

Comment 26 Seth Jennings 2020-08-25 16:38:12 UTC
I was trying to verify this fix but https://bugzilla.redhat.com/show_bug.cgi?id=1871795 is blocking it.  I can't log into the node to start/stop the kubelet and if I stop kubelet from an `oc debug node` context, I burn the bridge I'm standing on.  1871795 should be resolved in <24h according to bz owner.

Comment 27 Seth Jennings 2020-08-25 17:03:04 UTC
QE note:

I was able to work around bz1871795 with the following procedure:

Choose a worker node for testing the fix, then:

oc debug node/<worker>
chroot /host
cd /var/home/core/.ssh
cp authorized_keys.d/ignition authorized_keys
chown core:core authorized_keys

Now you should be able to ssh into the node and stop the kubelet, wait for alert to trigger, start kubelet, wait for the alert to clear.
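The stop/wait/start cycle above can be sketched as a small helper script. The `core` user and the `kubelet.service` unit match what this procedure relies on; the helper name, the node argument, and `WAIT_SECONDS` are illustrative, not part of any documented tooling.

```shell
#!/bin/bash
# Hypothetical wrapper for the verification steps above.
toggle_kubelet() {
  local node=$1
  # Stop kubelet so the KubeletHealthState alert starts firing.
  ssh "core@${node}" sudo systemctl stop kubelet.service
  echo "kubelet stopped on ${node}"
  # Give Prometheus time to evaluate the alert rule before recovering.
  sleep "${WAIT_SECONDS:-600}"
  # Start kubelet again; with the fix, the alert should then clear.
  ssh "core@${node}" sudo systemctl start kubelet.service
  echo "kubelet started on ${node}"
}
```

Run it as `toggle_kubelet <worker>` after applying the authorized_keys workaround above, then watch the alert in the console or via the Prometheus API.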

Comment 28 Seth Jennings 2020-08-25 17:03:37 UTC
Created attachment 1712575 [details]
fix-confirmation.png

Comment 31 Sunil Choudhary 2020-09-09 16:30:28 UTC
Verified on 4.6.0-0.nightly-2020-09-09-003430. I stopped the kubelet service on one node and saw the alert firing; it cleared once kubelet was started again.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-003430   True        False         10h     Cluster version is 4.6.0-0.nightly-2020-09-09-003430

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcd_kubelet_state",
          "endpoint": "metrics",
          "instance": "10.0.174.178:9001",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "pod": "machine-config-daemon-kxqvp",
          "service": "machine-config-daemon"
        },
        "value": [
          1599668105.29,
          "7"
        ]
      }
    ]
  }
}

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}

Comment 32 Sunil Choudhary 2020-09-09 16:32:54 UTC
Created attachment 1714310 [details]
metrics-fix-confirmation

Comment 34 errata-xmlrpc 2020-10-27 16:12:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 35 Mangirdas Judeikis 2020-11-03 08:47:29 UTC
Can this be backported to 4.5? It is confusing customers a lot.

Comment 36 Seth Jennings 2020-11-03 15:21:31 UTC
This has been backported to 4.5.z in 4.5.11.
https://bugzilla.redhat.com/show_bug.cgi?id=1872337