Description of problem:
The KubeletHealthState alert keeps firing. The alert fired but never resolved after the underlying issue was gone.

Version-Release number of selected component (if applicable):
4.3.12

Steps to Reproduce:
1. Trigger the Prometheus KubeletHealthState alert
2. Resolve the cause of KubeletHealthState
3.

Actual results:
The KubeletHealthState alert keeps firing.

Expected results:
No Prometheus alerts are firing once the cause is resolved.

Additional info:
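A rough sketch of one way to check whether the alert is still firing, querying the in-cluster Prometheus from the prometheus-k8s-0 pod (this assumes the prometheus-k8s serviceaccount token has access to the monitoring stack; on newer oc versions `oc create token prometheus-k8s` replaces `sa get-token`):

$ token=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -H "Authorization: Bearer $token" \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "KubeletHealthState")'

An empty result means the alert is no longer active; in this case it kept showing up even after the cause was fixed.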
Please attach a must gather for the cluster
Is it possible that this[1] has something to do with it?

[1] - https://github.com/openshift/machine-config-operator/blob/release-4.3/pkg/daemon/daemon.go#L628
@Pablo @shishika Please provide a must gather for the affected clusters. If it is too large you can use gdrive and link to this BZ. Thanks.
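For reference, the usual collection looks roughly like this (the destination directory name is just an example):

$ oc adm must-gather --dest-dir=./must-gather-bz
$ tar czf must-gather-bz.tar.gz ./must-gather-bz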
Ryan is on leave
> Thanks, Junqi. But the alert was resolved by recreating the MCD pod. Why did the alert keep firing?

I'm unclear whether the issue is the alert firing or the alert not going away once the underlying issue is resolved?
The issue is the alert not going away when the issue is resolved.
I was trying to verify this fix but https://bugzilla.redhat.com/show_bug.cgi?id=1871795 is blocking it. I can't log into the node to start/stop the kubelet, and if I stop the kubelet from an `oc debug node` context, I burn the bridge I'm standing on. bz1871795 should be resolved in <24h according to the bz owner.
QE note: I was able to work around bz1871795 with the following procedure. Choose a worker for testing the fix, then:

  oc debug node/<worker>
  chroot /host
  cd /var/home/core/.ssh
  cp authorized_keys.d/ignition authorized_keys
  chown core:core authorized_keys

Now you should be able to ssh into the node and stop the kubelet, wait for the alert to trigger, start the kubelet, and wait for the alert to clear.
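Roughly what that verification loop looks like once ssh works again (node address and wait times are placeholders):

$ ssh core@<worker-address>
$ sudo systemctl stop kubelet
  # wait until KubeletHealthState shows up as firing, then:
$ sudo systemctl start kubelet
  # the alert should clear on its own a few evaluation cycles later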
Created attachment 1712575 [details] fix-confirmation.png
I was just checking it. Verified on 4.6.0-0.nightly-2020-09-09-003430. Stopped the kubelet service on one node, saw the alert firing, and it cleared once the kubelet was started again.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-003430   True        False         10h     Cluster version is 4.6.0-0.nightly-2020-09-09-003430

While the kubelet was stopped:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   334  100   334    0     0   6359      0 --:--:-- --:--:-- --:--:--  6423
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcd_kubelet_state",
          "endpoint": "metrics",
          "instance": "10.0.174.178:9001",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "pod": "machine-config-daemon-kxqvp",
          "service": "machine-config-daemon"
        },
        "value": [
          1599668105.29,
          "7"
        ]
      }
    ]
  }
}

After starting the kubelet again:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    63  100    63    0     0   1803      0 --:--:-- --:--:-- --:--:--  1852
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}
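For completeness, the alert itself (rather than the underlying mcd_kubelet_state series) could be checked the same way; a query sketch against Prometheus's built-in ALERTS series, reusing the same $token:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -s -G -H "Authorization: Bearer $token" \
    --data-urlencode 'query=ALERTS{alertname="KubeletHealthState",alertstate="firing"}' \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'

This should return [] once the alert has cleared.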
Created attachment 1714310 [details] metrics-fix-confirmation
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Can this be backported to 4.5? It is causing a lot of confusion for customers.
This has been backported to 4.5.z in 4.5.11: https://bugzilla.redhat.com/show_bug.cgi?id=1872337