Bug 1854009 - KubeletHealthState alert keeps firing
Summary: KubeletHealthState alert keeps firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1872337
Reported: 2020-07-06 06:08 UTC by shishika
Modified: 2024-03-25 16:08 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:12:20 UTC
Target Upstream Version:
Embargoed:


Attachments
fix-confirmation.png (100.31 KB, image/png)
2020-08-25 17:03 UTC, Seth Jennings
metrics-fix-confirmation (146.97 KB, image/png)
2020-09-09 16:32 UTC, Sunil Choudhary


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift machine-config-operator pull 2021 | 0 | None | closed | Bug 1854009: remove err cardinality from mcd_kubelet_state metrics | 2021-02-05 06:45:42 UTC
Red Hat Product Errata RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:12:53 UTC

Description shishika 2020-07-06 06:08:37 UTC
Description of problem:
The KubeletHealthState alert stays in the firing state. The alert fired but was never resolved after the underlying issue was gone.

Version-Release number of selected component (if applicable):
4.3.12

Steps to Reproduce:
1. Trigger the Prometheus KubeletHealthState alert.
2. Resolve the cause of the KubeletHealthState alert.

Actual results:
Prometheus alerts keep on firing.

Expected results:
No Prometheus alerts are firing.

Additional info:

Comment 6 Kirsten Garrison 2020-07-09 16:33:02 UTC
Please attach a must gather for the cluster

Comment 10 Pablo Alonso Rodriguez 2020-07-14 07:32:28 UTC
Is it possible that this [1] has something to do with it?

[1] - https://github.com/openshift/machine-config-operator/blob/release-4.3/pkg/daemon/daemon.go#L628

Comment 11 Kirsten Garrison 2020-07-15 17:05:07 UTC
@Pablo @shishika Please provide a must gather for the affected clusters. If it is too large you can use gdrive and link to this BZ. Thanks.

Comment 20 Seth Jennings 2020-08-10 16:06:30 UTC
Ryan is on leave

Comment 21 Seth Jennings 2020-08-17 17:11:40 UTC
> Thanks, Junqi. But the alert was resolved by recreating mcd pod. Why had the alert kept firing?

I'm unclear whether the issue is the alert firing or the alert not going away once the issue is resolved.

Comment 22 shishika 2020-08-20 01:09:37 UTC
The issue is the alert not going away when the issue is resolved.

Comment 26 Seth Jennings 2020-08-25 16:38:12 UTC
I was trying to verify this fix but https://bugzilla.redhat.com/show_bug.cgi?id=1871795 is blocking it.  I can't log into the node to start/stop the kubelet and if I stop kubelet from an `oc debug node` context, I burn the bridge I'm standing on.  1871795 should be resolved in <24h according to bz owner.

Comment 27 Seth Jennings 2020-08-25 17:03:04 UTC
QE note:

I was able to work around bz1871795 with the following procedure:

Choose a worker for testing the fix, then run:

oc debug node/<worker>
chroot /host
cd /var/home/core/.ssh
cp authorized_keys.d/ignition authorized_keys
chown core:core authorized_keys

Now you should be able to ssh into the node, stop the kubelet, wait for the alert to trigger, start the kubelet, and wait for the alert to clear.

Comment 28 Seth Jennings 2020-08-25 17:03:37 UTC
Created attachment 1712575 [details]
fix-confirmation.png

Comment 31 Sunil Choudhary 2020-09-09 16:30:28 UTC
I was just checking it. Verified on 4.6.0-0.nightly-2020-09-09-003430. I stopped the kubelet service on one node and saw the alerts firing; they cleared once the kubelet was started again.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-003430   True        False         10h     Cluster version is 4.6.0-0.nightly-2020-09-09-003430

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   334  100   334    0     0   6359      0 --:--:-- --:--:-- --:--:--  6423
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcd_kubelet_state",
          "endpoint": "metrics",
          "instance": "10.0.174.178:9001",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "pod": "machine-config-daemon-kxqvp",
          "service": "machine-config-daemon"
        },
        "value": [
          1599668105.29,
          "7"
        ]
      }
    ]
  }
}

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state>2' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    63  100    63    0     0   1803      0 --:--:-- --:--:-- --:--:--  1852
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}
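
For scripted verification, the two responses above can be told apart by checking whether the `result` vector is empty. A minimal Python sketch, assuming the response body has already been captured as text; the function name `unresolved_series` and the `FIRST`/`SECOND` samples (trimmed from the output above, labels reduced to `pod`) are illustrative and not part of any OpenShift tooling:

```python
import json


def unresolved_series(response_text):
    """Return (pod, value) pairs from a Prometheus instant-query response.

    An empty list means the query matched no series, i.e. nothing is
    above the threshold used in the query (mcd_kubelet_state>2 above).
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %r" % body.get("status"))
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %r" % data["resultType"])
    # Each sample is {"metric": {labels}, "value": [timestamp, "string value"]}.
    return [
        (sample["metric"].get("pod", "<unknown>"), float(sample["value"][1]))
        for sample in data["result"]
    ]


# Samples trimmed from the two responses above (hypothetical variable names).
FIRST = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "mcd_kubelet_state", "pod": "machine-config-daemon-kxqvp"},
   "value": [1599668105.29, "7"]}]}}"""
SECOND = '{"status": "success", "data": {"resultType": "vector", "result": []}}'

for name, text in (("first response", FIRST), ("second response", SECOND)):
    series = unresolved_series(text)
    print(name, "->", series if series else "no series above threshold")
```

The check relies only on the documented shape of an instant-query response, so it should work for any query that returns a vector, not just this metric.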

Comment 32 Sunil Choudhary 2020-09-09 16:32:54 UTC
Created attachment 1714310 [details]
metrics-fix-confirmation

Comment 34 errata-xmlrpc 2020-10-27 16:12:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 35 Mangirdas Judeikis 2020-11-03 08:47:29 UTC
Can this be backported to 4.5? It is confusing customers a lot.

Comment 36 Seth Jennings 2020-11-03 15:21:31 UTC
This has been backported to 4.5.z in 4.5.11:
https://bugzilla.redhat.com/show_bug.cgi?id=1872337

