Description of problem:
On node ip-10-0-152-16.us-east-2.compute.internal there are 19 Running pods and 4 Completed pods, but kubelet_running_pod_count{node="ip-10-0-152-16.us-east-2.compute.internal"} reports 21 pods, which is wrong; it should be 19.

# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal"
openshift-apiserver                      apiserver-846f6ddf85-jsgw8                                            2/2   Running     0   3h46m   10.129.0.7    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-69kbs                                                           1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-controller-manager             controller-manager-wmdvv                                              1/1   Running     0   4h37m   10.129.0.3    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-dns                            dns-default-xvwbt                                                     3/3   Running     0   4h44m   10.129.0.2    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-ip-10-0-152-16.us-east-2.compute.internal                        3/3   Running     0   4h36m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-quorum-guard-845f945cb8-t92f5                                    1/1   Running     0   3h46m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           revision-pruner-4-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h46m   10.129.0.31   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-image-registry                 node-ca-qmh9n                                                         1/1   Running     0   4h38m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 kube-apiserver-ip-10-0-152-16.us-east-2.compute.internal              5/5   Running     0   4h26m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 revision-pruner-7-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h45m   10.129.0.4    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        kube-controller-manager-ip-10-0-152-16.us-east-2.compute.internal    4/4   Running     0   4h32m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        revision-pruner-7-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h46m   10.129.0.32   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 openshift-kube-scheduler-ip-10-0-152-16.us-east-2.compute.internal   2/2   Running     0   4h34m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 revision-pruner-8-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h45m   10.129.0.6    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        etcd-quorum-guard-85cd58c9cb-cnt92                                    1/1   Running     0   3h46m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-lh5gx                                           2/2   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-server-vpxng                                           1/1   Running     0   4h44m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-monitoring                     node-exporter-9pl66                                                   2/2   Running     0   4h38m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-multus                         multus-admission-controller-l9j2s                                     2/2   Running     0   4h45m   10.129.0.5    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-multus                         multus-rs2fp                                                          1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            ovs-gfdk2                                                             1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-controller-2gg6g                                                  1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-zmxbb                                                             1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>

# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal" | grep Running | wc -l
19
# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal" | grep Completed | wc -l
4

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count{node="ip-10-0-152-16.us-east-2.compute.internal"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   341  100   341    0     0   6493      0 --:--:-- --:--:-- --:--:--  6557
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "kubelet_running_pod_count",
          "endpoint": "https-metrics",
          "instance": "10.0.152.16:10250",
          "job": "kubelet",
          "metrics_path": "/metrics",
          "namespace": "kube-system",
          "node": "ip-10-0-152-16.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1596001589.458,
          "21"
        ]
      }
    ]
  }
}

# oc describe node ip-10-0-152-16.us-east-2.compute.internal
Non-terminated Pods: (19 in total)
  Namespace                                Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                ----                                                                  ------------  ----------  ---------------  -------------  ---
  openshift-apiserver                      apiserver-846f6ddf85-jsgw8                                            110m (3%)     0 (0%)      250Mi (1%)       0 (0%)         3h46m
  openshift-cluster-node-tuning-operator   tuned-69kbs                                                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h44m
  openshift-controller-manager             controller-manager-wmdvv                                              100m (2%)     0 (0%)      100Mi (0%)       0 (0%)         4h37m
  openshift-dns                            dns-default-xvwbt                                                     65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     4h44m
  openshift-etcd                           etcd-ip-10-0-152-16.us-east-2.compute.internal                        430m (12%)    0 (0%)      860Mi (5%)       0 (0%)         4h36m
  openshift-etcd                           etcd-quorum-guard-845f945cb8-t92f5                                    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         3h46m
  openshift-image-registry                 node-ca-qmh9n                                                         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h38m
  openshift-kube-apiserver                 kube-apiserver-ip-10-0-152-16.us-east-2.compute.internal              340m (9%)     0 (0%)      1224Mi (8%)      0 (0%)         4h26m
  openshift-kube-controller-manager        kube-controller-manager-ip-10-0-152-16.us-east-2.compute.internal    100m (2%)     0 (0%)      500Mi (3%)       0 (0%)         4h32m
  openshift-kube-scheduler                 openshift-kube-scheduler-ip-10-0-152-16.us-east-2.compute.internal   20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         4h34m
  openshift-machine-config-operator        etcd-quorum-guard-85cd58c9cb-cnt92                                    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         3h46m
  openshift-machine-config-operator        machine-config-daemon-lh5gx                                           40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         4h45m
  openshift-machine-config-operator        machine-config-server-vpxng                                           20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h44m
  openshift-monitoring                     node-exporter-9pl66                                                   9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         4h38m
  openshift-multus                         multus-admission-controller-l9j2s                                     20m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4h45m
  openshift-multus                         multus-rs2fp                                                          10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         4h45m
  openshift-sdn                            ovs-gfdk2                                                             100m (2%)     0 (0%)      400Mi (2%)       0 (0%)         4h45m
  openshift-sdn                            sdn-controller-2gg6g                                                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h45m
  openshift-sdn                            sdn-zmxbb                                                             100m (2%)     0 (0%)      200Mi (1%)       0 (0%)         4h45m

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-28-195907

How reproducible:
always

Steps to Reproduce:
1. See the description
2.
3.
Actual results:
kubelet_running_pod_count reports 21 running pods for the node instead of 19.

Expected results:
kubelet_running_pod_count should match the number of Running pods on the node (19).

Additional info:
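A quick loop like the one below (a sketch; it reuses the $token from the description, assumes jq is available where it runs, and uses --field-selector purely as a convenience variant of the grep pipelines above) can show whether other nodes exhibit the same drift between the kubelet gauge and the API server's count of Running pods:

for n in $(oc get nodes -o name | cut -d/ -f2); do
  # value of the kubelet gauge for this node
  metric=$(oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -k -g -H "Authorization: Bearer $token" \
    "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count{node=\"$n\"}" \
    | jq -r '.data.result[0].value[1]')
  # Running pods on this node as counted via the API server
  actual=$(oc get pod --all-namespaces --field-selector spec.nodeName="$n",status.phase=Running --no-headers | wc -l)
  echo "$n metric=$metric actual=$actual"
done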
Interesting! It seems that kube_pod_info has the correct information, but kubelet_running_pod_count does not. There is already an upstream issue for this, https://github.com/kubernetes/kubernetes/issues/81412, and a fix in progress in https://github.com/kubernetes/kubernetes/pull/92187 from Pawel, so reassigning to him.
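For reference, a PromQL expression along these lines makes the comparison concrete (a sketch; it assumes the usual kube-state-metrics labels on kube_pod_info and its companion metric kube_pod_status_phase):

  # Running pods per node as seen by kube-state-metrics; for the node above this
  # returns 19, while the kubelet gauge returns 21
  count by (node) (
    (kube_pod_status_phase{phase="Running"} == 1)
    * on (namespace, pod) group_left (node) kube_pod_info
  )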
Alerting based on this metric is fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1846805. I'll leave this bug open, since the upstream fix in https://github.com/kubernetes/kubernetes/pull/92187 is not yet finished. Reducing severity, as the alerts have a workaround.
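The alert workaround is presumably along these lines (illustrative only, not necessarily the exact rule shipped in that bug; kube_node_status_capacity_pods is the kube-state-metrics name from this era): derive the per-node running-pod count from kube-state-metrics rather than the kubelet gauge, e.g. for a KubeletTooManyPods-style alert:

  # fire when a node is above 95% of its pod capacity, counting Running pods
  # via kube-state-metrics instead of kubelet_running_pod_count
  count by (node) (
    (kube_pod_status_phase{phase="Running"} == 1)
    * on (namespace, pod) group_left (node) kube_pod_info
  )
  / max by (node) (kube_node_status_capacity_pods)
  > 0.95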
The upstream fix in https://github.com/kubernetes/kubernetes/pull/85983 seems to be progressing and has already been lgtm'd. Reassigning to the node team to shepherd porting the fix into OpenShift.
This regressed upstream in 1.16:
https://github.com/kubernetes/kubernetes/pull/85983/files#r424681852
https://github.com/kubernetes/kubernetes/commit/c02d49d775b4dc960f52af1f5295642c07947ca7
Verified on 4.6.0-0.nightly-2020-08-18-165040:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-18-165040   True        False         3h26m   Cluster version is 4.6.0-0.nightly-2020-08-18-165040

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-146-213.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-149-17.us-east-2.compute.internal    Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-183-120.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-185-235.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-212-255.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-218-126.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty

$ oc get pod --all-namespaces -o wide | grep "ip-10-0-149-17.us-east-2.compute.internal" | grep Running | wc -l
21
$ oc get pod --all-namespaces -o wide | grep "ip-10-0-149-17.us-east-2.compute.internal" | grep Completed | wc -l
4

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pods{node="ip-10-0-149-17.us-east-2.compute.internal"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   336  100   336    0     0   7205      0 --:--:-- --:--:-- --:--:--  7304
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "kubelet_running_pods",
          "endpoint": "https-metrics",
          "instance": "10.0.149.17:10250",
          "job": "kubelet",
          "metrics_path": "/metrics",
          "namespace": "kube-system",
          "node": "ip-10-0-149-17.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1597827031.686,
          "21"
        ]
      }
    ]
  }
}

The renamed metric kubelet_running_pods now matches the 21 Running pods on the node.
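One follow-up worth noting: dashboards and ad-hoc queries that still reference the old kubelet_running_pod_count name need to move to kubelet_running_pods. Whether the deprecated name is dropped entirely or still exported alongside the new one depends on the kubelet version, so treat that as an assumption and probe it directly (sketch, reusing $token as above; a result length of 0 means the old name is gone):

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -k -g -H "Authorization: Bearer $token" \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count' \
  | jq '.data.result | length'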
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196