Description of problem:
On node ip-10-0-152-16.us-east-2.compute.internal there are 19 Running pods and 4 Completed pods, but kubelet_running_pod_count{node="ip-10-0-152-16.us-east-2.compute.internal"} reports 21 pods, which is wrong; it should be 19.

# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal"
openshift-apiserver                      apiserver-846f6ddf85-jsgw8                                            2/2   Running     0   3h46m   10.129.0.7    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-69kbs                                                           1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-controller-manager             controller-manager-wmdvv                                              1/1   Running     0   4h37m   10.129.0.3    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-dns                            dns-default-xvwbt                                                     3/3   Running     0   4h44m   10.129.0.2    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-ip-10-0-152-16.us-east-2.compute.internal                        3/3   Running     0   4h36m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-quorum-guard-845f945cb8-t92f5                                    1/1   Running     0   3h46m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-etcd                           revision-pruner-4-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h46m   10.129.0.31   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-image-registry                 node-ca-qmh9n                                                         1/1   Running     0   4h38m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 kube-apiserver-ip-10-0-152-16.us-east-2.compute.internal              5/5   Running     0   4h26m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 revision-pruner-7-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h45m   10.129.0.4    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        kube-controller-manager-ip-10-0-152-16.us-east-2.compute.internal    4/4   Running     0   4h32m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        revision-pruner-7-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h46m   10.129.0.32   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 openshift-kube-scheduler-ip-10-0-152-16.us-east-2.compute.internal   2/2   Running     0   4h34m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 revision-pruner-8-ip-10-0-152-16.us-east-2.compute.internal           0/1   Completed   0   3h45m   10.129.0.6    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        etcd-quorum-guard-85cd58c9cb-cnt92                                    1/1   Running     0   3h46m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-lh5gx                                           2/2   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-server-vpxng                                           1/1   Running     0   4h44m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-monitoring                     node-exporter-9pl66                                                   2/2   Running     0   4h38m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-multus                         multus-admission-controller-l9j2s                                     2/2   Running     0   4h45m   10.129.0.5    ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-multus                         multus-rs2fp                                                          1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            ovs-gfdk2                                                             1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-controller-2gg6g                                                  1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-zmxbb                                                             1/1   Running     0   4h45m   10.0.152.16   ip-10-0-152-16.us-east-2.compute.internal   <none>   <none>

# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal" | grep Running | wc -l
19
# oc get pod --all-namespaces -o wide | grep "ip-10-0-152-16.us-east-2.compute.internal" | grep Completed | wc -l
4

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count{node="ip-10-0-152-16.us-east-2.compute.internal"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   341  100   341    0     0   6493      0 --:--:-- --:--:-- --:--:--  6557
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "kubelet_running_pod_count",
          "endpoint": "https-metrics",
          "instance": "10.0.152.16:10250",
          "job": "kubelet",
          "metrics_path": "/metrics",
          "namespace": "kube-system",
          "node": "ip-10-0-152-16.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1596001589.458,
          "21"
        ]
      }
    ]
  }
}

# oc describe node ip-10-0-152-16.us-east-2.compute.internal
Non-terminated Pods: (19 in total)
  Namespace                                Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                ----                                                                  ------------  ----------  ---------------  -------------  ---
  openshift-apiserver                      apiserver-846f6ddf85-jsgw8                                            110m (3%)     0 (0%)      250Mi (1%)       0 (0%)         3h46m
  openshift-cluster-node-tuning-operator   tuned-69kbs                                                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h44m
  openshift-controller-manager             controller-manager-wmdvv                                              100m (2%)     0 (0%)      100Mi (0%)       0 (0%)         4h37m
  openshift-dns                            dns-default-xvwbt                                                     65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     4h44m
  openshift-etcd                           etcd-ip-10-0-152-16.us-east-2.compute.internal                        430m (12%)    0 (0%)      860Mi (5%)       0 (0%)         4h36m
  openshift-etcd                           etcd-quorum-guard-845f945cb8-t92f5                                    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         3h46m
  openshift-image-registry                 node-ca-qmh9n                                                         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h38m
  openshift-kube-apiserver                 kube-apiserver-ip-10-0-152-16.us-east-2.compute.internal              340m (9%)     0 (0%)      1224Mi (8%)      0 (0%)         4h26m
  openshift-kube-controller-manager        kube-controller-manager-ip-10-0-152-16.us-east-2.compute.internal    100m (2%)     0 (0%)      500Mi (3%)       0 (0%)         4h32m
  openshift-kube-scheduler                 openshift-kube-scheduler-ip-10-0-152-16.us-east-2.compute.internal   20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         4h34m
  openshift-machine-config-operator        etcd-quorum-guard-85cd58c9cb-cnt92                                    10m (0%)      0 (0%)      5Mi (0%)         0 (0%)         3h46m
  openshift-machine-config-operator        machine-config-daemon-lh5gx                                           40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         4h45m
  openshift-machine-config-operator        machine-config-server-vpxng                                           20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h44m
  openshift-monitoring                     node-exporter-9pl66                                                   9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         4h38m
  openshift-multus                         multus-admission-controller-l9j2s                                     20m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4h45m
  openshift-multus                         multus-rs2fp                                                          10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         4h45m
  openshift-sdn                            ovs-gfdk2                                                             100m (2%)     0 (0%)      400Mi (2%)       0 (0%)         4h45m
  openshift-sdn                            sdn-controller-2gg6g                                                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h45m
  openshift-sdn                            sdn-zmxbb                                                             100m (2%)     0 (0%)      200Mi (1%)       0 (0%)         4h45m

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-28-195907

How reproducible:
always

Steps to Reproduce:
1. See the description
2.
3.
Actual results:
kubelet_running_pod_count reports 21 running pods for the node instead of 19.

Expected results:
kubelet_running_pod_count should match the number of Running pods on the node (19).

Additional info:
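A quick loop like the one below (a sketch; it reuses the $token from the description, assumes jq is available where it runs, and uses --field-selector purely as a convenience variant of the grep pipelines above) can show whether other nodes exhibit the same drift between the kubelet gauge and the API server's count of Running pods:

for n in $(oc get nodes -o name | cut -d/ -f2); do
  # value of the kubelet gauge for this node
  metric=$(oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -k -g -H "Authorization: Bearer $token" \
    "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count{node=\"$n\"}" \
    | jq -r '.data.result[0].value[1]')
  # Running pods on this node as counted via the API server
  actual=$(oc get pod --all-namespaces --field-selector spec.nodeName="$n",status.phase=Running --no-headers | wc -l)
  echo "$n metric=$metric actual=$actual"
done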
Interesting! It seems that kube_pod_info has the correct information, but kubelet_running_pod_count does not. There is already an upstream issue for this, https://github.com/kubernetes/kubernetes/issues/81412, and a fix in progress in https://github.com/kubernetes/kubernetes/pull/92187 from Pawel, so reassigning to him.
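For reference, a PromQL expression along these lines makes the comparison concrete (a sketch; it assumes the usual kube-state-metrics labels on kube_pod_info and its companion metric kube_pod_status_phase):

  # Running pods per node as seen by kube-state-metrics; for the node above this
  # returns 19, while the kubelet gauge returns 21
  count by (node) (
    (kube_pod_status_phase{phase="Running"} == 1)
    * on (namespace, pod) group_left (node) kube_pod_info
  )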
Alerting based on this metric is fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1846805. I'll leave this bug open, since the upstream fix in https://github.com/kubernetes/kubernetes/pull/92187 is not yet finished. Reducing severity, as the alerts have a workaround.
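The alert workaround is presumably along these lines (illustrative only, not necessarily the exact rule shipped in that bug; kube_node_status_capacity_pods is the kube-state-metrics name from this era): derive the per-node running-pod count from kube-state-metrics rather than the kubelet gauge, e.g. for a KubeletTooManyPods-style alert:

  # fire when a node is above 95% of its pod capacity, counting Running pods
  # via kube-state-metrics instead of kubelet_running_pod_count
  count by (node) (
    (kube_pod_status_phase{phase="Running"} == 1)
    * on (namespace, pod) group_left (node) kube_pod_info
  )
  / max by (node) (kube_node_status_capacity_pods)
  > 0.95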
The upstream fix in https://github.com/kubernetes/kubernetes/pull/85983 seems to be progressing and has already been lgtm'd. Reassigning to the node team to shepherd porting the fix into OpenShift.
This regressed upstream in 1.16:
https://github.com/kubernetes/kubernetes/pull/85983/files#r424681852
https://github.com/kubernetes/kubernetes/commit/c02d49d775b4dc960f52af1f5295642c07947ca7
Verified on 4.6.0-0.nightly-2020-08-18-165040:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-18-165040   True        False         3h26m   Cluster version is 4.6.0-0.nightly-2020-08-18-165040

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-146-213.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-149-17.us-east-2.compute.internal    Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-183-120.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-185-235.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-212-255.us-east-2.compute.internal   Ready    worker   3h40m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-218-126.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0-rc.2+99cb93a-dirty

$ oc get pod --all-namespaces -o wide | grep "ip-10-0-149-17.us-east-2.compute.internal" | grep Running | wc -l
21
$ oc get pod --all-namespaces -o wide | grep "ip-10-0-149-17.us-east-2.compute.internal" | grep Completed | wc -l
4

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pods{node="ip-10-0-149-17.us-east-2.compute.internal"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   336  100   336    0     0   7205      0 --:--:-- --:--:-- --:--:--  7304
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "kubelet_running_pods",
          "endpoint": "https-metrics",
          "instance": "10.0.149.17:10250",
          "job": "kubelet",
          "metrics_path": "/metrics",
          "namespace": "kube-system",
          "node": "ip-10-0-149-17.us-east-2.compute.internal",
          "service": "kubelet"
        },
        "value": [
          1597827031.686,
          "21"
        ]
      }
    ]
  }
}

The renamed metric kubelet_running_pods now matches the 21 Running pods on the node.
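One follow-up worth noting: dashboards and ad-hoc queries that still reference the old kubelet_running_pod_count name need to move to kubelet_running_pods. Whether the deprecated name is dropped entirely or still exported alongside the new one depends on the kubelet version, so treat that as an assumption and probe it directly (sketch, reusing $token as above; a result length of 0 means the old name is gone):

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -k -g -H "Authorization: Bearer $token" \
    'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=kubelet_running_pod_count' \
  | jq '.data.result | length'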
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196