Created attachment 1710759
Filesystem and Pod count show "No datapoints found" for node

Description of problem:
As cluster admin, log in to the console, go to "Compute -> Nodes", and select a node to view its overview. As the attached screenshot shows, Filesystem and Pod count both report "No datapoints found" for the node.

Queries used by the console, checked from the API:
Filesystem: instance:node_filesystem_usage:sum{instance='qe-anusaxen10-nhx4f-master-0'}
Pod count: kubelet_running_pod_count{instance=~'139.178.76.49:.*'}

These metrics are no longer in Prometheus, so we should use other metrics.

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "kubelet_running_pod_count"
no result
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "instance:node_filesystem_usage:sum"
no result

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-06-131904

How reproducible:
Always

Steps to Reproduce:
1. See the description.
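The grep checks above match substrings, so a longer metric name containing the same text would also count as a hit. A minimal sketch of a stricter check follows, assuming the response has the usual `{"status":"success","data":[...]}` shape returned by the label values endpoint. The function name `metric_exists` is hypothetical and not part of any tool.

```shell
# metric_exists NAME
# Reads a /api/v1/label/__name__/values JSON response on stdin and succeeds
# only if NAME appears as a complete, double-quoted metric name.
# (Hypothetical helper; it does not contact the cluster itself.)
metric_exists() {
  grep -qF "\"$1\""
}

# Possible usage, piping in the same curl call shown in the description:
#   oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
#     curl -k -H "Authorization: Bearer $token" \
#     'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
#     | metric_exists kubelet_running_pod_count || echo "metric missing"
```

Because the name is matched together with its surrounding quotes, a prefix of an existing metric name does not count as a match.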
https://github.com/kubernetes/sig-release/blob/858fc2d68c731352df9ab94b5160436deccf5eab/releases/release-1.19/release-notes-draft.md

Kubelet: following metrics have been renamed:
kubelet_running_container_count --> kubelet_running_containers
kubelet_running_pod_count --> kubelet_running_pods
(#92407, @RainbowMango) [SIG API Machinery, Cluster Lifecycle, Instrumentation and Node]
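To illustrate what the rename means for an existing query, here is a rough sketch that rewrites the two old kubelet metric names in a PromQL string. It is a plain textual substitution with sed, not how the console builds its queries, and the function name `rename_kubelet_metrics` is hypothetical.

```shell
# rename_kubelet_metrics
# Reads PromQL text on stdin and replaces the two kubelet metric names
# renamed in Kubernetes 1.19 with their new names.
# (Hypothetical helper; purely textual, it does not parse PromQL.)
rename_kubelet_metrics() {
  sed -e 's/kubelet_running_pod_count/kubelet_running_pods/g' \
      -e 's/kubelet_running_container_count/kubelet_running_containers/g'
}

# Example:
#   echo "kubelet_running_pod_count{instance=~'139.178.76.49:.*'}" | rename_kubelet_metrics
```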
*** Bug 1865741 has been marked as a duplicate of this bug. ***
Renaming kubelet_running_pod_count --> kubelet_running_pods fixes the pod count issue. The other issue is Filesystem usage: `instance:node_filesystem_usage:sum` was removed completely from the Prometheus Operator in https://github.com/prometheus-operator/kube-prometheus/pull/617
I have fixes ready for the pods issues, but I'm waiting to hear back from the monitoring team on what query we should be using for the filesystem issue.
*** Bug 1873044 has been marked as a duplicate of this bug. ***
The latest accepted 4.6 nightly, 4.6.0-0.nightly-2020-08-27-005538, doesn't include the fix PR.

# export PAYLOAD=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538
[root@preserved-qe-ui-rhel-1 console]# oc adm release info $PAYLOAD --pullspecs | grep console
  console            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a3ccd44b0545258785c52d90d43a2bebc80365f52a7f2eb28601a2957020310
  console-operator   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:741428633e21bddb5f6b86970a6d62f9ed2b762bafd074f2bb7a09aa3aaf5d0a
[root@preserved-qe-ui-rhel-1 console]# export CONSOLE_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a3ccd44b0545258785c52d90d43a2bebc80365f52a7f2eb28601a2957020310
[root@preserved-qe-ui-rhel-1 console]# oc image info $CONSOLE_IMAGE | grep commit
io.openshift.build.commit.id=9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf
io.openshift.build.commit.url=https://github.com/openshift/console/commit/9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf
[root@preserved-qe-ui-rhel-1 console]# git log 9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf | grep '#6419'
// returns nothing
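The final grep in this check would also match a longer PR number that merely starts with the same digits. A small sketch of a stricter variant follows; it reads the `git log` output on stdin. The function name `log_mentions_pr` is hypothetical.

```shell
# log_mentions_pr NUMBER
# Reads git log output on stdin and succeeds only if "#NUMBER" appears
# and is not immediately followed by another digit.
# (Hypothetical helper; it only inspects text, it does not run git.)
log_mentions_pr() {
  grep -qE "#$1([^0-9]|\$)"
}

# Possible usage with the commit extracted from the console image:
#   git log 9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf | log_mentions_pr 6419 \
#     && echo "fix PR included" || echo "fix PR not included"
```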
Nightly build 4.6.0-0.nightly-2020-08-31-012413 contains the fix.

In the Node Utilization charts, Filesystem and Pods now use the queries from the PR, but the Cluster Utilization charts still have some issues:
1) Filesystem -> By Node shows no data.
2) Pods -> By Node still uses 'topk(25, sort_desc(sum(avg_over_time(kube_pod_info[5m])) BY (node)))', which differs from the query in the Node Utilization charts, 'kubelet_running_pods{instance=~'10.0.150.132:.*'}'.

Assigning back for another fix.
Thanks for flagging this; I'll work on it today.
See https://github.com/openshift/console/pull/6536 for updated queries.
Checked on an OCP 4.6 cluster with payload 4.6.0-0.nightly-2020-09-10-195619. The fix PR is included.

Checked Overview -> Cluster Utilization:
Filter "Filesystem" by node: shows "Not Available".
Filter "Pod count" by node, click "View more": the query is "topk(25, sort_desc(sum(avg_over_time(kubelet_running_pods{instance=~"<%= ipAddress %>:.*"}[5m])) BY (node)))" and "No datapoints found" is shown.

The issue in Comment 10 is not fixed.
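The query opened by "View more" still contains the literal "<%= ipAddress %>" placeholder. As a rough illustration of what such a template would look like once a node address is filled in, here is a sketch. It assumes the placeholder is meant to be replaced by the node IP, and `render_query` is a hypothetical helper, not console code.

```shell
# render_query TEMPLATE NODE_IP
# Prints TEMPLATE with every "<%= ipAddress %>" placeholder replaced by NODE_IP.
# (Hypothetical helper; NODE_IP must not contain "/" or "&", which sed treats
# specially in the replacement text.)
render_query() {
  printf '%s\n' "$1" | sed "s/<%= ipAddress %>/$2/g"
}

# Example, using a node IP quoted earlier in this bug:
#   render_query 'kubelet_running_pods{instance=~"<%= ipAddress %>:.*"}' 10.0.150.132
```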
Let me bounce this back to monitoring; they said the queries were correct.
Checked on an OCP 4.6 cluster with payload 4.6.0-0.nightly-2020-09-12-230035.

Checked Overview -> Cluster Utilization:
Filter "Filesystem" by node, click "View more": the metrics page opens and shows correct data.
Filter "Pod count" by node, click "View more": the metrics page opens and shows correct data.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196