Bug 1867034

Summary: Filesystem and Pod count are No datapoints found for node
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Management Console
Assignee: ralpert
Status: CLOSED ERRATA
QA Contact: Yanping Zhang <yanpzhan>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.6
CC: aos-bugs, jokerman, yapei
Target Milestone: ---
Keywords: Regression, UpcomingSprint
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:25:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
Filesystem and Pod count are No datapoints found for node | none

Description Junqi Zhao 2020-08-07 08:23:33 UTC
Created attachment 1710759
Filesystem and Pod count are No datapoints found for node

Description of problem:
As cluster admin, log in to the console, go to "Compute -> Nodes", and select a node to view its overview.
As the attached screenshot shows, the Filesystem and Pod count charts report "No datapoints found" for the node.
The console uses these queries (checked from the API):
Filesystem: instance:node_filesystem_usage:sum{instance='qe-anusaxen10-nhx4f-master-0'}
Pod count: kubelet_running_pod_count{instance=~'139.178.76.49:.*'}

These metrics no longer exist in Prometheus, so the console should use other metrics:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "kubelet_running_pod_count"
no result

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "instance:node_filesystem_usage:sum"
no result
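
A minimal sketch of the same check as an exact-name lookup rather than a substring grep, assuming the response follows the standard Prometheus label-values shape ({"status": "success", "data": [...]}). The helper name missing_metrics and the sample payload are hypothetical and were not captured from this cluster.

```python
import json


def missing_metrics(response_text, wanted):
    """Return the names in `wanted` that are absent from a label-values response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    present = set(payload.get("data", []))
    # Preserve the caller's order so the report is easy to read.
    return [name for name in wanted if name not in present]


# Illustrative response body (assumption), shaped like /api/v1/label/__name__/values.
sample = json.dumps(
    {"status": "success", "data": ["kubelet_running_pods", "kubelet_running_containers"]}
)

print(
    missing_metrics(
        sample,
        [
            "kubelet_running_pod_count",
            "instance:node_filesystem_usage:sum",
            "kubelet_running_pods",
        ],
    )
)
```

In practice the curl output from the commands above could be passed in as response_text in place of the sample payload.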

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-06-131904

How reproducible:
always

Steps to Reproduce:
1. see the description

Actual results:


Expected results:


Additional info:

Comment 1 Junqi Zhao 2020-08-07 08:33:46 UTC
https://github.com/kubernetes/sig-release/blob/858fc2d68c731352df9ab94b5160436deccf5eab/releases/release-1.19/release-notes-draft.md
Kubelet: following metrics have been renamed:
kubelet_running_container_count --> kubelet_running_containers
kubelet_running_pod_count --> kubelet_running_pods
(#92407, @RainbowMango) [SIG API Machinery, Cluster Lifecycle, Instrumentation and Node]
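
A rough sketch of how the two renames quoted above could be applied to an existing query string. The RENAMED table comes from the release note; the helper name apply_renames is hypothetical, and this is not how the console fix itself is implemented.

```python
import re

# Old name -> new name, as listed in the Kubernetes 1.19 release note above.
RENAMED = {
    "kubelet_running_container_count": "kubelet_running_containers",
    "kubelet_running_pod_count": "kubelet_running_pods",
}

# Characters that may appear inside a Prometheus metric name.
_NAME_CHAR = r"[A-Za-z0-9_:]"


def apply_renames(query, renamed=RENAMED):
    """Replace whole metric names only, leaving longer names that merely contain them."""
    for old, new in renamed.items():
        pattern = rf"(?<!{_NAME_CHAR}){re.escape(old)}(?!{_NAME_CHAR})"
        query = re.sub(pattern, new, query)
    return query


print(apply_renames("kubelet_running_pod_count{instance=~'139.178.76.49:.*'}"))
```

The lookarounds are a simplification: a label value that happens to contain an old metric name in quotes would also be rewritten.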

Comment 2 Jakub Hadvig 2020-08-07 10:18:08 UTC
*** Bug 1865741 has been marked as a duplicate of this bug. ***

Comment 3 Jakub Hadvig 2020-08-07 10:21:51 UTC
Renaming kubelet_running_pod_count --> kubelet_running_pods fixes the pod count issue.
The other issue is the Filesystem usage, since `instance:node_filesystem_usage:sum` was removed completely
from the Prometheus Operator in https://github.com/prometheus-operator/kube-prometheus/pull/617

Comment 4 ralpert 2020-08-20 15:28:49 UTC
I have fixes ready for the pods issues, but I'm waiting to hear back from the monitoring team on what query we should be using for the filesystem issue.

Comment 5 Jakub Hadvig 2020-08-27 09:34:06 UTC
*** Bug 1873044 has been marked as a duplicate of this bug. ***

Comment 9 Yadan Pei 2020-08-31 06:25:06 UTC
The latest accepted 4.6 nightly, 4.6.0-0.nightly-2020-08-27-005538, doesn't include the fix PR:

# export PAYLOAD=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-27-005538
[root@preserved-qe-ui-rhel-1 console]# oc adm release info $PAYLOAD --pullspecs | grep console
  console                                        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a3ccd44b0545258785c52d90d43a2bebc80365f52a7f2eb28601a2957020310
  console-operator                               quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:741428633e21bddb5f6b86970a6d62f9ed2b762bafd074f2bb7a09aa3aaf5d0a
[root@preserved-qe-ui-rhel-1 console]# export CONSOLE_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3a3ccd44b0545258785c52d90d43a2bebc80365f52a7f2eb28601a2957020310
[root@preserved-qe-ui-rhel-1 console]# oc image info $CONSOLE_IMAGE | grep commit
             io.openshift.build.commit.id=9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf
             io.openshift.build.commit.url=https://github.com/openshift/console/commit/9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf
[root@preserved-qe-ui-rhel-1 console]# git log 9cc959b2ec10e83dd0f9c6dc0e70c0ae4fb03daf | grep '#6419'   # returns nothing

Comment 10 Yadan Pei 2020-08-31 08:04:03 UTC
Nightly build 4.6.0-0.nightly-2020-08-31-012413 contains the fix.

In the Node Utilization charts, Filesystem and Pods use the queries from the PR, but the Cluster Utilization charts still have some issues:
1) Filesystem -> By Node shows no data
2) Pods -> By Node is still using 'topk(25, sort_desc(sum(avg_over_time(kube_pod_info[5m])) BY (node)))' which is different from the query in Node Utilization charts 'kubelet_running_pods{instance=~'10.0.150.132:.*'}'

Assigning back for another fix

Comment 11 ralpert 2020-08-31 13:30:22 UTC
Thanks for flagging this; I'll work on it today.

Comment 12 ralpert 2020-09-04 20:47:21 UTC
See https://github.com/openshift/console/pull/6536 for updated queries.

Comment 14 Yanping Zhang 2020-09-11 02:58:12 UTC
Checked on an OCP 4.6 cluster with payload 4.6.0-0.nightly-2020-09-10-195619.
The fix PR is included.
Checked Overview -> Cluster Utilization:
Filtering "Filesystem" by node shows "Not Available".
Filtering "Pod count" by node and clicking "View more" shows the query "topk(25, sort_desc(sum(avg_over_time(kubelet_running_pods{instance=~"<%= ipAddress %>:.*"}[5m])) BY (node)))" with "No datapoints found".
The issue in Comment 10 is not fixed.
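
The query shown above still appears to contain the literal template token <%= ipAddress %>, which would match no instance label. A small sketch of a check that flags such unrendered tokens in a query string; the helper name unrendered_placeholders is hypothetical, and the pattern assumes the <%= name %> syntax visible in the query.

```python
import re

# Matches template tokens of the form <%= name %> (syntax assumed from the query above).
PLACEHOLDER = re.compile(r"<%=\s*(\w+)\s*%>")


def unrendered_placeholders(query):
    """Return the names of template tokens left unsubstituted in `query`."""
    return PLACEHOLDER.findall(query)


query = (
    "topk(25, sort_desc(sum(avg_over_time("
    'kubelet_running_pods{instance=~"<%= ipAddress %>:.*"}[5m])) BY (node)))'
)

print(unrendered_placeholders(query))
```

A non-empty result means the query was sent to Prometheus before its variables were filled in.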

Comment 15 ralpert 2020-09-11 14:47:40 UTC
Let me bounce this back to monitoring; they said the queries were correct.

Comment 17 Yanping Zhang 2020-09-14 07:08:32 UTC
Checked on an OCP 4.6 cluster with payload 4.6.0-0.nightly-2020-09-12-230035.
Checked Overview -> Cluster Utilization:
Filtering "Filesystem" by node and clicking "View more" opens the metrics page, and correct data is shown.
Filtering "Pod count" by node and clicking "View more" opens the metrics page, and correct data is shown.

Comment 19 errata-xmlrpc 2020-10-27 16:25:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196