Cause:
The `CPU Utilisation` dashboard panel used the expression `'1 - sum(avg by (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal"}[%(grafanaIntervalVar)s])))' % $._config` to calculate the CPU utilisation of a node. The result can turn negative because of interpolation inside Prometheus.
Consequence:
The CPU Utilisation dashboard panel may show an invalid negative value.
Fix:
Integrate the upstream fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745.
Result:
The dashboard CPU Utilisation shows correct values again.
Created attachment 1850773 (console dashboard)
Description of problem:
Cluster: 4.10.0-0.nightly-2022-01-13-061145, OpenStack with http_proxy. On the "Kubernetes / Compute Resources / Cluster" dashboard, CPU Utilisation is a negative number in both the console dashboard and the Grafana dashboard.
# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
OpenStack
CPU Utilisation expression:
1 - sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal", cluster=""}[5m])))
-0.10089052531835652
checked "cluster:cpu_usage_cores:sum" from prometheus
cluster:cpu_usage_cores:sum{prometheus="openshift-monitoring/k8s"} 7.461142857142807
checked "cluster:capacity_cpu_cores:sum" from prometheus
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.large", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", prometheus="openshift-monitoring/k8s"}
12
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master", prometheus="openshift-monitoring/k8s"}
24
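As a rough cross-check, the recording-rule values above can be combined by hand. This is only a sketch under the assumption that dividing `cluster:cpu_usage_cores:sum` by the sum of the two `cluster:capacity_cpu_cores:sum` series is a fair approximation of cluster CPU utilisation. The variable names are illustrative and the numbers are copied from the query results above.

```python
# Values copied from the Prometheus query results in this report.
usage_cores = 7.461142857142807   # cluster:cpu_usage_cores:sum
capacity_cores = 12 + 24          # the two cluster:capacity_cpu_cores:sum series

# Assumption: usage / capacity approximates cluster CPU utilisation.
utilisation = usage_cores / capacity_cores
print(f"{utilisation:.4f}")       # prints 0.2073
```

By this estimate the cluster is roughly 21% utilised, so the -0.10 shown on the dashboard does not reflect the real load.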
There are also etcdGRPCRequestsSlow alerts in the cluster. This alert is often seen on OpenStack, so it may be related to OpenStack.
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-13-061145
How reproducible:
Not sure whether it is related to OpenStack; I did not see this on other IaaS platforms.
Steps to Reproduce:
1. Check the "Kubernetes / Compute Resources / Cluster" dashboard in the console and in Grafana.
Actual results:
CPU Utilisation is a negative number on the "Kubernetes / Compute Resources / Cluster" dashboard.
Expected results:
Additional info:
Comment 3 Arunprasad Rajkumar
2022-02-25 11:08:14 UTC
I'm not sure about the etcdGRPCRequestsSlow alert, but the negative CPU utilisation seems to be related to the expression `1 - sum(...)`: Prometheus interpolation can produce values that push the sum above 1, which makes the expression return a negative number.
A recent change in k8s-mixin[1] should fix the negative CPU utilisation problem.
[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745
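The arithmetic behind the negative value can be sketched as follows. This is a minimal illustration, not the upstream change: the per-mode rates are hypothetical numbers chosen so that the idle-like modes sum to slightly more than 1, and the busy-mode alternative is only one possible way to avoid the subtraction.

```python
def utilisation_from_idle(idle_like_rates):
    """Mirror the shape of the old expression: 1 - sum of per-mode average rates."""
    return 1 - sum(idle_like_rates.values())


def utilisation_from_busy(busy_rates):
    """Hypothetical alternative: add up the non-idle modes instead of subtracting."""
    return sum(busy_rates.values())


# Hypothetical per-mode averages (seconds of CPU time per second, per core).
# Rates computed over a window can slightly overshoot, so the idle-like
# modes may add up to more than 1.
idle_like = {"idle": 1.05, "iowait": 0.03, "steal": 0.02}
busy = {"user": 0.08, "system": 0.04}

raw = utilisation_from_idle(idle_like)
alt = utilisation_from_busy(busy)
print(raw < 0, alt >= 0)  # prints: True True
```

Because counter rates are never negative, a sum over the busy modes cannot drop below zero, whereas `1 - sum(...)` goes negative as soon as the idle-like rates overshoot 1.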
Comment 4 Arunprasad Rajkumar
2022-03-01 06:36:55 UTC
Checked with 4.11.0-0.nightly-2022-03-04-063157. On the "Kubernetes / Compute Resources / Cluster" dashboard, the CPU Utilisation expression is now "cluster:node_cpu:ratio_rate5m{cluster=""}". Checked on AWS and OpenStack clusters and did not see a negative CPU Utilisation value in either the console dashboard or the Grafana dashboard.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069