Created attachment 1850773 [details]
console dashboard

Description of problem:
On a 4.10.0-0.nightly-2022-01-13-061145 OpenStack cluster with http_proxy, the CPU Utilisation panel of the "Kubernetes / Compute Resources / Cluster" dashboard shows a negative number in both the console dashboard and the Grafana dashboard.

# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
OpenStack

CPU Utilisation expression:
1 - sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal", cluster=""}[5m])))
Result: -0.10089052531835652

Checked "cluster:cpu_usage_cores:sum" in Prometheus:
cluster:cpu_usage_cores:sum{prometheus="openshift-monitoring/k8s"}    7.461142857142807

Checked "cluster:capacity_cpu_cores:sum" in Prometheus:
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.large", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", prometheus="openshift-monitoring/k8s"}    12
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master", prometheus="openshift-monitoring/k8s"}    24

There are also etcdGRPCRequestsSlow alerts in the cluster. This alert is often seen on OpenStack, so it may be related to OpenStack.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-13-061145

How reproducible:
Not sure whether it is related to OpenStack; I did not see this on other IaaS platforms.

Steps to Reproduce:
1. Check the "Kubernetes / Compute Resources / Cluster" dashboard in the console and in Grafana.

Actual results:
CPU Utilisation is a negative number on the "Kubernetes / Compute Resources / Cluster" dashboard.

Expected results:

Additional info:
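As a rough sanity check on the numbers quoted in the description, the two recording-rule values can be combined into a usage-to-capacity ratio. This is only a sketch: it assumes the two rules are comparable with what the dashboard panel means by utilisation, which the report does not confirm. The function name `utilisation_ratio` is made up for illustration.

```python
# Values copied from the Prometheus output quoted in the description.
usage_cores = 7.461142857142807  # cluster:cpu_usage_cores:sum

# cluster:capacity_cpu_cores:sum, keyed by instance type.
capacity_cores = {"ci.m1.large": 12, "ci.m1.xlarge": 24}


def utilisation_ratio(usage, capacity_by_type):
    """Return used cores divided by total capacity (hypothetical helper)."""
    total_capacity = sum(capacity_by_type.values())
    return usage / total_capacity


ratio = utilisation_ratio(usage_cores, capacity_cores)
print(f"{ratio:.4f}")  # prints 0.2073
```

Under that assumption the cluster is using roughly a fifth of its 36 cores, so a panel value of about -0.10 cannot be a real utilisation figure.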
I'm not sure about the etcdGRPCRequestsSlow alert, but the negative CPU utilisation seems to be related to the expression `1 - sum(...)`: Prometheus interpolation can produce per-mode values whose sum exceeds 1, which makes the expression return a negative number. A recent change in kubernetes-mixin [1] should fix the negative CPU utilisation.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745
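The arithmetic behind that explanation can be sketched in a few lines. The per-CPU rates below are invented purely to show the effect; they are not measurements from the affected cluster, and the sketch says nothing about which mode was actually over-counted. The function `dashboard_utilisation` is a hypothetical stand-in for `1 - sum(avg by (mode) (rate(...)))`.

```python
from collections import defaultdict

# Hypothetical per-CPU rates (seconds per second) for the three modes
# matched by the dashboard query. Made-up numbers, not real data.
samples = [
    ("cpu0", "idle", 0.97), ("cpu0", "iowait", 0.02), ("cpu0", "steal", 0.10),
    ("cpu1", "idle", 0.99), ("cpu1", "iowait", 0.03), ("cpu1", "steal", 0.09),
]


def dashboard_utilisation(samples):
    """Mimic 1 - sum(avg by (mode) (...)) over (cpu, mode, rate) tuples."""
    rates_by_mode = defaultdict(list)
    for _cpu, mode, rate in samples:
        rates_by_mode[mode].append(rate)
    # avg by (mode): one averaged rate per mode across all CPUs.
    avg_by_mode = {m: sum(r) / len(r) for m, r in rates_by_mode.items()}
    # sum(...) over the per-mode averages, then subtract from 1.
    return 1 - sum(avg_by_mode.values())


result = dashboard_utilisation(samples)
print(f"{result:.2f}")  # prints -0.10
```

Nothing in the expression bounds the summed averages at 1, so any over-estimate in the individual rates shows up directly as a negative utilisation.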
The upstream changes have already been pulled in as part of https://github.com/openshift/cluster-monitoring-operator/pull/1571
Checked with 4.11.0-0.nightly-2022-03-04-063157: on the "Kubernetes / Compute Resources / Cluster" dashboard, the CPU Utilisation expression is now "cluster:node_cpu:ratio_rate5m{cluster=""}". Checked on AWS and OpenStack clusters; CPU Utilisation was not negative in either the console dashboard or the Grafana dashboard.
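For anyone repeating this check outside the dashboards, the new expression can also be sent to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URL; the base URL is a placeholder, and authentication against the cluster's Prometheus route is deliberately left out because it depends on the environment.

```python
from urllib.parse import urlencode

# Expression now used by the CPU Utilisation panel, as quoted above.
EXPR = 'cluster:node_cpu:ratio_rate5m{cluster=""}'

# Placeholder host: substitute the Prometheus route of the cluster under test.
BASE_URL = "https://prometheus.example.invalid"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


url = build_query_url(BASE_URL, EXPR)
print(url)
```

A value between 0 and 1 in the response would match what was observed in the dashboards.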
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069