Bug 2040635 - CPU Utilisation is negative number for "Kubernetes / Compute Resources / Cluster" dashboard
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Haoyu Sun
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-14 10:35 UTC by Junqi Zhao
Modified: 2022-08-10 10:42 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The `CPU Utilisation` dashboard panel used the formula `1 - sum(avg by (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal"}[%(grafanaIntervalVar)s])))` (a jsonnet template formatted with `$._config`) to calculate CPU utilisation. The result of this formula can turn negative because of interpolation inside Prometheus.
Consequence: The CPU Utilisation panel may show an invalid negative value.
Fix: Integrate the upstream fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745.
Result: The CPU Utilisation panel shows correct values again.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:42:31 UTC
Target Upstream Version:
Embargoed:


Attachments
console dashboard (173.61 KB, image/png)
2022-01-14 10:35 UTC, Junqi Zhao


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes-monitoring kubernetes-mixin pull 745 | 0 | None | Merged | use recording rule metric for cluster cpu utilization | 2022-02-25 11:08:14 UTC
Github | openshift cluster-monitoring-operator pull 1571 | 0 | None | Merged | Bug 2051470: Update prometheus-operator and sync jsonnet | 2022-03-01 06:36:09 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:42:52 UTC

Description Junqi Zhao 2022-01-14 10:35:19 UTC
Created attachment 1850773
console dashboard

Description of problem:
On a 4.10.0-0.nightly-2022-01-13-061145 OpenStack http_proxy cluster, CPU Utilisation on the "Kubernetes / Compute Resources / Cluster" dashboard is a negative number in both the console dashboard and the Grafana dashboard.

# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
OpenStack

CPU Utilisation expression:
1 - sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal", cluster=""}[5m])))
-0.10089052531835652

checked "cluster:cpu_usage_cores:sum" from prometheus
cluster:cpu_usage_cores:sum{prometheus="openshift-monitoring/k8s"}  7.461142857142807

checked "cluster:capacity_cpu_cores:sum" from prometheus
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.large", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", prometheus="openshift-monitoring/k8s"}
12
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master", prometheus="openshift-monitoring/k8s"}
24
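
As a rough cross-check of the values above, the two recording rules can be combined into a utilisation ratio. This is only a sketch built from the numbers quoted in this report; dividing usage by the summed capacity is an assumption made here for illustration and is not the dashboard's formula.

```python
# Values copied from the Prometheus output quoted in this report.
cpu_usage_cores = 7.461142857142807   # cluster:cpu_usage_cores:sum
capacity_cores_by_series = [12, 24]   # cluster:capacity_cpu_cores:sum, one entry per series


def usage_ratio(usage_cores, capacities):
    """Return used cores as a fraction of total capacity (illustrative only)."""
    total_capacity = sum(capacities)
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")
    return usage_cores / total_capacity


ratio = usage_ratio(cpu_usage_cores, capacity_cores_by_series)
print(f"approximate utilisation: {ratio:.3f}")  # roughly 0.207
```

A positive ratio of about 0.2 does not match the -0.10 returned by the dashboard expression, which suggests the expression rather than the underlying usage data is at fault.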

There are also etcdGRPCRequestsSlow alerts in the cluster. This alert is often seen on OpenStack, so it may be related to OpenStack.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-13-061145

How reproducible:
Not sure whether it is related to OpenStack; did not see this on other IaaS platforms.

Steps to Reproduce:
1. check "Kubernetes / Compute Resources / Cluster" dashboard in console and grafana

Actual results:
CPU Utilisation is negative number for "Kubernetes / Compute Resources / Cluster" dashboard

Expected results:
CPU Utilisation is not a negative number.

Additional info:

Comment 3 Arunprasad Rajkumar 2022-02-25 11:08:14 UTC
I'm not sure about the etcdGRPCRequestsSlow alert, but the negative CPU utilisation seems to be related to the expression `1 - sum(...)`: Prometheus interpolation can produce per-mode rates whose sum exceeds 1, which makes the expression return a negative number.

A recent change in k8s-mixin[1] should fix the negative CPU utilisation.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745
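
The reasoning above can be illustrated with a toy calculation. The per-mode rates below are hypothetical numbers chosen so that the idle-like modes sum to slightly more than 1; they are not measurements from the affected cluster. The second function shows one possible alternative (busy time as a share of all observed time) and is only an assumption for comparison, not necessarily what the upstream recording rule computes.

```python
# Hypothetical per-mode CPU rates, averaged across CPUs (fraction of one CPU).
idle_like_rates = {"idle": 0.93, "iowait": 0.02, "steal": 0.15}
busy_rates = {"user": 0.05, "system": 0.03}


def utilisation_by_subtraction(idle_like):
    """Mirror the shape of `1 - sum(...)`: goes negative if the sum exceeds 1."""
    return 1 - sum(idle_like.values())


def utilisation_by_busy_share(busy, idle_like):
    """Busy time divided by all observed time: stays within [0, 1]."""
    busy_total = sum(busy.values())
    observed_total = busy_total + sum(idle_like.values())
    if observed_total <= 0:
        raise ValueError("no CPU time observed")
    return busy_total / observed_total


print("subtraction:", utilisation_by_subtraction(idle_like_rates))
print("busy share: ", utilisation_by_busy_share(busy_rates, idle_like_rates))
```

With these made-up inputs the subtraction form yields a negative value, while the ratio form cannot, since its numerator is non-negative and never larger than its denominator.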

Comment 4 Arunprasad Rajkumar 2022-03-01 06:36:55 UTC
The upstream changes have already been pulled in as part of https://github.com/openshift/cluster-monitoring-operator/pull/1571

Comment 5 Junqi Zhao 2022-03-08 12:56:17 UTC
Checked with 4.11.0-0.nightly-2022-03-04-063157: on the "Kubernetes / Compute Resources / Cluster" dashboard, the CPU Utilisation expression is now "cluster:node_cpu:ratio_rate5m{cluster=""}". Checked on AWS and OpenStack clusters and did not see a negative CPU Utilisation value in either the console dashboard or the Grafana dashboard.
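
A verification like the one above could be scripted by inspecting the JSON returned for an instant query. The sketch below assumes the standard Prometheus HTTP API instant-query response layout (a vector result with `[timestamp, "value"]` pairs); the helper name `negative_samples` and the sample payloads are hypothetical, with only the negative value taken from this report.

```python
import json


def negative_samples(response_text):
    """Return (labels, value) pairs for every series whose sample is negative."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    negatives = []
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        value = float(raw_value)
        if value < 0:
            negatives.append((series["metric"], value))
    return negatives


def make_response(value):
    """Build a hypothetical single-series instant-query response."""
    return json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [1646744177, str(value)]}],
        },
    })


before_fix = make_response(-0.10089052531835652)  # value reported in this bug
after_fix = make_response(0.2)                    # made-up healthy value

print(negative_samples(before_fix))
print(negative_samples(after_fix))
```

An empty list for the queried expression would correspond to the "did not see negative value" observation.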

Comment 9 errata-xmlrpc 2022-08-10 10:42:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

