2040635 – CPU Utilisation is negative number for "Kubernetes / Compute Resources / Cluster" dashboard



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Haoyu Sun
QA Contact:	Junqi Zhao
Docs Contact:	Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-14 10:35 UTC by Junqi Zhao
Modified:	2022-08-10 10:42 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The `CPU Utilisation` dashboard panel used the formula `'1 - sum(avg by (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal"}[%(grafanaIntervalVar)s])))' % $._config` to calculate CPU utilisation of a node, which can turn negative because of interpolation inside Prometheus. Consequence: The CPU Utilisation panel could show an invalid negative value. Fix: Integrate the upstream fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745. Result: The CPU Utilisation panel shows correct values again.
Clone Of:
Environment:
Last Closed:	2022-08-10 10:42:31 UTC
Target Upstream Version:
Embargoed:

Attachments
console dashboard (173.61 KB, image/png) 2022-01-14 10:35 UTC, Junqi Zhao

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes-monitoring kubernetes-mixin pull 745	None	Merged	use recording rule metric for cluster cpu utilization	2022-02-25 11:08:14 UTC
Github	openshift cluster-monitoring-operator pull 1571	None	Merged	Bug 2051470: Update prometheus-operator and sync jsonnet	2022-03-01 06:36:09 UTC
Red Hat Product Errata	RHSA-2022:5069	None	None	None	2022-08-10 10:42:52 UTC

Description Junqi Zhao 2022-01-14 10:35:19 UTC

Created attachment 1850773
console dashboard

Description of problem:
On a 4.10.0-0.nightly-2022-01-13-061145 OpenStack http_proxy cluster, the CPU Utilisation panel of the "Kubernetes / Compute Resources / Cluster" dashboard shows a negative number in both the console dashboard and the Grafana dashboard.

# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
OpenStack

CPU Utilisation expression:
1 - sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal", cluster=""}[5m])))
-0.10089052531835652
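
To see how this expression can drop below zero, here is a minimal sketch of the arithmetic in Python. The per-mode averages are hypothetical values, chosen only so that one set sums to slightly more than 1; they are not measurements from this cluster.

```python
def cpu_utilisation(avg_rate_by_mode):
    """Mirror `1 - sum(avg by (mode) (rate(...)))` for already-averaged per-mode rates."""
    return 1 - sum(avg_rate_by_mode.values())


# Hypothetical per-mode averages, in CPU-seconds per second.
healthy = {"idle": 0.70, "iowait": 0.05, "steal": 0.05}
# Hypothetical case where the three non-busy modes add up to more than 1.
overshoot = {"idle": 0.95, "iowait": 0.05, "steal": 0.10}

print(cpu_utilisation(healthy))    # positive utilisation
print(cpu_utilisation(overshoot))  # negative utilisation
```

Nothing in the expression bounds the result at zero, so any overshoot in the summed rates shows up directly as a negative utilisation.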

checked "cluster:cpu_usage_cores:sum" from prometheus
cluster:cpu_usage_cores:sum{prometheus="openshift-monitoring/k8s"}  7.461142857142807

checked "cluster:capacity_cpu_cores:sum" from prometheus
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.large", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", prometheus="openshift-monitoring/k8s"}
12
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master", prometheus="openshift-monitoring/k8s"}
24
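
As a rough cross-check, the two recording rules above can be combined by hand. The sketch below assumes that dividing cluster:cpu_usage_cores:sum by the total of cluster:capacity_cpu_cores:sum is a fair estimate of utilisation; that is an assumption for illustration, not the dashboard's formula.

```python
# Values copied from the Prometheus output quoted above.
usage_cores = 7.461142857142807
capacity_cores = [12, 24]  # worker and master capacity series

# Assumed estimate: used cores divided by total cores.
estimated_utilisation = usage_cores / sum(capacity_cores)
print(f"{estimated_utilisation:.4f}")  # 0.2073
```

Under that assumption the cluster is roughly 21% busy, which suggests the negative dashboard value comes from the expression and not from the underlying usage.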

There are also etcdGRPCRequestsSlow alerts in the cluster. This alert is often seen on OpenStack, so it may be related to OpenStack.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-13-061145

How reproducible:
Not sure whether it is related to OpenStack; did not see this on other IaaS platforms.

Steps to Reproduce:
1. Check the "Kubernetes / Compute Resources / Cluster" dashboard in the console and in Grafana.

Actual results:
CPU Utilisation is a negative number on the "Kubernetes / Compute Resources / Cluster" dashboard.

Expected results:


Additional info:

Comment 3 Arunprasad Rajkumar 2022-02-25 11:08:14 UTC

I'm not sure about the etcdGRPCRequestsSlow alert, but the negative CPU utilisation seems to be related to the expression `1 - sum(...)`, where Prometheus interpolation might produce values that make the expression return a negative number.

A recent change in k8s-mixin[1] should fix the negative CPU utilisation problem.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745
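
One way to picture how a per-second rate can exceed 1 is a timestamp that does not line up with the moment the counter was read. The sketch below uses a simplified two-point slope, which is not Prometheus's actual `rate()` algorithm (that also extrapolates to the window boundaries). The sample values and the 5-second skew are hypothetical.

```python
def two_point_rate(samples):
    """Per-second slope between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical idle counter for one fully idle CPU: 1 CPU-second per second.
exact = [(0.0, 1000.0), (300.0, 1300.0)]
# Same counter values, but the last sample is stamped 5 s early (hypothetical skew).
skewed = [(0.0, 1000.0), (295.0, 1300.0)]

print(two_point_rate(exact))       # exactly 1 CPU-second per second
print(1 - two_point_rate(skewed))  # drops below zero
```

In this simplified model a small timing error is enough to push the idle rate above 1, and `1 - sum(...)` then goes negative.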

Comment 4 Arunprasad Rajkumar 2022-03-01 06:36:55 UTC

The upstream changes have already been pulled in as part of https://github.com/openshift/cluster-monitoring-operator/pull/1571

Comment 5 Junqi Zhao 2022-03-08 12:56:17 UTC

Checked with 4.11.0-0.nightly-2022-03-04-063157: the CPU Utilisation expression on the "Kubernetes / Compute Resources / Cluster" dashboard is now "cluster:node_cpu:ratio_rate5m{cluster=""}". Checked on AWS and OpenStack clusters; no negative CPU Utilisation value appeared in either the console dashboard or the Grafana dashboard.
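
To repeat this check outside the dashboard, the new expression can be sent to the Prometheus instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URL. The base URL is a hypothetical placeholder, and authentication, which an OpenShift cluster would normally require, is left out.

```python
from urllib.parse import urlencode

# Expression used by the fixed dashboard, as quoted in the comment above.
EXPR = 'cluster:node_cpu:ratio_rate5m{cluster=""}'


def instant_query_url(base_url, expr):
    """Build a Prometheus instant-query URL with the expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Hypothetical Prometheus address; replace with the real route of the cluster.
url = instant_query_url("https://prometheus.example.invalid", EXPR)
print(url)
```

The value returned for this query should be a ratio between 0 and 1 if the fix is in place.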

Comment 9 errata-xmlrpc 2022-08-10 10:42:31 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
