Bug 2040635

Summary:

CPU Utilisation is negative number for "Kubernetes / Compute Resources / Cluster" dashboard

Product:

OpenShift Container Platform

Reporter:

Junqi Zhao <juzhao>

Component:

Monitoring

Assignee:

Haoyu Sun <hasun>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Brian Burt <bburt>

Priority:

medium

Version:

4.10

CC:

amuller, anpicker, aos-bugs, bburt, spasquie

Target Milestone:

---

Target Release:

4.11.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Doc Text:

Cause: The `CPU Utilisation` dashboard panel used the formula `'1 - sum(avg by (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode=~"idle|iowait|steal"}[%(grafanaIntervalVar)s])))' % $._config` to calculate the CPU utilisation of a node. Because of interpolation inside Prometheus, the result of this formula can turn negative.
Consequence: The CPU Utilisation dashboard may show an invalid negative value.
Fix: Integrate the upstream fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745.
Result: The CPU Utilisation dashboard shows correct values again.

Last Closed:

2022-08-10 10:42:31 UTC

Type:

Bug

Attachments:

Description	Flags
console dashboard	none

Description Junqi Zhao 2022-01-14 10:35:19 UTC

Created attachment 1850773
console dashboard

Description of problem:
On a 4.10.0-0.nightly-2022-01-13-061145 OpenStack http_proxy cluster, the CPU Utilisation value on the "Kubernetes / Compute Resources / Cluster" dashboard is a negative number in both the console dashboard and the Grafana dashboard.

# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
OpenStack

CPU Utilisation expression:
1 - sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal", cluster=""}[5m])))
-0.10089052531835652
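
For reference, the query above appears to be the expanded form of the templated formula quoted in the Doc Text field. A minimal Python sketch of that substitution follows. It uses Python's `%` formatting as a stand-in for the jsonnet operator, and the two config values are assumptions inferred from the expanded query, not taken from the mixin's config. The query reported above additionally carries a `cluster=""` matcher, which the quoted template does not include.

```python
# Templated formula as quoted in the Doc Text field.
TEMPLATE = (
    '1 - sum(avg by (mode) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, '
    'mode=~"idle|iowait|steal"}[%(grafanaIntervalVar)s])))'
)

# Assumed values, inferred from the expanded query in this report.
config = {
    "nodeExporterSelector": 'job="node-exporter"',
    "grafanaIntervalVar": "5m",
}

# Substitute the named placeholders to obtain the PromQL expression.
expr = TEMPLATE % config
print(expr)
```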

Checked "cluster:cpu_usage_cores:sum" in Prometheus:
cluster:cpu_usage_cores:sum{prometheus="openshift-monitoring/k8s"}  7.461142857142807

Checked "cluster:capacity_cpu_cores:sum" in Prometheus:
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.large", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", prometheus="openshift-monitoring/k8s"}
12
cluster:capacity_cpu_cores:sum{label_beta_kubernetes_io_instance_type="ci.m1.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master", prometheus="openshift-monitoring/k8s"}
24
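
As a rough cross-check, the two recording rules above can be combined into a utilisation ratio. This is only a sketch: it assumes the two capacity series shown are the only ones in the cluster, and that usage divided by capacity is comparable to what the dashboard panel is meant to show.

```python
# Values copied from the Prometheus results above.
usage_cores = 7.461142857142807   # cluster:cpu_usage_cores:sum
capacity_cores = [12, 24]         # cluster:capacity_cpu_cores:sum (two series)

# Assumption: total capacity is the sum of the listed series.
total_capacity = sum(capacity_cores)
utilisation = usage_cores / total_capacity

print(f"total capacity: {total_capacity} cores")
print(f"utilisation: {utilisation:.4f}")
```

Under those assumptions the ratio comes out at roughly 0.21, a positive value, which supports the view that the negative dashboard number comes from the panel expression and not from the underlying CPU usage.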

There are also etcdGRPCRequestsSlow alerts in the cluster. This alert is often seen on OpenStack, so it may be related to OpenStack.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-13-061145

How reproducible:
Not sure whether it is related to OpenStack; did not see this on other IaaS platforms.

Steps to Reproduce:
1. Check the "Kubernetes / Compute Resources / Cluster" dashboard in the console and in Grafana.

Actual results:
CPU Utilisation is a negative number on the "Kubernetes / Compute Resources / Cluster" dashboard.

Expected results:


Additional info:

Comment 3 Arunprasad Rajkumar 2022-02-25 11:08:14 UTC

I'm not sure about the etcdGRPCRequestsSlow alert, but the negative CPU utilisation seems to be related to the expression `1 - sum(...)`: Prometheus interpolation can push the summed idle-side rates slightly above 1, which makes the expression return a negative number.

A recent change in k8s-mixin [1] should fix the problem of negative CPU utilisation.

[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/745
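
To illustrate the mechanism, here is a small Python sketch with made-up per-CPU rates (not data from this cluster). It assumes that interpolation lets the idle, iowait and steal rates sum to slightly more than 1 per CPU, and compares the `1 - sum(...)` form with a ratio built directly from the busy modes. The `busy_ratio` function is a hypothetical alternative for illustration only; it is not claimed to be the exact definition used by the upstream fix.

```python
from statistics import mean

IDLE_MODES = ("idle", "iowait", "steal")

# Hypothetical per-mode rates (CPU-seconds per second), one entry per CPU.
# The idle-side values are chosen so that they sum to a bit more than 1.
rates = {
    "idle":   [0.97, 1.02],
    "iowait": [0.03, 0.02],
    "steal":  [0.02, 0.01],
    "user":   [0.01, 0.02],
    "system": [0.01, 0.01],
}

def one_minus_idle(rates):
    """Mirror of 1 - sum(avg by (mode) (...)) over the idle-side modes."""
    return 1 - sum(mean(rates[mode]) for mode in IDLE_MODES)

def busy_ratio(rates):
    """Sum of busy-mode rates divided by the number of CPUs."""
    n_cpus = len(next(iter(rates.values())))
    busy = sum(sum(v) for mode, v in rates.items() if mode not in IDLE_MODES)
    return busy / n_cpus

print(one_minus_idle(rates))  # negative with these inputs
print(busy_ratio(rates))      # cannot be negative for non-negative rates
```

The second form is a quotient of non-negative quantities, so it cannot drop below zero even when the idle-side rates overshoot.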

Comment 4 Arunprasad Rajkumar 2022-03-01 06:36:55 UTC

The upstream changes have already been pulled in as part of https://github.com/openshift/cluster-monitoring-operator/pull/1571

Comment 5 Junqi Zhao 2022-03-08 12:56:17 UTC

Checked with 4.11.0-0.nightly-2022-03-04-063157: on the "Kubernetes / Compute Resources / Cluster" dashboard, the CPU Utilisation expression is now "cluster:node_cpu:ratio_rate5m{cluster=""}". Checked on AWS and OpenStack clusters and did not see a negative value for CPU Utilisation in either the console dashboard or the Grafana dashboard.
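
A check like this could also be scripted against the result of a Prometheus range query. The sketch below assumes the standard JSON shape returned by the Prometheus HTTP API for a matrix result. The payloads are invented for illustration and are not output from the clusters tested here, and `negative_samples` is a hypothetical helper.

```python
import json

def negative_samples(response):
    """Return (timestamp, value) pairs below zero from a range-query response."""
    found = []
    for series in response["data"]["result"]:
        for ts, value in series["values"]:
            # Prometheus encodes sample values as strings.
            if float(value) < 0:
                found.append((ts, float(value)))
    return found

# Invented payload in the shape of a matrix result with only positive samples.
good = json.loads("""
{"status": "success",
 "data": {"resultType": "matrix",
          "result": [{"metric": {"__name__": "cluster:node_cpu:ratio_rate5m"},
                      "values": [[1646744100, "0.18"], [1646744400, "0.21"]]}]}}
""")

# Invented payload containing one negative sample, to show detection.
bad = json.loads("""
{"status": "success",
 "data": {"resultType": "matrix",
          "result": [{"metric": {},
                      "values": [[1642156500, "0.05"], [1642156800, "-0.1"]]}]}}
""")

print(negative_samples(good))
print(negative_samples(bad))
```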

Comment 9 errata-xmlrpc 2022-08-10 10:42:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069