Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1845710

Summary: Openshift overview GUI reports -400,000,000,000m CPU when control plane rebooted s390x
Product: OpenShift Container Platform
Reporter: Tom Dale <tdale>
Component: Monitoring
Assignee: Lili Cosic <lcosic>
Status: CLOSED WORKSFORME
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: low
Docs Contact:
Version: 4.4
CC: alegrand, alklein, anpicker, cfillekes, chanphil, christian.lapolt, erooth, Holger.Wolf, kakkoyun, krmoser, lcosic, mloibl, nbziouec, pkrupa, rcgingra, spasquie, surbania, tdale
Target Milestone: ---
Target Release: 4.6.0
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-25 13:31:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Embargoed:
Attachments (description / flags):
- Overview page with huge negative number (flags: none)
- Normal OCP Monitoring -> Metrics page showing different results for CPU utilization (flags: none)
- Found that monitoring page shows a different value, -41m (flags: none)
- sum(1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"}) (flags: none)
- node_cpu_seconds_total (flags: none)
- Raw values of node_cpu_seconds_total (flags: none)

Description Tom Dale 2020-06-09 20:52:41 UTC
Created attachment 1696384 [details]
Overview page with huge negative number

Description of problem:
After rebooting all master nodes, OCP monitoring shows a huge negative CPU utilization on the main page for the period during which the master nodes were offline. OCP should give a warning, or at least show 0m, when the control plane is unreachable.

Version-Release number of selected component (if applicable):
Server Version: 4.4.0-0.nightly-s390x-2020-05-25-145353


How reproducible: Every time


Steps to Reproduce:
1. ssh into the master nodes and reboot them in quick succession so that all masters are down at the same time.
2. Look at ocp main overview page.

Attached is a screenshot of the huge negative CPU usage on the overview page. The same data looks fine on the Monitoring -> Metrics GUI page. This seems to point to a problem with just the main OCP overview page.

Comment 1 Tom Dale 2020-06-09 20:54:17 UTC
Created attachment 1696385 [details]
Normal OCP Monitoring -> Metrics page showing different results for CPU utilization

Comment 2 Tom Dale 2020-06-09 20:58:04 UTC
Created attachment 1696386 [details]
Found that the monitoring page shows a different value, -41m.

Comment 4 Lili Cosic 2020-06-10 07:56:35 UTC
As this is an edge case that happens only under certain conditions, I am lowering its priority. But we will fix it in the 4.6 release.

Comment 5 Lili Cosic 2020-06-10 10:45:47 UTC
As this is a recording rule, can you provide the individual values of the metrics used in the recording rule:

sum(1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"})


That is, the value of node_cpu_seconds_total at the specific time when the negative value occurred, thanks!

Comment 6 Tom Dale 2020-06-10 15:02:16 UTC
Created attachment 1696523 [details]
sum(1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"})

Is this latest attachment what you want for the recording rule? The graph looks the same as the default one that showed negative CPU usage. My knowledge of OCP monitoring is limited; please let me know how else I can help.

Comment 7 Lili Cosic 2020-06-10 15:33:55 UTC
No worries, just need the value of the metric used in the recording rule, so if you can just provide raw value of this metric -> node_cpu_seconds_total for that time, thanks!

Comment 8 Tom Dale 2020-06-11 13:39:23 UTC
Created attachment 1696782 [details]
node_cpu_seconds_total

See the attached graph of both node_cpu_seconds_total and cluster:cpu_usage_cores:sum. The values of node_cpu_seconds_total around the point where cluster:cpu_usage_cores:sum becomes negative are as follows: 595017.338, then 0, then 4504194536.01 at the peak, then back to 59089.15. Apart from the drastic dip and uptick, the values all stay around 59000.

Comment 9 Tom Dale 2020-06-11 14:10:04 UTC
Created attachment 1696792 [details]
Raw values of node_cpu_seconds_total

Comment 10 Tom Dale 2020-06-11 14:11:08 UTC
Also note that node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"} reports values of "1" throughout.

Comment 16 Simon Pasquier 2020-09-11 12:26:31 UTC
It looks like the raw counter values reset randomly, which causes Prometheus to assume that it missed some scrapes and then apply a wrong extrapolation.

I can see at least 2 different explanations for this:
1. the kernel exposes buggy values for CPU counters.
2. node_exporter has a bug when it translates the values exposed by the kernel to Prometheus.

What would be interesting is to get a capture of the raw samples over a small time interval.
Please provide the result of the following Prometheus query at a point in time when the problem arises:
node_cpu_seconds_total[5m]
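For illustration, here is a minimal sketch of Prometheus-style counter-reset handling, using the sample values reported in comment 8. This is my own simplified model, not the actual Prometheus implementation (the real rate() also extrapolates to the window boundaries), but it shows how a spurious reset can inflate the rate enough to drive `1 - rate(...)` hugely negative:

```python
def counter_increase(values):
    """Total increase of a counter series with Prometheus-style reset
    handling: any decrease is treated as a counter reset, so the new
    value is counted as fresh increase on top of everything before it."""
    inc = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            inc += cur - prev
        else:
            inc += cur  # assumed reset: counter restarted at 0, grew to cur
    return inc

# node_cpu_seconds_total values reported in comment 8:
samples = [595017.338, 0.0, 4504194536.01, 59089.15]

inc = counter_increase(samples)   # the bogus 4.5e9 spike counts entirely as increase
rate = inc / 120.0                # per-second rate over a 2m window
negative_usage = 1 - rate         # far below zero, as on the overview page
```

Because the jump from 0 back up to 4504194536.01 is interpreted as legitimate counter growth after a reset, the per-second rate explodes to tens of millions, and the `1 - rate(...)` expression in the recording rule goes massively negative.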

Comment 17 Tom Dale 2020-09-25 13:31:52 UTC
I haven't been able to replicate this issue now. Perhaps it's been fixed. I will reopen if I see this problem again. Thanks.