Bug 1845710
| Summary: | OpenShift overview GUI reports -400,000,000,000m CPU when control plane rebooted s390x | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tom Dale <tdale> |
| Component: | Monitoring | Assignee: | Lili Cosic <lcosic> |
| Status: | CLOSED WORKSFORME | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.4 | CC: | alegrand, alklein, anpicker, cfillekes, chanphil, christian.lapolt, erooth, Holger.Wolf, kakkoyun, krmoser, lcosic, mloibl, nbziouec, pkrupa, rcgingra, spasquie, surbania, tdale |
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | s390x | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-25 13:31:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
Created attachment 1696385 [details]
Normal OCP Monitoring -> Metrics page showing different results for CPU utilization
Created attachment 1696386 [details]
Found that the monitoring page shows a different value, -41m.
As this is an edge case that happens only under certain conditions, we are lowering its priority, but we will fix it in the 4.6 release. Since this is a recording rule, can you provide the individual values for the metrics used by the recording rule:
sum(1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"})
That is, the value of node_cpu_seconds_total at the specific time when the negative value occurred. Thanks!
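(Editorial note: for context, the recording rule above computes cores in use by summing, over every CPU, one minus the per-second idle rate. A rough sketch of that arithmetic with made-up per-CPU idle rates:)

```python
# Hypothetical per-CPU idle fractions, as rate(node_cpu_seconds_total{mode="idle"}[2m])
# would report for each CPU of a node (values are illustrative, not from the bug):
idle_rates = {"cpu0": 0.92, "cpu1": 0.88, "cpu2": 0.95, "cpu3": 0.90}

# The recording rule sums (1 - idle rate) across CPUs, giving cores in use:
cpu_usage_cores = sum(1 - r for r in idle_rates.values())
print(round(cpu_usage_cores, 2))  # 0.35 cores in use on this node
```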
Created attachment 1696523 [details]
sum(1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"})
Is this latest attachment what you want for the recording rule? The graph looks the same as the default one that showed negative CPU usage. My knowledge of OCP monitoring is limited; please let me know how else I can help.
No worries, we just need the value of the metric used in the recording rule. If you can provide the raw value of this metric, node_cpu_seconds_total, for that time, thanks!
Created attachment 1696782 [details]
node_cpu_seconds_total
See the attached graph of both node_cpu_seconds_total and cluster:cpu_usage_cores:sum. The values of node_cpu_seconds_total around the point when cluster:cpu_usage_cores:sum becomes negative are as follows: 595017.338, then 0, then 4504194536.01 at the peak, then back to 59089.15. Aside from the drastic dip and subsequent spike, the values all stay around 59000.
Created attachment 1696792 [details]
Raw values of node_cpu_seconds_total
Also note that node_namespace_pod:kube_pod_info:{pod=~"node-exporter.+"} reports values of "1" throughout.
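(Editorial note: raw samples like these can also be pulled programmatically from the Prometheus HTTP API rather than through the console graph. A minimal sketch of building the instant-query URL; PROM_URL and the timestamp are placeholders:)

```python
import urllib.parse

# Placeholder endpoint: in OpenShift you would typically reach Prometheus via
# the authenticated route or an `oc port-forward` to the prometheus-k8s service.
PROM_URL = "http://localhost:9090"

def instant_query_url(promql, ts):
    """Build a Prometheus /api/v1/query URL for `promql` at unix time `ts`."""
    params = urllib.parse.urlencode({"query": promql, "time": ts})
    return f"{PROM_URL}/api/v1/query?{params}"

# Range selector to capture the raw samples around the incident (hypothetical timestamp):
print(instant_query_url("node_cpu_seconds_total[5m]", 1590500000))
```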
It looks like the raw counter values reset randomly, which causes Prometheus to assume that it missed some scrapes and then apply a wrong extrapolation. I can see at least two different explanations for this:
1. The kernel exposes buggy values for the CPU counters.
2. node_exporter has a bug when it translates the values exposed by the kernel to Prometheus.
What would be interesting is to get a capture of the raw samples over a small time interval. Please provide the result of the following Prometheus query at a point in time when the problem arises: node_cpu_seconds_total[5m]

I haven't been able to replicate this issue now. Perhaps it's been fixed. I will reopen if I see this problem again. Thanks
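(Editorial note: the reset-plus-extrapolation theory can be illustrated with a toy model of how rate() handles a counter that drops. This is a simplified sketch, not Prometheus's actual implementation: any decrease is treated as a counter restart from zero, so a bogus dip to 0 followed by a huge raw value is counted as a genuine, enormous increase:)

```python
def toy_rate(samples, window_s):
    """Simplified model of PromQL rate(): sum the increases, treating any
    decrease as a counter reset (restart from zero), then average over the
    window. `samples` is a list of (timestamp, value) pairs."""
    increase = 0.0
    prev_v = samples[0][1]
    for _, v in samples[1:]:
        # On a decrease, Prometheus assumes a reset and counts the full new value.
        increase += v if v < prev_v else v - prev_v
        prev_v = v
    return increase / window_s  # per-second rate

# A healthy idle counter: roughly one idle second per wall-clock second.
healthy = [(0, 59000.0), (15, 59014.5), (30, 59029.0)]
# The buggy samples reported above: a dip to 0, then a huge spike.
buggy = [(0, 595017.338), (15, 0.0), (30, 4504194536.01)]

print(toy_rate(healthy, 30))  # ~0.97, so 1 - rate is a sane usage fraction
print(toy_rate(buggy, 30))    # ~150 million, so 1 - rate is hugely negative
```

Summed across CPUs and nodes, a per-CPU rate of that magnitude is consistent with the -400,000,000,000m figure shown on the overview page.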
Created attachment 1696384 [details]
Overview page with huge negative number

Description of problem: After rebooting all master nodes, OCP monitoring shows a huge negative CPU utilization on the main page during the time that the master nodes were offline. OCP should give a warning, or at least show 0m, when the control plane was unreachable.

Version-Release number of selected component (if applicable): Server Version: 4.4.0-0.nightly-s390x-2020-05-25-145353

How reproducible: Every time

Steps to Reproduce:
1. ssh into the master nodes and reboot them quickly so that all masters are down at one point.
2. Look at the OCP main overview page.

Attached is a screenshot of the huge negative CPU usage on the overview page. The same data looks fine on the Monitoring -> Metrics GUI page. This seems to point to a problem with just the main OCP page.