Bug 1812004 - Node CPU stats are not accurate in Openshift 4.3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: All
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1816500 1850270
Depends On:
Blocks:
 
Reported: 2020-03-10 11:26 UTC by Dan McGinnes
Modified: 2020-07-31 09:10 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: use of rate() smoothed the statistics data.
Consequence: spikes in CPU usage were not represented.
Fix: rate() was changed to irate().
Result: spikes are shown and the `oc adm top` UX is similar to the Linux `top` utility.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:19:21 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 720 None closed update prometheus-operator to v0.38.0 2020-08-05 23:23:43 UTC

Description Dan McGinnes 2020-03-10 11:26:05 UTC
Description of problem:
A full description with images etc. is available here -> https://github.com/openshift/cluster-monitoring-operator/issues/693

I have noticed on an OpenShift cluster using k8s-prometheus-adapter that I often see significant differences between the CPU % reported by kubectl top nodes and the values reported by logging onto the host and running top, or by the following Prometheus query:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)

e.g.:
kubectl top nodes shows 74% busy, whereas top on the node shows ~3% idle (so ~97% busy)

It looks like the query being used in OpenShift 4 is:

nodeQuery: sum(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)

-> https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-adapter/config-map.yaml#L7

But this does not seem to be accurate.

When looking at node CPU I really care about whether the node is close to being maxed out - and currently the results from k8s-prometheus-adapter are not accurate: I regularly see them reporting ~70% busy when the node is in fact running close to 100%.

I also checked this on Kube clusters using heapster and metrics-server, and both of those give fairly accurate values for kubectl top nodes - so this issue seems specific to k8s-prometheus-adapter.
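For context on the eventual fix (the Doc Text above records that rate() was replaced with irate()): rate() averages the counter's increase over the whole lookback window, while irate() uses only the last two samples, so a short burst of load is smoothed away by rate() but visible to irate(). A minimal sketch with simplified re-implementations of both functions on synthetic idle-counter samples (the scrape interval and values are made up for illustration; this is not the real PromQL engine):

```python
# Simplified illustration of Prometheus rate() vs irate() on a counter.
# Samples: (timestamp_seconds, counter_value) for an idle-CPU-seconds counter.

def rate(samples):
    """Average per-second increase over the whole window (smooths spikes)."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def irate(samples):
    """Per-second increase from the last two samples only (shows spikes)."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Idle-seconds counter scraped every 15s; the node goes ~100% busy
# (idle nearly stops increasing) only in the final scrape interval.
samples = [(t, t * 0.9) for t in range(0, 300, 15)]   # ~90% idle for ~5m
samples.append((300, samples[-1][1] + 0.3))           # last 15s: ~2% idle

print(f"rate  -> busy {100 * (1 - rate(samples)):.0f}%")   # smoothed low
print(f"irate -> busy {100 * (1 - irate(samples)):.0f}%")  # spike visible
```

With these samples, rate() reports the node as mostly idle while irate() shows it pegged, mirroring the kubectl top nodes vs. top discrepancy described above.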


Version-Release number of selected component (if applicable):
All 4.3 versions I've used


How reproducible:
Easy


Steps to Reproduce:
1. Run a workload on an OpenShift cluster - I don't think there has to be anything specific about the workload; I see it with most tests I run
2. Run kubectl top nodes and see busy % for a node
3. Either get onto the node, run top, and check the idle CPU, or use the following Prometheus query: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)
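The query in step 3 can also be run against the cluster's Prometheus HTTP API rather than the console. A hypothetical helper sketch, assuming you have a reachable Prometheus route URL and a bearer token (e.g. from `oc whoami -t`) - both are placeholders to substitute with values from your own cluster:

```python
import json
import urllib.parse
import urllib.request

# The busy-% expression from step 3 of the reproduction.
QUERY = '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)'

def busy_by_instance(api_response):
    """Parse a Prometheus /api/v1/query response into {instance: busy_pct}."""
    return {r["metric"]["instance"]: float(r["value"][1])
            for r in api_response["data"]["result"]}

def query_prometheus(base_url, token):
    """Run QUERY via the Prometheus HTTP API; base_url/token are placeholders."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return busy_by_instance(json.load(resp))
```

Usage would look like `query_prometheus("https://prometheus-k8s-openshift-monitoring.apps.example.com", token)`, after which the returned per-instance busy % can be compared directly with the kubectl top nodes output.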

Actual results:
kubectl top nodes output is lower than that of the other two methods.

Expected results:
kubectl top nodes output is similar to that of the other two methods

Additional info:

Comment 3 Junqi Zhao 2020-03-26 09:21:32 UTC
Tested with 4.5.0-0.nightly-2020-03-25-200754; checked on one node, and there is not much difference in CPU usage between the following two queries

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}[10m])) * 100)
Element 	Value
{instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}	23.26666666667127

# kubectl top node ip-10-0-152-12.ap-northeast-2.compute.internal
NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-152-12.ap-northeast-2.compute.internal   899m         25%    4253Mi          29%

Comment 4 Pawel Krupa 2020-03-26 13:48:10 UTC
*** Bug 1816500 has been marked as a duplicate of this bug. ***

Comment 5 Samuel Padgett 2020-06-26 12:55:41 UTC
*** Bug 1850270 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-07-13 17:19:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

