Bug 1812004

Summary: Node CPU stats are not accurate in Openshift 4.3
Product: OpenShift Container Platform
Reporter: Dan McGinnes <MCGINNES>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mdhanve, mloibl, pkrupa, surbania, syangsao
Target Milestone: ---
Target Release: 4.5.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: usage of rate() was smoothing statistics data
Consequence: spikes in CPU usage weren't represented
Fix: rate() was changed to irate()
Result: spikes are shown and `oc adm top` UX is similar to linux `top` utility
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:19:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dan McGinnes 2020-03-10 11:26:05 UTC
Description of problem:
A full description with images etc. is available here -> https://github.com/openshift/cluster-monitoring-operator/issues/693

I have noticed that on an OpenShift cluster using k8s-prometheus-adapter I often see significant differences between the CPU % reported by kubectl top nodes and the values obtained by logging onto the host and running top, or by the following query in Prometheus:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)
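For reference, the way that expression works (assuming standard node_exporter counter semantics) is:

irate(node_cpu_seconds_total{mode="idle"}[10m])   -> per-second increase of the idle counter over the last two samples, i.e. the fraction of time each CPU spent idle
avg by (instance) (...)                           -> idle fraction averaged across all CPUs of the node
100 - (... * 100)                                 -> busy %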

e.g.:
kubectl top nodes shows a node as 74% busy, whereas top on the node shows ~3% idle (so ~97% busy)

It looks like the query being used in OpenShift 4 is:

nodeQuery: sum(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)

-> https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-adapter/config-map.yaml#L7

But this does not seem to be accurate.

When looking at node CPU, what I really care about is whether the node is close to being maxed out - and currently the results from k8s-prometheus-adapter are not accurate, as I regularly see them reporting ~70% busy when the node is in fact running close to 100%.

I also checked this on Kubernetes clusters using heapster and metrics-server, and both of those give fairly accurate values for kubectl top nodes - so this issue seems specific to k8s-prometheus-adapter.
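For what it's worth, the fix recorded in the Doc Text above replaces rate() with irate() in that expression; a sketch of the adjusted nodeQuery, assuming the rest of the prometheus-adapter config-map stays unchanged, would be:

nodeQuery: sum(1 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)

Over a 5m window rate() averages the counter increase across the whole window, while irate() uses only the last two samples, so short CPU spikes are reflected instead of being smoothed away.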


Version-Release number of selected component (if applicable):
All 4.3 versions I've used


How reproducible:
Easy


Steps to Reproduce:
1. Run a workload on an OpenShift cluster - I don't think there has to be anything specific about the workload; I see this with most tests I run
2. Run kubectl top nodes and note the busy % for a node
3. Either get onto the node and run top to check the idle CPU, or use the following Prometheus query (see the per-node sketch below): 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)
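As a sketch, the values from steps 2 and 3 can be compared side by side for a single node like this (<node-name> is a placeholder for the node/instance name; the PromQL line is evaluated in the Prometheus web console rather than in the shell):

kubectl top node <node-name>
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",instance="<node-name>"}[10m])) * 100)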

Actual results:
kubectl top nodes output is lower than the values from the other two methods.

Expected results:
kubectl top nodes output is similar to the other two methods

Additional info:

Comment 3 Junqi Zhao 2020-03-26 09:21:32 UTC
Tested with 4.5.0-0.nightly-2020-03-25-200754. Checking on one node, there is not much difference in CPU usage between the following two methods:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}[10m])) * 100)
Element 	Value
{instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}	23.26666666667127

# kubectl top node ip-10-0-152-12.ap-northeast-2.compute.internal
NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-152-12.ap-northeast-2.compute.internal   899m         25%    4253Mi          29%
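For reference, kubectl top node computes CPU% as usage divided by the node's allocatable CPU; assuming roughly 3.5 allocatable cores on this instance (the allocatable value isn't shown here), 899m / ~3500m ≈ 25% busy, which lines up with the ~23.3% busy returned by the irate() query above.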

Comment 4 Pawel Krupa 2020-03-26 13:48:10 UTC
*** Bug 1816500 has been marked as a duplicate of this bug. ***

Comment 5 Samuel Padgett 2020-06-26 12:55:41 UTC
*** Bug 1850270 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-07-13 17:19:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409