Description of problem:
A full description with images is available here: https://github.com/openshift/cluster-monitoring-operator/issues/693

On an OpenShift cluster using k8s-prometheus-adapter, I often see significant differences between the CPU % reported by "kubectl top nodes" and the values reported by logging onto the host and running top, or by the following Prometheus query:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)

For example, "top nodes" shows 74% busy, whereas top on the node shows ~3% idle (so 97% busy).

The query used in OpenShift 4 appears to be:

nodeQuery: sum(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>)

(see https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-adapter/config-map.yaml#L7)

but it does not seem to be accurate. When looking at node CPU, what matters is whether the node is close to being maxed out, and the results from k8s-prometheus-adapter regularly report ~70% busy when the node is in fact running close to 100%.

I also checked Kubernetes clusters using heapster and metrics-server, and both give fairly accurate values for "kubectl top nodes", so this issue seems specific to k8s-prometheus-adapter.

Version-Release number of selected component (if applicable):
All 4.3 versions I've used

How reproducible:
Easy

Steps to Reproduce:
1. Run a workload on the OpenShift cluster. Nothing about the workload seems to matter; I see this with most tests I run.
2. Run "kubectl top nodes" and note the busy % for a node.
3. Either get onto the node, run top, and check the idle CPU, or use the following Prometheus query:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m])) * 100)

Actual results:
"kubectl top nodes" output is lower than the other two methods.
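The two PromQL formulas above can be sketched outside Prometheus to check that their arithmetic agrees. Everything below is illustrative: the function names and the fabricated idle-counter deltas are assumptions for a hypothetical 4-CPU node, not data from the cluster.

```python
# Hedged sketch: the two CPU-usage formulas from this report, applied
# to fabricated idle-counter deltas. All numbers here are invented.

def busy_percent_top_style(idle_deltas, window_seconds):
    """100 - avg(rate(idle)) * 100, like the node-level PromQL query."""
    rates = [d / window_seconds for d in idle_deltas]
    return 100 - (sum(rates) / len(rates)) * 100

def busy_cores_adapter_style(idle_deltas, window_seconds):
    """sum(1 - rate(idle)) per CPU, like the adapter's nodeQuery.

    The result is in cores; kubectl top then divides it by the node's
    allocatable CPU to produce a percentage.
    """
    return sum(1 - d / window_seconds for d in idle_deltas)

# Each of 4 CPUs was idle for 15 of the last 300 seconds (~95% busy).
idle_deltas = [15.0, 15.0, 15.0, 15.0]
print(busy_percent_top_style(idle_deltas, 300.0))    # ~95 (percent)
print(busy_cores_adapter_style(idle_deltas, 300.0))  # ~3.8 (of 4 cores)
```

Arithmetically the two agree (3.8 busy cores of 4 is 95%), so the discrepancy reported above presumably comes from the rest of the adapter query, such as the join against node_namespace_pod:kube_pod_info: or the rate window, rather than from this formula itself.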
Expected results:
"top nodes" output is similar to the other two methods.

Additional info:
Tested with 4.5.0-0.nightly-2020-03-25-200754. Checked on one node; there is not much difference in CPU usage between the following two queries.

Prometheus:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}[10m])) * 100)

Element: {instance="ip-10-0-152-12.ap-northeast-2.compute.internal"}
Value: 23.26666666667127

# kubectl top node ip-10-0-152-12.ap-northeast-2.compute.internal
NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-152-12.ap-northeast-2.compute.internal   899m         25%    4253Mi          29%
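As a sanity check on the numbers above: "kubectl top node" reports CPU% as usage divided by the node's allocatable CPU. The 3500m allocatable figure below is an assumption (plausible for a 4-vCPU AWS worker after system reservations), not a value read from this cluster.

```python
# Assumed allocatable: ~3500m for a 4-vCPU worker (not measured here).
def top_cpu_percent(usage_millicores, allocatable_millicores):
    # kubectl top node CPU% = usage / allocatable, rounded
    return round(usage_millicores / allocatable_millicores * 100)

print(top_cpu_percent(899, 3500))  # 26, consistent with the 25% above
```

With that assumption, 899m comes out near 25-26%, which lines up with both the kubectl output and the ~23% Prometheus value, i.e. the fix brings the two methods into agreement.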
*** Bug 1816500 has been marked as a duplicate of this bug. ***
*** Bug 1850270 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409