Bug 1878766 - CPU consumption on nodes is higher than the CPU count of the node.
Summary: CPU consumption on nodes is higher than the CPU count of the node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: s390x
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1878164
Depends On:
Blocks: ocp-46-z-tracker 1910006
 
Reported: 2020-09-14 13:29 UTC by wvoesch
Modified: 2021-02-24 17:19 UTC
CC List: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the metrics reporting CPU consumption counted CPU steal time, which is the time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. Consequence: the reported CPU usage could be higher than the CPU count of the node. Fix: the metrics reporting CPU consumption no longer take CPU steal time into account. Result: the reported CPU usage reflects the real usage.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:17:46 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
screenshot of a higher CPU consumption than the CPU count of that node. (244.03 KB, image/png)
2020-09-14 13:29 UTC, wvoesch
no flags Details
does not find the issue on AWS (158.43 KB, image/png)
2020-09-21 02:58 UTC, Junqi Zhao
no flags Details
master-0 node info. (72.51 KB, image/png)
2020-10-05 11:22 UTC, alisha
no flags Details
available cpu is negative value, reproduced on AWS (93.89 KB, image/png)
2020-12-18 03:37 UTC, hongyan li
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 993 0 None closed Add Thanos sidecar metrics + alerts 2021-01-24 17:01:01 UTC
Github prometheus-operator kube-prometheus pull 796 0 None closed CPU steal time shouldn't be counted into CPU usage 2021-01-24 17:01:01 UTC
IBM Linux Technology Center 188286 0 None None None 2020-09-24 09:41:10 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:18:22 UTC

Description wvoesch 2020-09-14 13:29:10 UTC
Created attachment 1714793 [details]
screenshot of a higher CPU consumption than the CPU count of that node.

Description of problem:

On several clusters I have observed that the OCP dashboard shows a higher CPU consumption than a) the actual consumption and b) the number of CPUs actually configured for the node.
For example, a master node with 4 CPUs shows a CPU consumption of 8.37 CPUs and -4,370.686m CPU is still available. Please see the screenshot.


I observed this on three different clusters.

The clusters include nodes with ECKD and FCP disks, direct attached OSA and HiperSockets. Typically the clusters have three masters and 2 workers, with 2 CPUs and 8 GB each. But one cluster has 3 workers with 10 CPUs and 32 GB each.

Because of the size, I just added some basic information below. Please let me know what information you need for further debugging and I shall provide it happily. 
Thank you. 


Additional info:

###############################################################################################################################################
Cluster identifier: s8343008, Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 2 workers [8 GB, 2 CPUs], 4x fcp 120 GB, 1x 50 GB dasd, direct attached OSA, age: 3d21h


master 3: configured 4 CPUs
OCP dashboard: CPU consumption ~6.09 of 4. -2,052.037m available 

on the node:
top - 08:43:24 up 3 days, 22:14,  1 user,  load average: 47.41, 40.55, 36.88
Tasks: 322 total,   5 running, 314 sleeping,   0 stopped,   3 zombie
%Cpu(s): 27.6 us,  9.7 sy,  0.0 ni, 14.9 id,  0.1 wa,  1.6 hi,  1.7 si, 44.3 st
MiB Mem :  16097.1 total,    101.4 free,   9331.9 used,   6663.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9982.2 avail Mem

Note the 3 zombies; on the other masters there are 3 and 2 respectively.

# sudo hyptop
system   #cpu    cpu   Cpu+   online memuse memmax wcur
(str)     (#)    (%)   (hm)    (dhm)  (GiB)  (GiB)  (#)
S8343012    4 272.16 485:02 21:18:53  15.76  16.00  100
=:V:N       4 272.16 485:02 21:18:53  15.76  16.00  100

###############################################################################################################################################
Cluster identifier: s8343001, Version: 4.6.0-0.nightly-s390x-2020-09-07-065507 RHCOS: 46.82.202009042339-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 3 workers [32 GB, 10 CPUs], 6x fcp 120 GB, HiperSockets, age: 3d23h

master 1: configured 4 CPUs
OCP dashboard: CPU consumption ~4.95  of 4. -947.756m available  

top - 12:52:43 up 3 days,  5:53,  1 user,  load average: 30.24, 25.30, 24.83
Tasks: 308 total,   4 running, 303 sleeping,   0 stopped,   1 zombie
%Cpu(s): 23.0 us,  7.5 sy,  0.0 ni,  7.7 id,  0.1 wa,  1.0 hi,  1.4 si, 59.2 st
MiB Mem :  15962.1 total,    711.8 free,   8834.4 used,   6415.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  11069.8 avail Mem

hyptop:
system   #cpu    cpu   Cpu+   online memuse memmax wcur
(str)     (#)    (%)   (hm)    (dhm)  (GiB)  (GiB)  (#)
S8343003    4 268.16 494:21 52:21:10  15.82  15.87  100
=:V:N       4 268.16 494:21 52:21:10  15.82  15.87  100

###############################################################################################################################################
Cluster identifier: s8343022, Version: 4.6.0-0.nightly-s390x-2020-08-27-080214 RHCOS: 46.82.202008261939-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 2 workers [8 GB, 2 CPUs], 5x fcp 120 GB, direct attached OSA, age: 14d

master 2: configured 4 CPUs
OCP dashboard: CPU consumption ~5.034  of 4. -1,043.793m available  

top - 08:56:39 up 14 days, 17:38,  1 user,  load average: 22.24, 26.02, 25.87
Tasks: 332 total,   1 running, 329 sleeping,   0 stopped,   2 zombie
%Cpu(s): 19.7 us,  7.5 sy,  0.0 ni, 41.4 id,  0.5 wa,  1.2 hi,  1.3 si, 28.3 st
MiB Mem :  16097.1 total,    140.4 free,  11375.0 used,   4581.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9145.1 avail Mem

hyptop:
system   #cpu    cpu    Cpu+   online memuse memmax wcur
(str)     (#)    (%)    (hm)    (dhm)  (GiB)  (GiB)  (#)
S8343025    4 200.61 1164:01 65:00:21  15.77  16.00  100
=:V:N       4 200.61 1164:01 65:00:21  15.77  16.00  100


worker 1: configured 2 CPUs
OCP dashboard: CPU consumption ~3.2 of 2. -1,113.494m available
top - 12:13:42 up 14 days, 20:46,  1 user,  load average: 31.80, 27.86, 29.22
Tasks: 833 total,   7 running, 279 sleeping,   0 stopped, 547 zombie
%Cpu(s): 26.0 us, 12.4 sy,  0.0 ni,  0.6 id,  0.3 wa,  0.5 hi,  1.3 si, 58.9 st
MiB Mem :   8022.1 total,     57.8 free,   5968.1 used,   1996.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1915.7 avail Mem

system   #cpu    cpu   Cpu+   online memuse memmax wcur
(str)     (#)    (%)   (hm)    (dhm)  (GiB)  (GiB)  (#)
S8343027    2 158.48 578:13 65:03:38   7.68   7.99  100
=:V:N       2 158.48 578:13 65:03:38   7.68   7.99  100

Comment 1 wvoesch 2020-09-14 13:55:10 UTC
*** Bug 1878164 has been marked as a duplicate of this bug. ***

Comment 2 Dan Li 2020-09-16 20:25:51 UTC
Changing reported Version from 4.6.z to 4.6

Comment 3 Junqi Zhao 2020-09-21 02:58:51 UTC
Created attachment 1715485 [details]
does not find the issue on AWS

Checked on an AWS cluster and did not see the issue; maybe it is related to s390x.

Comment 4 Hendrik Brueckner 2020-09-21 08:21:45 UTC
Prarit,

one of my team members sees similar issues on a plain RHEL 8.2 installation. Do you know if there are any steal-time related issues reported / corrected in the RHEL 8.2 kernel?

Comment 5 Hendrik Brueckner 2020-09-24 10:06:55 UTC
To check for potential accounting issues, please provide hypervisor-level information, e.g., with hyptop or perfkit, and the Linux in-band information for comparison.

Comment 6 wvoesch 2020-09-24 14:42:44 UTC
Hi Hendrik, please see the information in the initial comment. 

Pawel, can you please tell me the formula by which the CPU consumption is calculated?

Comment 7 Pawel Krupa 2020-10-01 07:09:13 UTC
The PromQL query for CPU usage used in the console is the following: `instance:node_cpu:rate:sum{}`, which under the hood translates to `sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance)`.

Please check whether `node_cpu_seconds_total{mode!="idle",mode!="iowait"}` is returning negative values; it shouldn't.

Note: comparing `top`, `hyptop`, and `htop` values to the ones presented by the console dashboard is not meaningful. The CLI commands show only instantaneous values and do not perform the lookback that the console does.
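
If it helps with debugging, the following queries can be pasted into the console's Metrics page (or the Prometheus UI) to see how much of the reported usage is steal time; this is only a rough check using the node-exporter metric names above.

Per-node usage as computed by the dashboard:
sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) by (instance)

The steal-time component of that usage:
sum(rate(node_cpu_seconds_total{mode="steal"}[3m])) by (instance)

If the second query accounts for the amount by which the dashboard value exceeds the node's CPU count, steal-time accounting is the likely culprit.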

Comment 10 alisha 2020-10-05 11:22:54 UTC
Created attachment 1718968 [details]
master-0 node info.

I installed OCP 4.6 on ppc64le (little endian). The issue was not seen.
Build used : 4.6.0-0.nightly-ppc64le-2020-10-02-231830

# oc adm top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master-0   1358m        18%    8798Mi          27%
master-1   1265m        16%    13261Mi         42%
master-2   1320m        17%    11436Mi         36%
worker-0   1196m        7%     12269Mi         80%
worker-1   1524m        9%     7122Mi          46%

Comment 12 Pawel Krupa 2020-11-05 15:01:23 UTC
Tried reproducing it on OpenShift 4.6.1 on "IBM/S390" and the issue does not occur. Closing, as it could not be reproduced by me or by others. Please attach a Prometheus database dump if this issue presents itself in the future.

Comment 13 Prashanth Sundararaman 2020-11-05 15:27:28 UTC
I tested the 4.6.1 image on both a z/VM and a libvirt cluster and I did not see this.

@wvoesch can you test with the 4.6.1 GA release image? https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.6.1/

Comment 14 wvoesch 2020-11-13 08:16:58 UTC
Hi all,

I tested with the GAed version 4.6.1 and I could reproduce the issue on two clusters running on z13 and z14 respectively. 
I found that the negative available CPU is caused by the steal time present on a node. 

Steps to reproduce:
1. Schedule a CPU workload on a node.
   - the pod(s) should be in the BestEffort quality of service class
   - the workload should be enough to stress all available CPUs of that node
2. The node should experience steal time.
3. Observe the negative available CPU numbers in the web UI.

Illustrative example:
A worker has two CPUs configured. The workload can use two CPUs. The steal time on the worker is 30% and the idle time is 0%. Prometheus then shows a negative CPU availability of 2*0.3 = 0.6 CPUs.
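
One way to read these numbers (assuming the dashboard counts the steal time on top of the otherwise fully busy CPUs):

reported usage = 2 CPUs busy + 2 * 0.3 CPUs steal = 2.6 CPUs
available      = 2 - 2.6 = -0.6 CPUs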


Comment 15 wvoesch 2020-11-13 08:19:43 UTC
@Pawel Krupa 

If you still need the Prometheus database dump, I'm happy to help. In that case, can you please point me to a description of how to get the dump? Thank you.

Comment 16 Pawel Krupa 2020-11-16 14:30:08 UTC
@wvoesch 

I would be grateful for a Prometheus data dump. We have a procedure for this in https://access.redhat.com/solutions/5426631 or, if that doesn't work, you can try dumping all files from the `/prometheus` directory in the `openshift-monitoring/prometheus-k8s-0` pod (`prometheus` container) and sending me the tarball.

Comment 17 wvoesch 2020-11-19 16:20:24 UTC
@pkrupa

I have gathered the requested data. Due to the size I will send you a download link to your email.

Comment 18 Pawel Krupa 2020-11-20 11:28:04 UTC
Thanks for the Prometheus dump. I already have an upstream PR with a fix waiting for review.
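
For context, the change is essentially to stop counting steal time in the usage expression, roughly along these lines (the exact recording rule may differ slightly):

before: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) by (instance)
after:  sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) by (instance)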

Comment 22 Junqi Zhao 2020-12-11 02:00:02 UTC
Hi,

The fix is in the 4.7 nightly payload, but QE does not have an s390x platform to verify the fix. Could you help to check?

Comment 23 Dan Li 2020-12-15 16:06:25 UTC
Marking Junqi's comment un-private as Wolfgang cannot see private comments.

Comment 24 hongyan li 2020-12-18 03:37:07 UTC
Created attachment 1740090 [details]
available cpu is negative value, reproduced on AWS

Most of the time, pods use all the CPU of the node, and there is seldom any steal time.

Comment 25 hongyan li 2020-12-18 03:55:28 UTC
On the AWS platform, the CPU consumption is never more than the CPU count of the node; it only reaches the CPU count of the node when the CPU load is very high.
The available CPU can sometimes show a very small negative value. See comment 24.

Comment 26 Junqi Zhao 2020-12-18 04:00:42 UTC
(In reply to hongyan li from comment #24)
> Created attachment 1740090 [details]
> available cpu is negative value, reproduced on AWS
> 
> Most of the time, pods use all the CPU of the node, and there is seldom any steal time.

When the CPU usage suddenly surges, a small negative value is acceptable, and this value is not that large. Compare with the original picture for this bug, where 4 CPUs showed a CPU consumption of 8.37 CPUs with -4,370.686m CPU still available.

Comment 27 hongyan li 2020-12-18 09:18:16 UTC
The tests for comments 24 and 25 were on 4.6. Testing with payload 4.7.0-0.nightly-2020-12-17-201522, I did not see a negative value for available CPU. Since CPU consumption was never seen to exceed the CPU count of the node on either 4.6 or 4.7, this still needs to be tested on the s390x platform.

Comment 29 wvoesch 2020-12-18 15:41:50 UTC
I tested again on s390x under z/VM. OCP version: 4.7.0-0.nightly-s390x-2020-12-15-081322

I have two observations: 

1. I have not seen negative values for the available CPU. This is good.

2. I have observed that while increasing the steal time, the available CPU shown by Prometheus is reduced, which is expected.
   However, Prometheus also increases the CPU consumption, although no additional CPU load has been scheduled.
   It seems as if the CPU usage is calculated like this: CPU usage = CPU count - available CPU
   In order to reflect the correct CPU usage, I think it should either be: CPU usage = CPU count - available CPU - steal time
   or it could be calculated as: CPU usage = sum over all processes

From my perspective it would be best if the steal time could be displayed directly, to remove all the confusion.
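
In the meantime, the steal time can at least be graphed manually in the console's Metrics page, for example (using the node-exporter metric names from earlier in this bug):

sum by (instance) (rate(node_cpu_seconds_total{mode="steal"}[3m]))

or, normalized to the node's CPU count:

sum by (instance) (rate(node_cpu_seconds_total{mode="steal"}[3m]))
  / count by (instance) (node_cpu_seconds_total{mode="idle"})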

The question is, how do we proceed from here:
a. shall we close this bug and open a new one to address my observation 2 of the wrong accounting, or shall we address my observation 2 in this BZ?
b. what is the process of getting the steal time displayed too? Can this work item be tracked in a BZ?

Comment 30 hongyan li 2020-12-21 07:20:10 UTC
@wvoesch 

I totally agree with you about comment 29. I have tried to test the issue on the AWS platform, but saw very little steal time, and I want to know how to increase the steal time.

As we know, CPU steal time is defined in the GNU top command as "time stolen from [a] VM by the hypervisor". I completed step 1 below, but I did not see much steal time on the node, and I am not sure we should: the pod on the node has consumed all the available CPU of the node, so the node would need to steal time from the platform or from other nodes; its own CPU time cannot be stolen because it is already short of CPU time.

1. Schedule a CPU workload on a node.
   - the pod(s) should be in the BestEffort quality of service class
   - the workload should be enough to stress all available CPUs of that node
2. The node should experience steal time.

Please help clarify,

Comment 31 Junqi Zhao 2020-12-21 13:38:59 UTC
(In reply to wvoesch from comment #29)
> 
> The question is, how do we proceed from here:
> a. shall we close this bug and open a new one to address my observation 2 of
> the wrong accounting, or shall we address my observation 2 in this BZ?
I think we can close this bug and file a new one for your observation 2.

> b. what is the process of getting the steal time displayed too? Can this
> work item be tracked in a BZ?
This may be an RFE; I think we could track it via a BZ.

Comment 32 Junqi Zhao 2020-12-21 13:39:45 UTC
set to VERIFIED based on Comment 29

Comment 33 wvoesch 2020-12-22 10:13:14 UTC
@juzhao 
I have opened a BZ to track the issue of accounting steal time as CPU usage. Please see BZ 1910006.


@hongyli
I'm not sure if I understand your question correctly. 
The hypervisor can remove CPU resources from a guest. The steal time is the time stolen from a VM by the hypervisor; that means the VM wants to use the CPU but is not allowed to. The time it has to wait is the steal time. This can of course only happen if the node lives in an overcommitted environment or if the hypervisor is using CPU time on behalf of the Linux guest. Maybe [1] might be helpful?


[1] https://www.vm.ibm.com/perf/tips/prgcom.html see solution to problem: “I see a non-trivial number in my Linux Reports for %Steal. Is this a problem?”

Comment 34 hongyan li 2020-12-23 02:53:39 UTC
@wvoesch 
Thank you very much! Your comments and the description in the new bug answered my question.

Comment 37 errata-xmlrpc 2021-02-24 15:17:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

