Created attachment 1714793 [details]
screenshot of a higher CPU consumption than the CPU count of that node

Description of problem:
On several clusters I have observed that the OCP dashboard shows a higher CPU consumption than
a) the actual consumption
b) what is actually configured for a particular node.

For example, a master node with 4 CPUs shows a CPU consumption of 8.37 CPUs while -4,370.686m CPU is still available. Please see the screenshot.

I observed this on three different clusters. The clusters include nodes with ECKD and zFCP disks, direct-attached OSA and HiperSockets. Typically the clusters have three masters and 2 workers with 2 CPUs and 8 GB each, but one cluster has 3 workers with 10 CPUs and 32 GB each.

Because of the size, I just added some basic information below. Please let me know what information you need for further debugging and I shall provide it happily. Thank you.

Additional info:
###############################################################################################################################################
Cluster identifier: s8343008, Version: 4.6.0-0.nightly-s390x-2020-09-05-222506
RHCOS: 46.82.202009042339-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 2 workers [8 GB, 2 CPUs], 4x fcp 120 GB, 1x 50 GB dasd, direct-attached OSA, age: 3d21h

master 3: configured 4 CPUs
OCP dashboard: CPU consumption ~6.09 of 4, -2,052.037m available

on the node:
top - 08:43:24 up 3 days, 22:14,  1 user,  load average: 47.41, 40.55, 36.88
Tasks: 322 total,   5 running, 314 sleeping,   0 stopped,   3 zombie
%Cpu(s): 27.6 us,  9.7 sy,  0.0 ni, 14.9 id,  0.1 wa,  1.6 hi,  1.7 si, 44.3 st
MiB Mem :  16097.1 total,    101.4 free,   9331.9 used,   6663.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9982.2 avail Mem

Note the 3 zombies; on the other masters there are 3 and 2 respectively.

# sudo hyptop
system    #cpu     cpu     Cpu+    online  memuse  memmax  wcur
(str)      (#)     (%)     (hm)     (dhm)   (GiB)   (GiB)   (#)
S8343012     4  272.16   485:02  21:18:53   15.76   16.00   100
=:V:N        4  272.16   485:02  21:18:53   15.76   16.00   100
###############################################################################################################################################
Cluster identifier: s8343001, Version: 4.6.0-0.nightly-s390x-2020-09-07-065507
RHCOS: 46.82.202009042339-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 3 workers [32 GB, 10 CPUs], 6x fcp 120 GB, HiperSockets, age: 3d23h

master 1: configured 4 CPUs
OCP dashboard: CPU consumption ~4.95 of 4, -947.756m available

top - 12:52:43 up 3 days,  5:53,  1 user,  load average: 30.24, 25.30, 24.83
Tasks: 308 total,   4 running, 303 sleeping,   0 stopped,   1 zombie
%Cpu(s): 23.0 us,  7.5 sy,  0.0 ni,  7.7 id,  0.1 wa,  1.0 hi,  1.4 si, 59.2 st
MiB Mem :  15962.1 total,    711.8 free,   8834.4 used,   6415.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  11069.8 avail Mem

hyptop:
system    #cpu     cpu     Cpu+    online  memuse  memmax  wcur
(str)      (#)     (%)     (hm)     (dhm)   (GiB)   (GiB)   (#)
S8343003     4  268.16   494:21  52:21:10   15.82   15.87   100
=:V:N        4  268.16   494:21  52:21:10   15.82   15.87   100
###############################################################################################################################################
Cluster identifier: s8343022, Version: 4.6.0-0.nightly-s390x-2020-08-27-080214
RHCOS: 46.82.202008261939-0
Environment: Z13, 3 masters [16 GB, 4 CPUs], 2 workers [8 GB, 2 CPUs], 5x fcp 120 GB, direct-attached OSA, age: 14d

master 2: configured 4 CPUs
OCP dashboard: CPU consumption ~5.034 of 4, -1,043.793m available

top - 08:56:39 up 14 days, 17:38,  1 user,  load average: 22.24, 26.02, 25.87
Tasks: 332 total,   1 running, 329 sleeping,   0 stopped,   2 zombie
%Cpu(s): 19.7 us,  7.5 sy,  0.0 ni, 41.4 id,  0.5 wa,  1.2 hi,  1.3 si, 28.3 st
MiB Mem :  16097.1 total,    140.4 free,  11375.0 used,   4581.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9145.1 avail Mem

hyptop:
system    #cpu     cpu     Cpu+    online  memuse  memmax  wcur
(str)      (#)     (%)     (hm)     (dhm)   (GiB)   (GiB)   (#)
S8343025     4  200.61  1164:01  65:00:21   15.77   16.00   100
=:V:N        4  200.61  1164:01  65:00:21   15.77   16.00   100

worker 1: configured 2 CPUs
OCP dashboard: CPU consumption ~3.2 of 2, -1,113.494m available

top - 12:13:42 up 14 days, 20:46,  1 user,  load average: 31.80, 27.86, 29.22
Tasks: 833 total,   7 running, 279 sleeping,   0 stopped, 547 zombie
%Cpu(s): 26.0 us, 12.4 sy,  0.0 ni,  0.6 id,  0.3 wa,  0.5 hi,  1.3 si, 58.9 st
MiB Mem :   8022.1 total,     57.8 free,   5968.1 used,   1996.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1915.7 avail Mem

hyptop:
system    #cpu     cpu     Cpu+    online  memuse  memmax  wcur
(str)      (#)     (%)     (hm)     (dhm)   (GiB)   (GiB)   (#)
S8343027     2  158.48   578:13  65:03:38    7.68    7.99   100
=:V:N        2  158.48   578:13  65:03:38    7.68    7.99   100
*** Bug 1878164 has been marked as a duplicate of this bug. ***
Changing reported Version from 4.6.z to 4.6
Created attachment 1715485 [details]
does not find the issue on AWS

Checked on an AWS cluster, no such issue; maybe it is related to s390x.
Prarit, one of my team members sees similar issues on a plain RHEL 8.2 installation. Do you know if there are any steal-time related issues reported / corrected in the RHEL 8.2 kernel?
To check for potential accounting issues, please provide hypervisor-level information, e.g., from hyptop or perfkit, and the Linux in-band information for comparison.
Hi Hendrik, please see the information in the initial comment.

Pawel, can you please tell me the formula used to calculate the CPU consumption?
The promQL query for CPU usage used in the console is the following: `instance:node_cpu:rate:sum{}`, which under the hood translates to `sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance)`. Please check if `node_cpu_seconds_total{mode!="idle",mode!="iowait"}` is returning negative values; it shouldn't.

Note: comparing `top`, `hyptop`, and `htop` values to the ones presented by the console dashboard is incorrect. CLI commands show only instant values and do not perform any lookback the way the console does.
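For reference, a quick way to inspect this per node (a sketch only; replace <node> with the instance label of the affected node):

  sum by (mode) (rate(node_cpu_seconds_total{instance="<node>"}[3m]))

shows the per-mode rates (including "steal"), and

  node_cpu_seconds_total{mode!="idle",mode!="iowait",instance="<node>"}

shows the raw counters. Counters should only ever increase, so a rate over them should never be negative.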
Created attachment 1718968 [details]
master-0 node info

I installed OCP 4.6 on ppc64le (little endian). The issue was not seen.

Build used: 4.6.0-0.nightly-ppc64le-2020-10-02-231830

# oc adm top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master-0   1358m        18%    8798Mi          27%
master-1   1265m        16%    13261Mi         42%
master-2   1320m        17%    11436Mi         36%
worker-0   1196m        7%     12269Mi         80%
worker-1   1524m        9%     7122Mi          46%
Tried reproducing it on OpenShift 4.6.1 on "IBM/S390" and the issue doesn't occur. Closing, as it couldn't be reproduced by me or others. Please attach a prometheus database dump if this issue presents itself in the future.
I tested the 4.6.1 image on both a z/VM and a libvirt cluster and I did not see this. @wvoesch, can you test with the 4.6.1 GA release image? https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.6.1/
Hi all, I tested with the GAed version 4.6.1 and I could reproduce the issue on two clusters, running on a z13 and a z14 respectively. I found that the negative available CPU is caused by steal time present on a node.

Steps to reproduce:
1. Schedule a CPU workload on a node.
   - the pod(s) should be in the quality of service class BestEffort
   - the workload should be enough to fully stress the available CPUs of that node
2. The node should experience steal time.
3. Observe the negative numbers for available CPU in the web UI.

Illustrative example: a worker has two CPUs configured. The workload can use both CPUs. The steal time on the worker is 30%, the idle time is 0%. Prometheus then shows a negative CPU availability of 2*0.3 = 0.6 CPUs, i.e., -0.6 CPUs available.
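Spelling out the illustrative example with the query Pawel quoted earlier (assuming the dashboard derives "available" roughly as the node's CPU capacity minus that usage value; the exact console formula may differ):

  usage     = sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m]))
            = 2.0 CPUs accounted as busy + 2 * 0.30 CPUs steal = 2.6 CPUs
  available = 2 CPUs capacity - 2.6 CPUs usage = -0.6 CPUs

Because only idle and iowait are excluded, the steal time is counted as usage, which pushes the usage above the number of configured CPUs and the available value below zero.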
@Pawel Krupa If you still need the prometheus database dump, I'm happy to help. In that case, can you please point me to a description of how to get the dump? Thank you.
@wvoesch I would be grateful for a prometheus data dump. We have a procedure for this in https://access.redhat.com/solutions/5426631, or if that doesn't work, you can try dumping all files from the `/prometheus` directory in the `openshift-monitoring/prometheus-k8s-0` pod (`prometheus` container) and sending me the tarball.
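If the KCS procedure doesn't work, something along these lines should produce a usable archive (a sketch only; it assumes tar is available in the prometheus container, otherwise the directory can be copied with oc cp instead):

  oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- tar cf - /prometheus | gzip > prometheus-dump.tar.gz
  # or
  oc cp openshift-monitoring/prometheus-k8s-0:/prometheus ./prometheus-dump -c prometheus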
@pkrupa I have gathered the requested data. Due to its size, I will send you a download link by email.
Thanks for the prometheus dump. I already have an upstream PR with a fix waiting for review.
Hi, the fix is in the 4.7 nightly payload, but QE doesn't have an s390x platform to verify the fix. Could you help check?
Marking Junqi's comment un-private as Wolfgang cannot see private comments.
Created attachment 1740090 [details]
available cpu is negative value, reproduced on AWS

Most of the time, pods use all the CPU of the node; there is seldom any steal time.
On the AWS platform, the CPU consumption is never more than the CPU count of the node; it only reaches the CPU count of the node when the CPU load is very high. Sometimes the available CPU shows a very small negative value. See comment 24.
(In reply to hongyan li from comment #24)
> Created attachment 1740090 [details]
> available cpu is negative value, reproduced on AWS
>
> Most time, pos use all the CPU of the node, seldom steal time.

When the CPU usage suddenly surges, a small negative value is possible, and the value seen on AWS is not large. Compare with the original screenshot for this bug, where a node with 4 CPUs shows a CPU consumption of 8.37 CPUs and -4,370.686m CPU still available.
The tests in comments 24 and 25 were on 4.6. Tested with payload 4.7.0-0.nightly-2020-12-17-201522 and did not see a negative value for available CPU. Since I never see the CPU consumption exceed the CPU count of the node on either 4.6 or 4.7, this still needs to be tested on the s390x platform.
I tested again on s390x under z/VM. OCP version: 4.7.0-0.nightly-s390x-2020-12-15-081322

I have two observations:

1. I have not seen negative values for the available CPU. This is good.

2. I have observed that while increasing the steal time, the available CPU shown by Prometheus is reduced, which is expected. However, Prometheus also increases the CPU consumption, although no additional CPU load has been scheduled. It seems as if the CPU usage is calculated like this:

   CPU usage = CPU count - available CPU

In order to reflect the correct CPU usage, I think it should either be

   CPU usage = CPU count - available CPU - steal time

or it could be calculated as

   CPU usage = sum over all processes

From my perspective it would be best if the steal time could be displayed directly, to remove all the confusion. See the query sketch at the end of this comment.

The question is, how do we proceed from here:
a. Shall we close this bug and open a new one to address my observation 2 of the wrong accounting, or shall we address observation 2 in this BZ?
b. What is the process of getting the steal time displayed too? Can this work item be tracked in a BZ?
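For illustration only (built on the query Pawel quoted earlier, not a statement about what the console actually ships): excluding steal time from the usage calculation would correspond to a query like

  sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance)

and the steal time itself could be displayed with

  sum(rate(node_cpu_seconds_total{mode="steal"}[3m])) BY (instance)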
@wvoesch I totally agree with you about comment 29.

I have tried to test the issue on the AWS platform, where I see very little steal time, and I want to know how to increase the steal time. As we know, CPU steal time is defined in the GNU top command as "time stolen from [a] VM by the hypervisor". I completed step 1 below but didn't see much steal time on the node, and I don't think we should: the pod on the node has consumed all the available CPU of the node, so the node would need to steal time from the platform or other nodes; its own CPU time can't be stolen because it is already short of CPU time.

1. Schedule a CPU workload on a node.
   - the pod(s) should be in the quality of service class BestEffort
   - the workload should be enough to fully stress the available CPUs of that node
2. The node should experience steal time.

Please help clarify.
(In reply to wvoesch from comment #29)
> The question is, how do we proceed from here:
> a. Shall we close this bug and open a new one to address my observation 2 of
> the wrong accounting, or shall we address observation 2 in this BZ?

I think we can close this bug and file a new one for your observation 2.

> b. What is the process of getting the steal time displayed too? Can this
> work item be tracked in a BZ?

This may be an RFE; I think we could track it via a BZ.
set to VERIFIED based on Comment 29
@juzhao I have opened a BZ to track the issue around accounting steal time as CPU usage. Please see BZ 1910006.

@hongyli I'm not sure if I understand your question correctly. The hypervisor can remove CPU resources from a guest. The steal time is the time stolen from a VM by the hypervisor, i.e., the VM wants to use the CPU but is not allowed to; the time it has to wait is the steal time. This can of course only happen if the node lives in an overcommitted environment or if the hypervisor is using CPU time on behalf of the Linux guest. Maybe [1] might be helpful?

[1] https://www.vm.ibm.com/perf/tips/prgcom.html - see the solution to the problem: "I see a non-trivial number in my Linux Reports for %Steal. Is this a problem?"
@wvoesch Thank you very much! Your comments and the description in the new bug answered my question.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633