Description of problem:
I have observed that when the steal time increases, the available CPU shown by Prometheus decreases, which is expected. However, the CPU consumption reported by Prometheus also increases, although no additional CPU load has been scheduled on that node. It seems as if the CPU usage is calculated like this:

CPU usage = CPU count - available CPU

To reflect the correct CPU usage, I think it should be either:

CPU usage = CPU count - available CPU - steal time

or:

CPU usage = sum over the CPU consumption of all processes

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-s390x-2020-12-15-081322

Steps to reproduce:
1. Monitor the available CPU resources and the CPU usage of a particular node, say node A, in Prometheus.
2. Increase the steal time on that node. Possible ways to achieve this:
   a. Configure CPU overcommitment for node A and another node B. Schedule a CPU-intensive workload (stress-ng) on node B. Due to the CPU overcommitment, node A will experience steal time.
   b. On node A, start an I/O-intensive process, for example copying huge files. This will result in steal time because “z/VM was executing on behalf of the Linux virtual processor” [1].
3. Observe that the steal time will be counted as CPU usage of node A.

Additional Information:
This bug is a follow-up to BZ 1878766; see comments 29 and 31.

[1] https://www.vm.ibm.com/perf/tips/prgcom.html (see the solution to the problem: “I see a non-trivial number in my Linux Reports for %Steal. Is this a problem?”)
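To make the difference between the suspected and the proposed calculation concrete, here is a minimal Python sketch. It assumes that "available CPU" corresponds to the idle rate in cores; the function names and the sample numbers are hypothetical and are not taken from Prometheus or from the node in question.

```python
def usage_including_steal(cpu_count, available_cpu):
    """Suspected calculation: everything that is not available counts as usage."""
    return cpu_count - available_cpu


def usage_excluding_steal(cpu_count, available_cpu, steal):
    """Proposed calculation: steal time is subtracted as well."""
    return cpu_count - available_cpu - steal


# Hypothetical node: 4 CPUs, 2.5 cores available, 1.0 core lost to steal time.
# Only 0.5 cores are actually consumed by workloads on the node.
CPU_COUNT = 4
AVAILABLE = 2.5
STEAL = 1.0

suspected = usage_including_steal(CPU_COUNT, AVAILABLE)
proposed = usage_excluding_steal(CPU_COUNT, AVAILABLE, STEAL)

print(f"suspected usage: {suspected} cores")
print(f"proposed usage:  {proposed} cores")
```

With these made-up numbers the suspected formula attributes the stolen core to the node's own usage, while the proposed formula reports only the work the node actually performed.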
> Observe that the steal time will be counted as CPU usage of node A.

Where do you observe this? Since this change was included in kube-prometheus [1] and propagated to cluster-monitoring-operator [2], we no longer treat steal time as part of CPU usage (a result of [3]). In addition, if you are using the `instance:node_cpu:rate:sum` recording rule, CPU usage is counted as `node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}` averaged over the last 3 minutes.

[1]: https://github.com/prometheus-operator/kube-prometheus/commit/87ddb30a41253dce66bde0006634f30817ccb07a
[2]: https://github.com/openshift/cluster-monitoring-operator/pull/993
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1878766
Hi Pawel, I observe this in the web UI overview for a particular node:
https://console-openshift-console.apps.<cluster-name>.<domain>/k8s/cluster/nodes/<worker>.<cluster-name>.<domain>

The version in which I observed this was: 4.7.0-0.nightly-s390x-2020-12-15-081322
Hi Jayapriya, could you please specify which information you need?
I have no s390x machine, so I can't test this. I tried to test on AWS: I deployed an app that needs 7 CPUs on a node with 4 CPUs. All the pods are running and use up all 4 CPUs, but I didn't see CPU steal time on any of the other nodes. Waiting for wvoesch to verify.
Wolfgang, can you help check this on s390x machines? We don't have the platform, and the issue does not occur on AWS/GCP.
Making Jinqi's request un-private, as Wolfgang is a Partner Engineer and cannot see private comments.
Tested with 4.8.0-0.nightly-2021-06-10-071057; steal time is removed from CPU usage:

- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance)
  record: instance:node_cpu:rate:sum
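What the recording rule computes can be sketched in plain Python. This is a simplified illustration, not Prometheus code: it takes two hypothetical snapshots of `node_cpu_seconds_total` counters, skips the `idle`, `iowait` and `steal` modes, and sums the per-second increase by instance. Unlike the real `rate()` function it does not handle counter resets or extrapolation, and the sample values are made up.

```python
from collections import defaultdict

# Modes excluded by the recording rule's label matchers.
EXCLUDED_MODES = {"idle", "iowait", "steal"}


def node_cpu_rate_sum(first, second, elapsed_seconds):
    """Sum per-second counter increases by instance, skipping excluded modes.

    `first` and `second` map (instance, cpu, mode) to a counter value in seconds.
    """
    totals = defaultdict(float)
    for (instance, cpu, mode), end_value in second.items():
        if mode in EXCLUDED_MODES:
            continue
        start_value = first.get((instance, cpu, mode))
        if start_value is None:
            continue  # series was not present in the first snapshot
        totals[instance] += (end_value - start_value) / elapsed_seconds
    return dict(totals)


# Hypothetical snapshots taken 180 seconds (3 minutes) apart.
before = {
    ("node-a", "0", "user"): 100.0,
    ("node-a", "0", "system"): 50.0,
    ("node-a", "0", "steal"): 10.0,
    ("node-a", "0", "idle"): 1000.0,
}
after = {
    ("node-a", "0", "user"): 190.0,    # +90 s
    ("node-a", "0", "system"): 95.0,   # +45 s
    ("node-a", "0", "steal"): 100.0,   # +90 s, ignored by the rule
    ("node-a", "0", "idle"): 1045.0,   # +45 s, ignored by the rule
}

usage = node_cpu_rate_sum(before, after, elapsed_seconds=180)
print(usage)
```

In this made-up sample the steal counter grows as fast as the user counter, yet it does not contribute to the reported usage of `node-a`, which is the behaviour verified in the comment above.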
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.21 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2762