Bug 1876937
| Summary: | Wrong cpu.current.guest statistics values returned by GET https://{host}/ovirt-engine/api/vms/{vm_id}/statistics | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Polina <pagranat> |
| Component: | BLL.Virt | Assignee: | Liran Rotenberg <lrotenbe> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Polina <pagranat> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4.3.1 | CC: | ahadas, aoconnor, bugs, bugzilla-qe-rhv, emarcus, lcapitulino, lrotenbe, michal.skrivanek, mtessun, mzamazal, nilal, pauld, pbonzini, phrdina, virt-maint, vkuznets, ymankad |
| Target Milestone: | ovirt-4.4.4 | Keywords: | Automation, Regression, TestOnly |
| Target Release: | --- | Flags: | pm-rhel: ovirt-4.4+, aoconnor: blocker-, pm-rhel: devel_ack+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.4.4.1 | Doc Type: | Bug Fix |
| Doc Text: | Previously, the reported CPU statistic values were incorrect. Now the correct values are shown. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-12 16:23:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1897716 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description (Polina, 2020-09-08 14:43:45 UTC)
Created attachment 1714294 [details]
logs including libvirtd.log
Attaching logs again, this time including the libvirtd log, for the same scenario:
1. Configure VM1 and VM2 pinned to host, Low CPU Shares 512, CPU Pinning topology 0#0 (in Edit VM/Resource Allocation tab).
2. Run both VMs.
3. Load the VMs' CPUs to 100%.
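The report does not say which tool was used to generate the load in step 3; a simple busy-loop is one way to do it (the `timeout` here only bounds the run for demonstration — for a sustained load test, drop it and kill the loops when done):

```shell
# Busy-loop to pin one vCPU at 100%; start one instance per vCPU.
timeout 2 sh -c 'while :; do :; done'
echo "load finished with status $?"   # 124 means the timeout expired, as expected
```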
top on guest:
top - 17:22:48 up 11 min, 1 user, load average: 2.13, 1.31, 0.70
Tasks: 130 total, 4 running, 126 sleeping, 0 stopped, 0 zombie
%Cpu(s): 48.5 us, 0.0 sy, 1.0 ni, 0.0 id, 0.0 wa, 0.7 hi, 0.0 si, 49.8 st
MiB Mem : 751.1 total, 360.1 free, 173.5 used, 217.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 452.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1180 root 20 0 235944 1452 0 R 48.5 0.2 3:31.53 bash
1182 root 39 19 583036 58268 28392 R 0.7 7.6 0:00.64 dnf
versions:
qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64
on host
[root@lynx22 qemu]# vdsm-client Host getAllVmStats | grep cpu
"cpuSys": "49.86",
"cpuUsage": "456900000000",
"cpuUser": "0.05",
"vcpuCount": "1",
"vcpuPeriod": 100000,
"vcpuQuota": "-1",
"cpuSys": "49.73",
"cpuUsage": "462190000000",
"cpuUser": "0.17",
"vcpuCount": "1",
"vcpuPeriod": 100000,
"vcpuQuota": "-1",
Agent version: qemu-guest-agent-2.12.0-99.module+el8.2.0+6870+55b789b4.2.x86_64.
To compare with a working version, the cpu.current.guest statistic is correct in the previous build:
qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
vdsm-4.40.26-1.el8ev.x86_64
ovirt-engine-4.4.2.3-0.6.el8ev.noarch
qemu-guest-agent-2.12.0-99.module+el8.2.0+6870+55b789b4.2.x86_64

The CPU load reported in the guest is user time, while it is reported to VDSM as system time. We looked at both the 4.4.2 and 4.4.3 versions; the guest is the same template, using the same qemu-agent. No code changes were made in those areas, in either the engine or VDSM. I'm switching the bug to libvirt for the time being to take a look at that.

Hi, please provide the libvirt API that is used to get the CPU load statistics so we can investigate whether the issue is in libvirt.

# virsh -r domstats <domain>

Relevant values are:
cpu.user
cpu.system

Thanks for the info. Libvirt gets these two values from cgroups, but they don't represent CPU load.
From virsh manpage:
· cpu.user - user cpu time spent in nanoseconds
· cpu.system - system cpu time spent in nanoseconds
And documentation from cgroups v1:
user: Time spent by tasks of the cgroup in user mode.
system: Time spent by tasks of the cgroup in kernel mode.
These values report how much CPU time the whole VM has used, and they only increase over the VM's lifetime.
I just tried running this using the libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64 version and it works as expected:
# virsh domstats test --cpu-total && cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dtest.scope/cpuacct.stat
Domain: 'test'
cpu.time=2250174409
cpu.user=240000000
cpu.system=650000000
cpu.cache.monitor.count=0
user 24
system 65
As you can see, libvirt reports the same values (in different units).
Running the same command using RHEL-8.2.1 with libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64 gives the same result:
# virsh domstats test --cpu-total && cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dtest.scope/cpuacct.stat
Domain: 'test'
cpu.time=627401622
cpu.user=420000000
cpu.system=100000000
cpu.cache.monitor.count=0
user 42
system 10
which corresponds to what we can see directly in cgroups.
I don't think there is a bug in libvirt.
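The two outputs above differ only in units: cpuacct.stat reports ticks of USER_HZ, while domstats reports nanoseconds. A quick sketch of the conversion (USER_HZ=100 is an assumption about the host, though it is the usual value on Linux):

```python
# cpuacct.stat counts ticks of USER_HZ (assumed 100/s here);
# libvirt's cpu.user/cpu.system are cumulative nanoseconds.
USER_HZ = 100
NS_PER_TICK = 1_000_000_000 // USER_HZ  # 10 ms per tick

def ticks_to_ns(ticks):
    """Convert cpuacct.stat ticks to libvirt-style nanoseconds."""
    return ticks * NS_PER_TICK

# Cross-check against the RHEL-8.2.1 output above:
print(ticks_to_ns(42))  # 420000000 == cpu.user
print(ticks_to_ns(10))  # 100000000 == cpu.system
```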
I used the same guest in both builds, 4.4.2 and 4.4.3. Here is some info I gathered reproducing the same scenario in 4.4.3 and 4.4.2:

###### 4.4.3 #########

on guest:

[root@vm-30-51 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 17894
system 799

on host:

cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 109
system 26370

virsh -r domstats golden_env_mixed_virtio_0|grep cpu
cpu.time=314820356830
cpu.user=1120000000
cpu.system=313730000000
cpu.cache.monitor.count=0
vcpu.current=1
vcpu.maximum=16
vcpu.0.state=1
vcpu.0.time=312860000000
vcpu.0.wait=0

REST API:

GET https://{{host}}/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5/statistics

<statistic href="/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
<name>cpu.current.guest</name>
<description>CPU used by guest</description>
<kind>gauge</kind>
<type>decimal</type>
<unit>percent</unit>
<values>
<value>
<datum>0.13</datum>
</value>
</values>
<vm href="/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5" id="11aa9a07-65e7-4f38-adb5-b27b141406a5"/>
</statistic>

####### 4.4.2 #######

on guest:

[root@vm-30-102 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 10188
system 637

on host:

cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d33\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 118
system 243

[root@ocelot05 ~]# virsh -r domstats golden_env_mixed_virtio_0|grep cpu
cpu.time=327226791989
cpu.user=1230000000
cpu.system=2540000000
cpu.cache.monitor.count=0
vcpu.current=1
vcpu.maximum=16
vcpu.0.state=1
vcpu.0.time=325660000000
vcpu.0.wait=0

REST API:

GET https://{{host}}/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e/statistics

<statistic href="/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
<name>cpu.current.guest</name>
<description>CPU used by guest</description>
<kind>gauge</kind>
<type>decimal</type>
<unit>percent</unit>
<values>
<value>
<datum>50</datum>
</value>
</values>
<vm href="/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e" id="c62b806f-4c17-496b-9b1c-244e9fbfb16e"/>
</statistic>

(In reply to Pavel Hrdina from comment #8)
> I don't think there is a bug in libvirt.

Hi Pavel, thanks for the detailed information. We compute the CPU usage in VDSM based on these stats:
https://github.com/oVirt/vdsm/blob/v4.40.30/lib/vdsm/virt/vmstats.py#L139-L149
and yeah, it doesn't look like a libvirt bug.

(In reply to Polina from comment #9)
> ###### 4.4.3 #########
> on guest:
> [root@vm-30-51 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
> user 17894
> system 799
> on host:
> cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
> user 109
> system 26370
> ####### 4.4.2 #######
> on guest:
> [root@vm-30-102 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
> user 10188
> system 637
> on host:
> cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d33\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
> user 118
> system 243

OK, so the stats from within the guest look similar, and the parts that I dropped from comment 9 show that libvirt is consistent with the cgroup stats. It seems that the difference between those stats comes down to the system CPU time on the hosts (26370 vs. 243), which can explain the difference we see at the API level (the bigger the host overhead, the lower the reported guest CPU time).

I wouldn't pay much attention to the CPU usage that is presented in the UI, because it is computed differently there.

The 4.4.3 hosts run RHEL 8.3 while the 4.4.2 ones run RHEL 8.2, right? Can we get access to two hosts from the different versions, to compare what it is that the 4.4.3 hosts run that takes so much extra CPU time?
Yash, we need the qemu guys to take a look at this and advise why we get different stats with RHEL 8.3 compared to RHEL 8.2. Could you suggest whom we can approach?

Hi Arik, I assume you are talking about the cgroups cpuacct stats:
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
They are cgroups/kernel and have no direct relation to qemu; the accounting is done in the kernel. cpuacct is documented upstream here:
https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt
Important part:
"cpuacct.stat file lists a few statistics which further divide the CPU time obtained by the cgroup into user and system times. Currently the following statistics are supported:
user: Time spent by tasks of the cgroup in user mode.
system: Time spent by tasks of the cgroup in kernel mode.
user and system are in USER_HZ unit.
cpuacct controller uses percpu_counter interface to collect user and system times. This has two side effects:
- It is theoretically possible to see wrong values for user and system times. This is because percpu_counter_read() on 32bit systems isn't safe against concurrent writes.
- It is possible to see slightly outdated values for user and system times due to the batch processing nature of percpu_counter."

Thanks Martin. Yeah, those are the stats that I referred to. For some reason, the system_time that we see there is consistently and significantly higher on RHEL 8.3 compared to RHEL 8.2, and that affects our computation of the guest CPU stats.

We compute the guest_time as described in the first answer in [1]:

guest_time = cpu_time - system_time - user_time

In this bug it results in lower guest CPU usage than we previously reported; we also have a different bug on reporting negative CPU usage. Maybe this change in behavior is unrelated to qemu and comes from cgroups/kernel, but we still need someone to tell us if and how we need to adjust to this change in order to compute the guest CPU usage correctly.

[1] https://stackoverflow.com/questions/40468370/what-does-cpu-time-represent-exactly-in-libvirt

The other bug I mentioned above is bz 1878803.

I did some additional testing to figure out which downstream kernel introduced this change.
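The guest-time subtraction described above can be sketched as follows (a hypothetical helper for illustration, not VDSM's actual code): the cumulative nanosecond counters are sampled twice, and the delta is normalized by the sampling interval and vCPU count. An inflated system_time directly shrinks the reported percentage, which is exactly the symptom in this bug.

```python
def guest_cpu_percent(prev, cur, interval_s, vcpu_count):
    """Estimate guest CPU usage (%) between two domstats samples.

    Each sample holds the cumulative nanosecond counters cpu.time,
    cpu.user and cpu.system, as reported by `virsh domstats`.
    """
    def guest_ns(sample):
        # guest_time = cpu_time - system_time - user_time
        return sample["cpu.time"] - sample["cpu.system"] - sample["cpu.user"]

    delta = guest_ns(cur) - guest_ns(prev)
    return 100.0 * delta / (interval_s * 1e9 * vcpu_count)

# A 1-vCPU guest fully busy over a 10-second sampling window:
prev = {"cpu.time": 0, "cpu.user": 0, "cpu.system": 0}
cur = {"cpu.time": 11_000_000_000, "cpu.user": 400_000_000,
       "cpu.system": 600_000_000}
print(guest_cpu_percent(prev, cur, 10, 1))  # 100.0
```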
My testing environment was RHEL-8.3 where I was starting a simple VM based on rhel-guest-image-8.2-290.x86_64.qcow2 and the only thing I changed between every test run was host kernel. By bisecting I managed to find the "broken" kernel:
kernel-4.18.0-233.el8.x86_64
cat /sys/fs/cgroup/cpuacct/machine.slice/machine-qemu\\x2d1\\x2drhel.scope/cpuacct.stat
user 199
system 1668
Using RHEL-8.3 with this kernel gave similar results as RHEL-8.2:
kernel-4.18.0-232.el8.x86_64
cat /sys/fs/cgroup/cpuacct/machine.slice/machine-qemu\\x2d1\\x2drhel.scope/cpuacct.stat
user 201
system 377
So my guess is that some patches backported to kernel-4.18.0-233.el8.x86_64 changed the way how system time is computed.
The question now is whether the reported system CPU stat is correct, so that oVirt has to change the way it calculates CPU usage, or whether the change in the kernel is incorrect.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

(In reply to Pavel Hrdina from comment #17)
> So my guess is that some patches backported to kernel-4.18.0-233.el8.x86_64
> changed the way how system time is computed.

Bug 1833096 looks like the relevant change, with a series of patches about time accounting. Phil, I see you added yourself to this bug so you're aware. Any idea?

Tested with baseurl=http://brew-task-repos.usersys.redhat.com/repos/scratch/pauld/kernel/4.18.0/232.el8.bz1833096_test1/$basearch/

The behavior is still incorrect:

<statistic href="/ovirt-engine/api/vms/5271a6e3-73d7-4358-b5d7-84b31af24a9f/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
<name>cpu.current.guest</name>
<description>CPU used by guest</description>
<kind>gauge</kind>
<type>decimal</type>
<unit>percent</unit>
<values>
<value>
<datum>0.27</datum>
</value>
</values>
<vm href="/ovirt-engine/api/vms/5271a6e3-73d7-4358-b5d7-84b31af24a9f" id="5271a6e3-73d7-4358-b5d7-84b31af24a9f"/>
</statistic>

while the expected value is <datum>100</datum>.

The numbers are similar to the previous test:

on guest:
cat /sys/fs/cgroup/cpu/cpuacct.stat
user 95280
system 36696

on host:
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d2\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 229
system 141009

virsh -r domstats golden_env_mixed_virtio_0|grep cpu
cpu.time=1533392880883
cpu.user=2370000000
cpu.system=1527770000000
cpu.cache.monitor.count=0
vcpu.current=1
vcpu.maximum=16
vcpu.0.state=1
vcpu.0.time=1524320000000
vcpu.0.wait=0

Hi, I've tested with this last kernel version and got the correct behavior.
for 100% cpu loaded VM I got the expected statistic
<statistic href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
<name>cpu.current.guest</name>
<description>CPU used by guest</description>
<kind>gauge</kind>
<type>decimal</type>
<unit>percent</unit>
<values>
<value>
<datum>100</datum>
</value>
</values>
<vm href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe" id="e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe"/>
</statistic>
on guest:
cat /sys/fs/cgroup/cpu/cpuacct.stat
user 18237
system 6679
on host :
[root@puma42 yum.repos.d]# cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 169
system 503
[root@puma42 yum.repos.d]# virsh -r domstats golden_env_mixed_virtio_0|grep cpu
cpu.time=387282366516
cpu.user=1710000000
cpu.system=5100000000
cpu.cache.monitor.count=0
vcpu.current=1
vcpu.maximum=16
vcpu.0.state=1
vcpu.0.time=383910000000
vcpu.0.wait=0
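As a cumulative sanity check on the numbers above (this is the whole-lifetime share, not the interval-based gauge the API computes, but it should be close to 100% for a fully loaded VM), the guest share of total CPU time works out to roughly 98%:

```python
# Figures from the domstats output above (all in nanoseconds)
cpu_time = 387_282_366_516
cpu_user = 1_710_000_000
cpu_system = 5_100_000_000

# guest_time = cpu_time - system_time - user_time
guest_time = cpu_time - cpu_system - cpu_user
share = 100 * guest_time / cpu_time
print(round(share, 1))  # ~98, consistent with the <datum>100</datum> gauge
```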
Also performed the second test from https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c0,
and it also showed the correct value, <datum>50</datum>.
*** Bug 1878803 has been marked as a duplicate of this bug. ***

Yes, I think there is no problem with that.

Did the test with this patch. The values are invalid again:
<statistic href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
<name>cpu.current.guest</name>
<description>CPU used by guest</description>
<kind>gauge</kind>
<type>decimal</type>
<unit>percent</unit>
<values>
<value>
<datum>0.5</datum>
</value>
</values>
<vm href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe" id="e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe"/>
</statistic>
[root@dhcp163-120 ~]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 21937
system 8332
[root@puma42 ~]# cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 188
system 33842
[root@puma42 ~]# virsh -r domstats golden_env_mixed_virtio_0|grep cpu
cpu.time=495255742261
cpu.user=1920000000
cpu.system=497340000000
cpu.cache.monitor.count=0
vcpu.current=1
vcpu.maximum=16
vcpu.0.state=1
vcpu.0.time=490690000000
vcpu.0.wait=0
The test in https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c36 was performed after installing comment #32 & #35:
1. Updated baseurl=http://brew-task-repos.usersys.redhat.com/repos/scratch/pauld/kernel/4.18.0/232.el8.bz1833096_test3/$basearch/ in kernel.repo
2. dnf install kernel-4.18.0-232.el8.bz1833096_test3
3. rpm -iv kernel-5.9.0.ark_latest+-1.x86_64.rpm
   rpm -iv kernel-devel-5.9.0.ark_latest+-1.x86_64.rpm
   rpm -iv kernel-headers-5.9.0.ark_latest+-1.x86_64.rpm
4. Host reboot

Hi, I repeated the test twice:

uname -a
Linux puma43.scl.lab.tlv.redhat.com 4.18.0-232.el8.bz1833096_test3.x86_64 #1 SMP Wed Oct 14 14:29:47 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

uname -a
Linux puma43.scl.lab.tlv.redhat.com 5.9.0.ark_latest+ #1 SMP Wed Oct 14 17:27:46 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@puma43 rpms]# rpm -qa |grep kernel-5.9.0.ark_latest+-1.x86_64
kernel-5.9.0.ark_latest+-1.x86_64
[root@puma43 rpms]# rpm -qa |grep kernel-devel-5.9.0.ark_latest+-1.x86_64
kernel-devel-5.9.0.ark_latest+-1.x86_64
[root@puma43 rpms]# rpm -qa |grep kernel-headers-5.9.0.ark_latest+-1.x86_64
kernel-headers-5.9.0.ark_latest+-1.x86_64

Unfortunately, the results are the same: wrong statistic values in both tests.

Thanks to Nitesh, the host was updated:

uname -a
Linux puma42.scl.lab.tlv.redhat.com 4.18.0-240.el8.bz1876937.x86_64 #1 SMP Tue Oct 20 17:47:57 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

It works correctly; the statistics value for cpu.current.guest is correct. Tested for both scenarios reported in https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c0.

Hi Polina,

The problematic patch has been reverted in the 8.4 dev-test kernel (kernel-4.18.0-240.8.el8.dt2). I think it would be good if you or someone else could also verify this kernel in your environment, to make sure that things are working as expected without any regression.

PS: I do remember that you have tested a build that I provided previously; however, that build was on top of the 8.3 kernel.
Thanks.

The latest available RHV installation is on RHEL 8.3. It looks like we don't have an easy way to test 8.4; it fails while installing the host to the engine.

This bug is in NEW status for ovirt 4.4.4. The bug has not been acknowledged to be a blocker, please re-target.

Verified on ovirt-engine-4.4.4.5-0.10.el8ev.noarch with kernel 4.18.0-240.10.1.el8_3.x86_64.

This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.