Created attachment 1714129 [details]
logs

Description of problem:

GET https://{{host}}/ovirt-engine/api/vms/ba351434-6a8a-4327-82f0-379c019a5994/statistics in the latest v4.4.3.1 returns invalid values.

Version-Release number of selected component (if applicable):
vdsm-http-4.40.28-1.el8ev.noarch
python3-libvirt-6.6.0-1.module+el8.3.0+7572+bcbf6b90.x86_64
ovirt-engine-4.4.3.1-0.7.el8ev.noarch

How reproducible: 100%

Steps to Reproduce:
1. Run a VM and load its CPU to 100% (may be done with an endless while loop).
2. Send GET https://{{host}}/ovirt-engine/api/vms/{{vm_id}}/statistics

In the UI the CPU load is shown correctly, but in the REST response the value is <datum>0.090</datum> instead of <datum>100</datum> (it works correctly in 4.4.2):

<statistic href="/ovirt-engine/api/vms/ba351434-6a8a-4327-82f0-379c019a5994/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>0.090</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/ba351434-6a8a-4327-82f0-379c019a5994" id="ba351434-6a8a-4327-82f0-379c019a5994"/>
</statistic>

Expected in this case: <datum>100</datum>

As a result, all the automation tests that check CPU shares fail.

Example of the scenario:
1. Configure VM1 and VM2 pinned to a host, Low CPU Shares 512, CPU Pinning topology 0#0 (in the Edit VM / Resource Allocation tab).
2. Run both VMs.
3. Load both VMs' CPUs to 100%.

In the UI it is shown as expected, 50% for each VM, but the GET statistics request returns <value><datum>0.23</datum></value>.

Expected in this case: <value><datum>50</datum></value>
Created attachment 1714294 [details]
logs including libvirtd.log

Attaching logs again, including the libvirtd log, for the same scenario:
1. Configure VM1 and VM2 pinned to a host, Low CPU Shares 512, CPU Pinning topology 0#0 (in the Edit VM / Resource Allocation tab).
2. Run both VMs.
3. Load both VMs' CPUs to 100%.

top on guest:

top - 17:22:48 up 11 min,  1 user,  load average: 2.13, 1.31, 0.70
Tasks: 130 total,   4 running, 126 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.5 us,  0.0 sy,  1.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.0 si, 49.8 st
MiB Mem :   751.1 total,    360.1 free,    173.5 used,    217.5 buff/cache
MiB Swap:     0.0 total,      0.0 free,      0.0 used.    452.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1180 root      20   0  235944   1452      0 R  48.5   0.2   3:31.53 bash
 1182 root      39  19  583036  58268  28392 R   0.7   7.6   0:00.64 dnf

Versions:
qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64
On the host:

[root@lynx22 qemu]# vdsm-client Host getAllVmStats | grep cpu
        "cpuSys": "49.86",
        "cpuUsage": "456900000000",
        "cpuUser": "0.05",
        "vcpuCount": "1",
        "vcpuPeriod": 100000,
        "vcpuQuota": "-1",
        "cpuSys": "49.73",
        "cpuUsage": "462190000000",
        "cpuUser": "0.17",
        "vcpuCount": "1",
        "vcpuPeriod": 100000,
        "vcpuQuota": "-1",
Agent version: qemu-guest-agent-2.12.0-99.module+el8.2.0+6870+55b789b4.2.x86_64
For comparison with the working version, the cpu.current.guest statistic is correct in the previous build:

qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
vdsm-4.40.26-1.el8ev.x86_64
ovirt-engine-4.4.2.3-0.6.el8ev.noarch
qemu-guest-agent-2.12.0-99.module+el8.2.0+6870+55b789b4.2.x86_64
The CPU load reported inside the guest is user time, while it is reported to VDSM as system time. We looked at both the 4.4.2 and 4.4.3 versions; the guest is the same template, using the same qemu-guest-agent, and no code changes were made in those areas in either the engine or VDSM. I'm switching the bug to libvirt for the time being to take a look at it.
Hi, please provide the libvirt API that is used to get the CPU load statistics so we can investigate whether the issue is in libvirt.
# virsh -r domstats <domain>

Relevant values are:
cpu.user
cpu.system
Thanks for the info. Libvirt gets these two values from cgroups, but they don't represent CPU load.

From the virsh manpage:

  · cpu.user - user cpu time spent in nanoseconds
  · cpu.system - system cpu time spent in nanoseconds

And from the cgroups v1 documentation:

  user: Time spent by tasks of the cgroup in user mode.
  system: Time spent by tasks of the cgroup in kernel mode.

These values report how much CPU time the whole VM used, and they only increase during the VM's lifetime.

I just tried running this with libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64 and it works as expected:

# virsh domstats test --cpu-total && cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dtest.scope/cpuacct.stat
Domain: 'test'
  cpu.time=2250174409
  cpu.user=240000000
  cpu.system=650000000
  cpu.cache.monitor.count=0

user 24
system 65

As you can see, libvirt reports the same values (in different units). Running the same command on RHEL-8.2.1 with libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64 gives the same result:

# virsh domstats test --cpu-total && cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dtest.scope/cpuacct.stat
Domain: 'test'
  cpu.time=627401622
  cpu.user=420000000
  cpu.system=100000000
  cpu.cache.monitor.count=0

user 42
system 10

which corresponds to what we can see directly in cgroups. I don't think there is a bug in libvirt.
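The "different units" note above can be checked with a quick sketch: cgroup v1 cpuacct.stat reports ticks in USER_HZ units, while libvirt reports nanoseconds. This assumes USER_HZ = 100 (10 ms per tick), which should be verified with `getconf CLK_TCK` on the host.

```python
# Sketch: confirm libvirt's cpu.user/cpu.system (nanoseconds) match the
# cgroup v1 cpuacct.stat values (USER_HZ ticks). Assumes USER_HZ = 100,
# i.e. one tick = 10 ms; check `getconf CLK_TCK` on a real host.
USER_HZ = 100
NS_PER_TICK = 1_000_000_000 // USER_HZ  # 10,000,000 ns per tick

def ticks_to_ns(ticks):
    """Convert cpuacct.stat ticks to nanoseconds, for comparison with virsh domstats."""
    return ticks * NS_PER_TICK

# Values from the first domstats/cpuacct.stat pair quoted above:
assert ticks_to_ns(24) == 240_000_000   # user 24   <-> cpu.user=240000000
assert ticks_to_ns(65) == 650_000_000   # system 65 <-> cpu.system=650000000
```

The same conversion holds for the RHEL-8.2.1 sample (user 42 ↔ cpu.user=420000000), so the two interfaces really are reporting one counter in two units.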
I used the same guest in both builds, 4.4.3 and 4.4.2. Here is some info I gathered reproducing the same scenario on 4.4.3 and 4.4.2:

###### 4.4.3 #########

on guest:

[root@vm-30-51 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 17894
system 799

on host:

cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 109
system 26370

virsh -r domstats golden_env_mixed_virtio_0 | grep cpu
  cpu.time=314820356830
  cpu.user=1120000000
  cpu.system=313730000000
  cpu.cache.monitor.count=0
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=312860000000
  vcpu.0.wait=0

REST API:
GET https://{{host}}/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5/statistics

<statistic href="/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>0.13</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/11aa9a07-65e7-4f38-adb5-b27b141406a5" id="11aa9a07-65e7-4f38-adb5-b27b141406a5"/>
</statistic>

####### 4.4.2 #######

on guest:

[root@vm-30-102 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 10188
system 637

on host:

cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d33\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 118
system 243

[root@ocelot05 ~]# virsh -r domstats golden_env_mixed_virtio_0 | grep cpu
  cpu.time=327226791989
  cpu.user=1230000000
  cpu.system=2540000000
  cpu.cache.monitor.count=0
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=325660000000
  vcpu.0.wait=0

REST API:
GET https://{{host}}/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e/statistics

<statistic href="/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>50</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/c62b806f-4c17-496b-9b1c-244e9fbfb16e" id="c62b806f-4c17-496b-9b1c-244e9fbfb16e"/>
</statistic>
(In reply to Pavel Hrdina from comment #8)
> I don't think there is a bug in libvirt.

Hi Pavel, thanks for the detailed information.

We compute the CPU usage in VDSM based on these stats:
https://github.com/oVirt/vdsm/blob/v4.40.30/lib/vdsm/virt/vmstats.py#L139-L149

And yeah, it doesn't look like a libvirt bug.
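For readers following along, here is a simplified sketch of the kind of computation the linked vmstats.py lines perform (names and structure are illustrative, not a verbatim copy of the VDSM code): two domstats samples are taken some interval apart, the host-side user+system deltas are subtracted from the cpu.time delta, and the remainder is scaled to a percentage.

```python
# Illustrative sketch (NOT a copy of vmstats.py) of deriving guest/host CPU
# percentages from two libvirt domstats samples taken interval_s apart.
# Counter keys match the virsh domstats output quoted in this bug.

def cpu_percentages(first, last, interval_s, vcpu_count=1):
    """first/last: dicts with cpu.time/cpu.user/cpu.system in nanoseconds."""
    def delta(key):
        return last[key] - first[key]

    host_ns = delta('cpu.user') + delta('cpu.system')  # host-side accounting
    guest_ns = delta('cpu.time') - host_ns             # what's left for the guest
    to_pct = lambda ns: 100.0 * ns / (interval_s * 1e9 * vcpu_count)
    return to_pct(guest_ns), to_pct(host_ns)

# A fully loaded 1-vCPU guest over a 15 s window, modest host overhead:
first = {'cpu.time': 0, 'cpu.user': 0, 'cpu.system': 0}
last = {'cpu.time': 15_000_000_000, 'cpu.user': 100_000_000,
        'cpu.system': 400_000_000}
guest_pct, host_pct = cpu_percentages(first, last, interval_s=15)
# guest_pct ~ 96.7, host_pct ~ 3.3
```

This makes the failure mode visible: if the kernel inflates cpu.system, host_ns swallows nearly the whole cpu.time delta and guest_pct collapses toward zero, exactly as seen in the REST responses above.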
(In reply to Polina from comment #9)
> ###### 4.4.3 #########
>
> on guest:
> [root@vm-30-51 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
> user 17894
> system 799
>
> on host:
> cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
> user 109
> system 26370
>
> ####### 4.4.2 #######
>
> on guest:
> [root@vm-30-102 tmp]# cat /sys/fs/cgroup/cpu/cpuacct.stat
> user 10188
> system 637
>
> on host:
> cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d33\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
> user 118
> system 243

OK, so the stats from within the guest look similar, and the parts that I dropped from comment 9 show that libvirt is consistent with the cgroup stats.

It seems that the difference between those stats comes down to the system CPU time on the hosts (26370 vs. 243). That can explain the difference we see at the API level: the bigger the host-side overhead is, the lower the computed guest CPU time is.

I wouldn't pay much attention to the CPU usage presented in the UI because it is computed differently there.

The 4.4.3 hosts run RHEL 8.3 while the 4.4.2 ones run RHEL 8.2, right? Can we get access to two hosts of the different versions to find out what the 4.4.3 hosts run that takes so much extra CPU time?
Yash, we need the qemu guys to take a look at this and advise on why we get different stats with RHEL 8.3 compared to RHEL 8.2. Could you suggest whom we can approach?
Hi Arik,

I assume you are talking about the cgroups cpuacct stats:
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d7\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat

They are cgroups/kernel statistics and have no direct relation to qemu; the accounting is done in the kernel. cpuacct is documented upstream here:
https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt

The important part:

"cpuacct.stat file lists a few statistics which further divide the CPU time obtained by the cgroup into user and system times. Currently the following statistics are supported:

user: Time spent by tasks of the cgroup in user mode.
system: Time spent by tasks of the cgroup in kernel mode.

user and system are in USER_HZ unit.

cpuacct controller uses percpu_counter interface to collect user and system times. This has two side effects:

- It is theoretically possible to see wrong values for user and system times. This is because percpu_counter_read() on 32bit systems isn't safe against concurrent writes.
- It is possible to see slightly outdated values for user and system times due to the batch processing nature of percpu_counter."
Thanks Martin. Yeah, those are the stats that I referred to. For some reason the system time we see there is (consistently) significantly higher on RHEL 8.3 than on RHEL 8.2, and that affects our computation of the guest CPU stats.

We compute the guest time as described in the first answer in [1]:

guest_time = cpu_time - system_time - user_time

In this bug it results in lower guest CPU usage than we previously reported, and we have a different bug on reporting negative CPU usage.

Maybe this change in behavior is unrelated to qemu and comes from cgroups/the kernel, but we still need someone to tell us if/how we need to adjust to this change in order to compute the guest CPU usage correctly.

[1] https://stackoverflow.com/questions/40468370/what-does-cpu-time-represent-exactly-in-libvirt
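Plugging the cumulative nanosecond counters quoted in comment 9 into that formula shows the effect directly (a quick arithmetic sketch; the numbers are the virsh domstats values from the two hosts):

```python
# guest_time = cpu_time - system_time - user_time, using the cumulative
# nanosecond counters from the virsh domstats output in comment 9.

def guest_time_ns(cpu_time, system_time, user_time):
    return cpu_time - system_time - user_time

# 4.4.3 host (RHEL 8.3): almost all of cpu.time is accounted as system time,
# so the derived guest time collapses to ~0 and here even goes negative,
# which is the related negative-usage bug mentioned above.
g_443 = guest_time_ns(314_820_356_830, 313_730_000_000, 1_120_000_000)

# 4.4.2 host (RHEL 8.2): system time is small, so the guest time stays
# close to the full cpu.time, giving the expected high percentage.
g_442 = guest_time_ns(327_226_791_989, 2_540_000_000, 1_230_000_000)
```

Here g_443 comes out slightly negative while g_442 is about 99% of its cpu.time, so the same formula yields ~0 on 8.3 and ~100% on 8.2 purely because of the system-time accounting change.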
The other bug I mentioned above is bz 1878803
I did some additional testing to figure out which downstream kernel introduced this change. My testing environment was RHEL-8.3, where I was starting a simple VM based on rhel-guest-image-8.2-290.x86_64.qcow2, and the only thing I changed between test runs was the host kernel.

By bisecting I managed to find the "broken" kernel:

kernel-4.18.0-233.el8.x86_64
cat /sys/fs/cgroup/cpuacct/machine.slice/machine-qemu\\x2d1\\x2drhel.scope/cpuacct.stat
user 199
system 1668

Using RHEL-8.3 with the preceding kernel gave results similar to RHEL-8.2:

kernel-4.18.0-232.el8.x86_64
cat /sys/fs/cgroup/cpuacct/machine.slice/machine-qemu\\x2d1\\x2drhel.scope/cpuacct.stat
user 201
system 377

So my guess is that some patches backported into kernel-4.18.0-233.el8.x86_64 changed the way system time is computed. The question now is whether the reported CPU system stat is correct and oVirt has to change the way it calculates CPU usage, or whether the change in the kernel is incorrect.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
(In reply to Pavel Hrdina from comment #17)
> So my guess is that some patches backported to kernel-4.18.0-233.el8.x86_64
> changed the way how system time is computed.

Bug 1833096 looks like the relevant change, with a series of patches about time accounting.

Phil, I see you added yourself to this bug, so you're aware. Any idea?
Tested with baseurl=http://brew-task-repos.usersys.redhat.com/repos/scratch/pauld/kernel/4.18.0/232.el8.bz1833096_test1/$basearch/

The behavior is still incorrect:

<statistic href="/ovirt-engine/api/vms/5271a6e3-73d7-4358-b5d7-84b31af24a9f/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>0.27</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/5271a6e3-73d7-4358-b5d7-84b31af24a9f" id="5271a6e3-73d7-4358-b5d7-84b31af24a9f"/>
</statistic>

while <datum>100</datum> is expected.

The numbers are similar to the previous test:

on guest:

cat /sys/fs/cgroup/cpu/cpuacct.stat
user 95280
system 36696

on host:

/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d2\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 229
system 141009

virsh -r domstats golden_env_mixed_virtio_0 | grep cpu
  cpu.time=1533392880883
  cpu.user=2370000000
  cpu.system=1527770000000
  cpu.cache.monitor.count=0
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=1524320000000
  vcpu.0.wait=0
Hi, I've tested with this last kernel version and got the correct behavior. For a 100% CPU-loaded VM I got the expected statistic:

<statistic href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>100</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe" id="e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe"/>
</statistic>

on guest:

cat /sys/fs/cgroup/cpu/cpuacct.stat
user 18237
system 6679

on host:

[root@puma42 yum.repos.d]# cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 169
system 503

[root@puma42 yum.repos.d]# virsh -r domstats golden_env_mixed_virtio_0 | grep cpu
  cpu.time=387282366516
  cpu.user=1710000000
  cpu.system=5100000000
  cpu.cache.monitor.count=0
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=383910000000
  vcpu.0.wait=0

I also performed the second test from https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c0, and it also showed the correct values: <datum>50</datum>
*** Bug 1878803 has been marked as a duplicate of this bug. ***
Yes, I think there is no problem with that.
I did the test with this patch. The values are invalid again:

<statistic href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
  <name>cpu.current.guest</name>
  <description>CPU used by guest</description>
  <kind>gauge</kind>
  <type>decimal</type>
  <unit>percent</unit>
  <values>
    <value>
      <datum>0.5</datum>
    </value>
  </values>
  <vm href="/ovirt-engine/api/vms/e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe" id="e7d66e4f-9d5b-407c-b746-0c3dd6c3c8fe"/>
</statistic>

[root@dhcp163-120 ~]# cat /sys/fs/cgroup/cpu/cpuacct.stat
user 21937
system 8332

[root@puma42 ~]# cat /sys/fs/cgroup/cpu/machine.slice/machine-qemu\\x2d1\\x2dgoldenenvmixedvirtio0.scope/cpuacct.stat
user 188
system 33842

[root@puma42 ~]# virsh -r domstats golden_env_mixed_virtio_0 | grep cpu
  cpu.time=495255742261
  cpu.user=1920000000
  cpu.system=497340000000
  cpu.cache.monitor.count=0
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=490690000000
  vcpu.0.wait=0
The test in https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c36 was performed after installing the builds from comment #32 & #35:

1. Updated baseurl=http://brew-task-repos.usersys.redhat.com/repos/scratch/pauld/kernel/4.18.0/232.el8.bz1833096_test3/$basearch/ in kernel.repo
2. dnf install kernel-4.18.0-232.el8.bz1833096_test3
3. rpm -iv kernel-5.9.0.ark_latest+-1.x86_64.rpm
   rpm -iv kernel-devel-5.9.0.ark_latest+-1.x86_64.rpm
   rpm -iv kernel-headers-5.9.0.ark_latest+-1.x86_64.rpm
4. Host reboot
Hi, I repeated the test twice:

uname -a
Linux puma43.scl.lab.tlv.redhat.com 4.18.0-232.el8.bz1833096_test3.x86_64 #1 SMP Wed Oct 14 14:29:47 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

uname -a
Linux puma43.scl.lab.tlv.redhat.com 5.9.0.ark_latest+ #1 SMP Wed Oct 14 17:27:46 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@puma43 rpms]# rpm -qa | grep kernel-5.9.0.ark_latest+-1.x86_64
kernel-5.9.0.ark_latest+-1.x86_64
[root@puma43 rpms]# rpm -qa | grep kernel-devel-5.9.0.ark_latest+-1.x86_64
kernel-devel-5.9.0.ark_latest+-1.x86_64
[root@puma43 rpms]# rpm -qa | grep kernel-headers-5.9.0.ark_latest+-1.x86_64
kernel-headers-5.9.0.ark_latest+-1.x86_64

Unfortunately, the results are the same: wrong statistic values in both tests.
Thanks to Nitesh, the host was updated:

uname -a
Linux puma42.scl.lab.tlv.redhat.com 4.18.0-240.el8.bz1876937.x86_64 #1 SMP Tue Oct 20 17:47:57 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

It works correctly; the statistics value for cpu.current.guest is correct. Tested for both scenarios reported in https://bugzilla.redhat.com/show_bug.cgi?id=1876937#c0
Hi Polina,

The problematic patch has been reverted in the 8.4 dev-test kernel (kernel-4.18.0-240.8.el8.dt2). I think it would be good if you or someone else could also verify this kernel in your environment, to make sure that things work as expected without any regression.

PS: I do remember that you tested a build that I provided previously; however, that build was on top of the 8.3 kernel.

Thanks
The latest available RHV installation is on RHEL 8.3. It looks like we don't have an easy way to test on 8.4; adding the host to the engine fails during host installation.
This bug is in NEW status for ovirt 4.4.4. The bug has not been acknowledged to be a blocker, please re-target.
verified on ovirt-engine-4.4.4.5-0.10.el8ev.noarch kernel 4.18.0-240.10.1.el8_3.x86_64
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.