Bug 1878803

Summary: Negative values for cpu.current.guest statistics <datum>-0.010</datum>
Product: [oVirt] ovirt-engine Reporter: Polina <pagranat>
Component: BLL.Virt Assignee: Arik <ahadas>
Status: CLOSED DUPLICATE QA Contact: meital avital <mavital>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.3.1 CC: ahadas, bugs, lrotenbe, mzamazal
Target Milestone: ovirt-4.4.3 Flags: pm-rhel: ovirt-4.4+
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-14 16:31:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Polina 2020-09-14 14:36:46 UTC
Description of problem: In v4.4.3, the cpu.current.guest statistics values returned by GET https://{{host}}/ovirt-engine/api/vms/{{vm_id}}/statistics are sometimes negative.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.3.2-0.19.el8ev.noarch
vdsm-http-4.40.28-1.el8ev.noarch
python3-libvirt-6.6.0-1.module+el8.3.0+7572+bcbf6b90.x86_64
ovirt-engine-4.4.3.1-0.7.el8ev.noarch
qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64

How reproducible: sometimes

Steps to Reproduce:
1. Configure a VM created from the latest infra template: pin to host and cpu topology 0#0 (Host Resources tab)
2. Load the CPU of the VM (can be done with a while loop)
3. Send GET https://{{host}}/ovirt-engine/api/vms/{{vm_id}}/statistics

Actual results: negative value for cpu.current.guest
<statistic href="/ovirt-engine/api/vms/1d7c99b8-c636-4780-92ec-e3c56132de75/statistics/ef802239-b74a-329f-9955-be8fea6b50a4" id="ef802239-b74a-329f-9955-be8fea6b50a4">
        <name>cpu.current.guest</name>
        <description>CPU used by guest</description>
        <kind>gauge</kind>
        <type>decimal</type>
        <unit>percent</unit>
        <values>
            <value>
                <datum>-0.010</datum>
            </value>
        </values>
        <vm href="/ovirt-engine/api/vms/1d7c99b8-c636-4780-92ec-e3c56132de75" id="1d7c99b8-c636-4780-92ec-e3c56132de75"/>
</statistic>

the whole response http://pastebin.test.redhat.com/901923
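
A minimal sketch for scanning a saved response for negative values, assuming the body has the shape shown above (statistic elements containing name and values/value/datum). The function name negative_statistics and the <statistics> wrapper element in the sample are assumptions made for illustration; only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET


def negative_statistics(xml_text):
    """Return (name, datum) pairs for every statistic with a negative datum."""
    root = ET.fromstring(xml_text)
    found = []
    # iter() also yields the root itself, so this works whether the root
    # is a single <statistic> or a wrapper element around many of them.
    for stat in root.iter("statistic"):
        name = stat.findtext("name")
        for datum in stat.iter("datum"):
            if datum.text is None:
                continue
            value = float(datum.text)
            if value < 0:
                found.append((name, value))
    return found


# Trimmed-down sample; the wrapper element is an assumption.
sample = """
<statistics>
  <statistic id="ef802239-b74a-329f-9955-be8fea6b50a4">
    <name>cpu.current.guest</name>
    <values><value><datum>-0.010</datum></value></values>
  </statistic>
</statistics>
"""
print(negative_statistics(sample))  # [('cpu.current.guest', -0.01)]
```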

Expected results:


Additional info:

Comment 1 Liran Rotenberg 2020-09-14 15:04:08 UTC
We saw that this is possible in our VDSM code, depending on the ratio between the counters. The values are monotonically increasing.
An example given by Arik:
Sample1:
cpu.time = 1000
cpu.user = 200
cpu.system = 700
so cpu.guest = 100

Sample2:
cpu.time = 2000
cpu.user = 300
cpu.system = 1650
so cpu.guest = 50

In VDSM calculation:
cpuUsage = last sys + last user = 1950.
cpu_sys = (last user - first user) + (last sys - first sys) = 1050.
cpuUser = last time - first time - cpu_sys = -50
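
The arithmetic above can be replayed with a short sketch. It mirrors only the three expressions quoted in this comment and is not the actual VDSM code; the function name vdsm_style_cpu and the dictionary layout are assumptions, and any scaling by the sampling interval is left out.

```python
def vdsm_style_cpu(first, last):
    """Apply the three expressions from the comment to two raw samples."""
    # cpuUsage = last sys + last user
    cpu_usage = last["cpu.system"] + last["cpu.user"]
    # cpu_sys = (last user - first user) + (last sys - first sys)
    cpu_sys = (last["cpu.user"] - first["cpu.user"]) + (
        last["cpu.system"] - first["cpu.system"]
    )
    # cpuUser = last time - first time - cpu_sys
    cpu_user = (last["cpu.time"] - first["cpu.time"]) - cpu_sys
    return cpu_usage, cpu_sys, cpu_user


sample1 = {"cpu.time": 1000, "cpu.user": 200, "cpu.system": 700}
sample2 = {"cpu.time": 2000, "cpu.user": 300, "cpu.system": 1650}
print(vdsm_style_cpu(sample1, sample2))  # (1950, 1050, -50)
```

The last value goes negative whenever user plus system time grows faster than cpu.time between the two samples, which is what happens in Sample2.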

This is not new. The question that arises:
Do we document that the value may be negative (although that makes no sense) and should be treated as 0, or do we return 0 ourselves?
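
For illustration only, the "return 0 ourselves" option could look like the sketch below. The helper name clamp_non_negative is hypothetical; this is not an existing VDSM or engine function, and it only hides the symptom rather than explaining why the counters disagree.

```python
def clamp_non_negative(value):
    """Return the value as a float, replacing any negative value with 0.0."""
    return max(0.0, float(value))


print(clamp_non_negative(-50))   # 0.0
print(clamp_non_negative(12.5))  # 12.5
```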

Comment 2 Milan Zamazal 2020-09-22 17:58:48 UTC
(In reply to Liran Rotenberg from comment #1)
> We saw it possible in our VDSM code, depends on the ratio. The values are
> monotonic increasing.
> An example given by Arik:
> Sample1:
> cpu.time = 1000
> cpu.user = 200
> cpu.system=700
> so cpu.guest = 100
> 
> Sample2:
> cpu.time = 2000
> cpu.user =300
> cpu.sys = 1650
> so cpu.guest=50

I fail to understand what the cpu.user and cpu.system values mean. The libvirt documentation is unclear about how the values relate to the VM.

> In VDSM calculation:
> cpuUsage = last sys + last user = 1950.
> cpu_sys = (last user - first user) + (last sys - first sys) = 1050.
> cpuUser = last time - first time - cpu_sys = -50

The fact that cpuUser is negative for a loaded VM and cpuSys is 100 makes me suspect that the computation in Vdsm is wrong. I can't reproduce the problem, though; I get high cpuUser values and low cpuSys values, as expected (with an older libvirt version).

> It's not new, the question rising:
> Do we tell that it may be negative (although it doesn't make sense) and
> should treat as 0 or to return 0 ourselves.

A negative value makes no sense, but before trying to fix it we should understand the exact meaning of the values and the circumstances under which the problem can be reproduced.

Comment 3 Milan Zamazal 2020-09-24 07:32:30 UTC
(In reply to Milan Zamazal from comment #2)

> The fact that cpuUser is negative for a loaded VM and cpuSys is 100 makes me
> to suspect that the computation in Vdsm is wrong. I can't reproduce the
> problem though and I get high cpuUser values and low cpuSys values as
> expected (with an older libvirt version).

After upgrading libvirt and QEMU, I can reproduce the error. So I suspect a platform regression.

Comment 4 Arik 2020-10-14 16:31:13 UTC
(In reply to Milan Zamazal from comment #3)
> After upgrading libvirt and QEMU, I can reproduce the error. So I suspect a
> platform regression.

Indeed

*** This bug has been marked as a duplicate of bug 1876937 ***