Bug 1078897

Summary: User and System CPU Usage have values higher than 100%
Product: Red Hat Enterprise Virtualization Manager
Reporter: Shirly Radco <sradco>
Component: ovirt-engine-dwh
Assignee: Shirly Radco <sradco>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: urgent
Version: 3.3.0
CC: aberezin, bazulay, danken, gklein, iheim, juwu, lnovich, lpeer, pstehlik, rbalakri, Rhev-m-bugs, sherold, sradco, ybronhei, yeylon, ylavi
Target Milestone: ---
Keywords: ZStream
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: oVirt 3.5 Alpha 1
Doc Type: Bug Fix
Doc Text:
Previously, the CPU usage statistics (expressed as a percentage) were calculated incorrectly when more than one CPU core was in use on the same host. Now, the CPU usage statistics for hosts with more than one core correctly take the number of cores on the host into account when calculating overall CPU usage, and the upper limit is displayed as 100%.
Story Points: ---
Clone Of:
: 1098779
Environment:
Last Closed: 2015-02-11 18:14:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1098779, 1142923, 1156165    
Attachments:
Description Flags
screenshot none

Description Shirly Radco 2014-03-20 13:58:40 UTC
Created attachment 876864 [details]
screenshot

Description of problem:

Users' Spice Sessions Activity (BR45) - 
"Avg User CPU Usage %" and "Max User CPU Usage %" -
Need to investigate how these percentages can be higher than 100. 

Comment 1 Yaniv Lavi 2014-03-20 17:21:43 UTC
Please check how VDSM reports these values and how the engine processes them.
The value is probably over 100% when more than one full core is utilized.



Yaniv

Comment 2 Arthur Berezin 2014-03-23 09:38:37 UTC
Yaniv, what is the implementation plan here?
The displayed measurement (a percentage) cannot be more than 100.

Comment 3 Yaniv Lavi 2014-03-24 06:00:24 UTC
Yaniv, the columns we collect are cpu_sys & cpu_user in the engine table vm_statistics. I need you to check whether they are collected as percentage values capped at 100% or as 100% per core. This check needs to be done on two levels:
1. What units does VDSM report these in?
2. Is there any conversion when the value is inserted into the table?

Can you please check and reply with the corresponding code sections?

Yaniv

Comment 4 Yaniv Bronhaim 2014-03-24 07:27:01 UTC
(vdsm/sampling.py:HostStatsThread.get():430)

        hs0, hs1 = self._samples[0], self._samples[-1]
        interval = hs1.timestamp - hs0.timestamp
        jiffies = (hs1.pidcpu.user - hs0.pidcpu.user) % (2 ** 32)
        stats['cpuUserVdsmd'] = jiffies / interval
        jiffies = (hs1.pidcpu.sys - hs0.pidcpu.sys) % (2 ** 32)
        stats['cpuSysVdsmd'] = jiffies / interval

        jiffies = (hs1.totcpu.user - hs0.totcpu.user) % (2 ** 32)
        stats['cpuUser'] = jiffies / interval / self._ncpus
        jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)
        stats['cpuSys'] = jiffies / interval / self._ncpus
        stats['cpuIdle'] = max(0.0,
                               100.0 - stats['cpuUser'] - stats['cpuSys'])
        stats['memUsed'] = hs1.memUsed
        stats['anonHugePages'] = hs1.anonHugePages
        stats['cpuLoad'] = hs1.cpuLoad

I don't see any conversion of those values on the engine side.

Hope it helps.

Comment 5 Yaniv Lavi 2014-03-24 09:09:17 UTC
Please explain:
jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)
stats['cpuSys'] = jiffies / interval / self._ncpus 

Does this come out as a percentage capped at 100%, or can it be over 100% (with more than one core)?


Yaniv

Comment 6 Yaniv Bronhaim 2014-03-24 10:07:52 UTC
AFAIU it's not in percent. jiffies is a 32-bit value; it is divided by the interval and then by the number of CPUs. On one of my hosts, cpuSys = '1.43'.

danken, if you can elaborate more on this calculation, both Yanivs can learn from it.

Comment 7 Dan Kenigsberg 2014-03-24 15:04:05 UTC
"jiffies" is a measure of the absolute time spent in kernel mode by the relevant qemu process during the measurement interval.

(jiffies / interval) is the time spent per second. Since most of this time is used by vCPUs, it is then divided by _ncpus.

I do not know qemu well enough, but I think that in theory there could be sickly cases where this ends up above 100%: if all vCPUs run amok, AND qemu uses an additional host CPU for non-vCPU tasks.
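
The calculation described above can be sketched as follows. This is a minimal illustration, not the Vdsm implementation: the function name cpu_percent and the ticks_per_second parameter are hypothetical, and the default of 100 ticks per second is an assumption under which (jiffies / interval) reads directly as a percentage of one CPU. The modulo handles the 32-bit counter wraparound seen in the snippet from comment 4.

```python
COUNTER_BITS = 32  # the snippet in comment 4 wraps deltas modulo 2 ** 32


def cpu_percent(prev_ticks, curr_ticks, interval_s, ncpus, ticks_per_second=100):
    """Return CPU usage in percent, averaged over ncpus.

    Hypothetical helper: ticks_per_second=100 is an assumption, chosen so that
    the result equals jiffies / interval / ncpus as in the quoted code.
    """
    if interval_s <= 0 or ncpus <= 0:
        raise ValueError("interval_s and ncpus must be positive")
    # Tick delta between the two samples, tolerant of counter wraparound.
    delta = (curr_ticks - prev_ticks) % (2 ** COUNTER_BITS)
    busy_seconds = delta / ticks_per_second
    return 100.0 * busy_seconds / interval_s / ncpus


# One CPU fully busy for a 15 second interval.
one_cpu_busy = cpu_percent(0, 1500, 15, 1)

# Same load, but the counter wrapped between the two samples.
wrapped = cpu_percent(2 ** 32 - 100, 200, 3, 1)

# The theoretical case above: 4 vCPUs saturated plus one extra busy host
# thread, still divided by only 4.
over_limit = cpu_percent(0, 5 * 1500, 15, 4)
```

In the last call the process accumulates five CPUs' worth of time but is divided by four, which is how a value above 100 can appear without any arithmetic error.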

Comment 8 Yaniv Lavi 2014-03-24 16:16:24 UTC
(In reply to Dan Kenigsberg from comment #7)
> "jiffies" is a measure of the absolute time spent in kernel mode by the
> relevant qemu process during the measurement interval.
> 
> (jiffies / interval) is the time spent per second. Since most of this time
> is used by vCPUs, it is then divided by _ncpus.
> 
> I do not know qemu well enough, but think that in theory there could be
> sickly cases where this ends up more than 100%: if all vCPUs run amok, AND
> qemu uses an additional host cpu for non-vCPU tasks.

So what is the unit here? percent capped at 100% in theory?

Comment 9 Dan Kenigsberg 2014-03-24 18:19:11 UTC
It's a percentage, where 100% means that N physical CPUs exclusively serve your N vCPUs. I described a theoretical scenario where Vdsm could report more than 100%.

(We should stop messing with this, and report absolute values to Engine, which makes much more sense for billing, see bug 1066570)

Comment 10 Yaniv Lavi 2014-03-25 04:35:05 UTC
(In reply to Dan Kenigsberg from comment #9)
> It's percentage, when the 100% is when N physical cpus exclusively serve
> your N vCPUs. I described a theoretical scenario where Vdsm could report
> more than 100%.
> 
> (We should stop messing with this, and report absolute values to Engine,
> which makes much more sense for billing, see bug 1066570)

OK, then this is what we wanted: capped at 100%.
Bdagan, can you tell us how you got these values, so we can check which part of VDSM's calculation and reporting produces more than 100%?




Yaniv

Comment 11 Barak Dagan 2014-03-25 13:05:48 UTC
Yaniv,
When did I report these values? I don't have any sequence generating % > 100.
However, there is a closed BZ about negative mem - https://bugzilla.redhat.com/show_bug.cgi?id=866186.

Comment 12 Yaniv Lavi 2014-03-25 16:14:13 UTC
(In reply to Barak Dagan from comment #11)
> Yaniv,
> When did I report these values? I don't have any sequence generating % >
> 100.
> However, there is a closed BZ about negative mem -
> https://bugzilla.redhat.com/show_bug.cgi?id=866186.

See attachment.



Yaniv

Comment 13 Barak Dagan 2014-03-25 16:23:05 UTC
I don't know how to reproduce it - that's not my screenshot.

Comment 14 Yaniv Lavi 2014-03-25 16:30:01 UTC
(In reply to Barak Dagan from comment #13)
> I don't know how to reproduce it - that's not my screenshot.

OK, you are correct. This came from rhev-tlv. It's the jenkins-ci VM.
Dan, can you maybe investigate how this happens there? It's at almost 400%, which is a lot more than 100%.



Yaniv

Comment 15 Dan Kenigsberg 2014-04-10 13:06:52 UTC
Average user CPU usage of 398%?! It's unlikely that it's the theoretical over-100% I suggested above - but it could be - if qemu is really buggy.

We could hide this odd case (in reports, Engine, or vdsm) but I'd rather have this issue reproduced and dug into. I do not see an obvious bug in Vdsm, and the real problem may lie even deeper (in qemu).

Comment 16 Yaniv Lavi 2014-04-10 15:40:38 UTC
(In reply to Dan Kenigsberg from comment #15)
> Average user CPU usage of 398%?! It's unlikely that it's the theoretical
> over-100% I suggested above - but it could be - if qemu is really buggy.
> 
> We could hide this odd case (in reports, Engine, or vdsm) but I'd rather
> have this issue reproduced and dug into. I do not see an obvious bug in
> Vdsm, and the real problem may lie even deeper (in qemu).

I don't think it's that hard to test. You just need to look at the peak usage time of this VM. Can you connect us to someone from qemu?



Yaniv

Comment 17 Dan Kenigsberg 2014-04-10 17:50:05 UTC
Is this condition easily reproducible? If so, please reproduce it and see if `top` reports the same odd values. If it does, it's a qemu bug (and you should bug mst). If top is fine, and `vdsClient -s 0 getAllVmStats` is not - it's a vdsm bug.

Please report details such as kernel and qemu versions, and qemu command line.

Comment 18 Barak 2014-04-15 16:47:21 UTC
Shirly,

Can we have a reproducing environment and see which component is responsible for this bug?
Please check again in RHEV-TLV.

Comment 20 Dima Kuznetsov 2014-04-22 15:06:25 UTC
VDSM calculates values directly from libvirt's getCPUStats() in vdsm/virt/vm.py:

    def _sampleCpu(self):
        cpuStats = self._vm._dom.getCPUStats(True, 0)
        return cpuStats[0]

The first param to getCPUStats tells libvirt whether to return per-CPU stats, or their total sum. While each CPU is capped (after some calculations) at 100%, the sum is capped at (#CPU*100)% and can report values larger than 100.

The engine code stores these CPU values in vm_statistics.cpu_user and vm_statistics.cpu_sys, which are the ones displayed in the attached picture. The engine itself expects these values to be larger than 100 and holds another column in the DB, vm_statistics.usage_cpu_percent, which it uses to display the CPU% in webadmin and which is calculated the following way (VM.java, 1283):

Double percent = (getCpuSys() + getCpuUser()) / vm.getNumOfCpus();
setUsageCpuPercent(percent.intValue());
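
To make the arithmetic concrete, here is a small Python sketch of the same normalization. The function names are hypothetical, and the 4-vCPU count is an assumption used only for illustration (the thread does not state how many vCPUs the jenkins-ci VM has). It is not the engine or DWH code.

```python
def normalize_total_cpu(total_percent, num_cpus):
    """Scale a summed-over-CPUs percentage (0 .. num_cpus * 100) to 0 .. 100.

    Hypothetical helper for illustration only.
    """
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return total_percent / num_cpus


def usage_cpu_percent(cpu_sys, cpu_user, num_cpus):
    """Mirror the quoted Java expression: (sys + user) / numOfCpus, truncated."""
    return int(normalize_total_cpu(cpu_sys + cpu_user, num_cpus))


# Assumed example: a reported user CPU total of 398% on a VM with 4 vCPUs.
per_vm_user = normalize_total_cpu(398.0, 4)

# Combined figure, with an assumed 2% of system time on top.
combined = usage_cpu_percent(2.0, 398.0, 4)
```

Under that assumption, the 398% seen in the report would correspond to 99.5% once divided by the number of vCPUs.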

Comment 21 Shirly Radco 2014-05-11 06:57:37 UTC
Arthur, I fixed the dwh view in the engine db, so future values of 'user_cpu_usage_percent' and 'sys_cpu_usage_percent' should be correct according to the number of CPUs of the VM.


Do you think we should retroactively update the values of 'user_cpu_usage_percent', 'max_user_cpu_usage_percent', 'system_cpu_usage_percent', 'max_system_cpu_usage_percent' in the history db?

Please keep in mind that these values need to be updated in 5 tables (1 samples table, 2 hourly tables, 2 daily tables), so it might be a heavy transaction.

Comment 23 Arthur Berezin 2014-05-19 09:42:46 UTC
Following an IRC chat with Shirly: the update should be applied retroactively.

Comment 26 errata-xmlrpc 2015-02-11 18:14:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0177.html