1078897 – User and System CPU Usage have values higher than 100%

Summary: User and System CPU Usage have values higher than 100%

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine-dwh
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Shirly Radco
QA Contact:	Petr Matyáš
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:	1098779 rhev3.5beta 1156165

Reported:	2014-03-20 13:58 UTC by Shirly Radco
Modified:	2016-02-10 19:29 UTC
CC List:	16 users
Fixed In Version:	oVirt 3.5 Alpha 1
Doc Type:	Bug Fix
Doc Text:	Previously, the CPU usage statistics did not correctly calculate the CPU usage (expressed as a percent) when there was more than one CPU core being used within the same host. Now, the CPU usage statistics for hosts with more than one core correctly take into account the number of cores on the hosts when calculating overall CPU usage, and the upper limit displayed is 100%.
Clone Of:
Clones:	1098779
Environment:
Last Closed:	2015-02-11 18:14:37 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments
screenshot (165.07 KB, image/png) 2014-03-20 13:58 UTC, Shirly Radco

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2015:0177	normal	SHIPPED_LIVE	rhevm-dwh 3.5 bug fix and enhancement update	2015-02-11 23:11:50 UTC
oVirt gerrit	27556	master	MERGED	history: update user and system cpu usage percent	Never
oVirt gerrit	27559	master	MERGED	history: Updated calculation of cpu usage	Never
oVirt gerrit	27676	ovirt-engine-3.4	MERGED	history: update user and system cpu usage percent	Never
oVirt gerrit	27683	ovirt-engine-3.4	MERGED	history: Updated calculation of cpu usage	Never
oVirt gerrit	27719	master	MERGED	history: modified 03_05_0020 upgrade file	Never

Description Shirly Radco 2014-03-20 13:58:40 UTC

Created attachment 876864 [details]
screenshot

Description of problem:

Users' Spice Sessions Activity (BR45) - 
"Avg User CPU Usage %" and "Max User CPU Usage %" -
Need to investigate how these percentages can be higher than 100. 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yaniv Lavi 2014-03-20 17:21:43 UTC

Please check how VDSM reports these values and how the engine processes them.
They are probably over 100% when more than one full core is utilized.



Yaniv

Comment 2 Arthur Berezin 2014-03-23 09:38:37 UTC

Yaniv, what is the implementation plan here?
The displayed measurement (percentage) cannot be more than 100.

Comment 3 Yaniv Lavi 2014-03-24 06:00:24 UTC

Yaniv, the columns we collect are cpu_sys & cpu_user in the engine table vm_statistics. I need you to check whether they are collected as percentage values capped at 100% or as 100% per core. This check needs to be done on two levels:
1. What units does VDSM report these in?
2. Is there any conversion when the value is inserted into the table?

Can you please check and reply with corresponding code sections?

Yaniv

Comment 4 Yaniv Bronhaim 2014-03-24 07:27:01 UTC

(vdsm/sampling.py:HostStatsThread.get():430)
                                                      
        hs0, hs1 = self._samples[0], self._samples[-1]                          
        interval = hs1.timestamp - hs0.timestamp                                
        jiffies = (hs1.pidcpu.user - hs0.pidcpu.user) % (2 ** 32)               
        stats['cpuUserVdsmd'] = jiffies / interval                              
        jiffies = (hs1.pidcpu.sys - hs0.pidcpu.sys) % (2 ** 32)                 
        stats['cpuSysVdsmd'] = jiffies / interval                               
                                                                                
        jiffies = (hs1.totcpu.user - hs0.totcpu.user) % (2 ** 32)               
        stats['cpuUser'] = jiffies / interval / self._ncpus                     
        jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)                 
        stats['cpuSys'] = jiffies / interval / self._ncpus                      
        stats['cpuIdle'] = max(0.0,                                             
                               100.0 - stats['cpuUser'] - stats['cpuSys'])      
        stats['memUsed'] = hs1.memUsed                                          
        stats['anonHugePages'] = hs1.anonHugePages                              
        stats['cpuLoad'] = hs1.cpuLoad 

I don't see any conversion of those values on the engine side.

Hope it helps.

Comment 5 Yaniv Lavi 2014-03-24 09:09:17 UTC

Please explain:
jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)
stats['cpuSys'] = jiffies / interval / self._ncpus 

Does this come out as a percent capped at 100%, or is it possible for it to be over 100% (with more than one core)?


Yaniv

Comment 6 Yaniv Bronhaim 2014-03-24 10:07:52 UTC

AFAIU it's not in percent. jiffies is a 32-bit value, which is divided by the interval and then again by the number of CPUs. On one of my hosts cpuSys = '1.43'.

danken, if you can elaborate more on this calculation, both Yanivs can learn from it.

Comment 7 Dan Kenigsberg 2014-03-24 15:04:05 UTC

"jiffies" is a measure of the absolute time spent in kernel mode by the relevant qemu process during the measurement interval.

(jiffies / interval) is the time spent per second. Since most of this time is used by vCPUs, it is then divided by _ncpus.

I do not know qemu well enough, but think that in theory there could be sickly cases where this ends up more than 100%: if all vCPUs run amok, AND qemu uses an additional host cpu for non-vCPU tasks.
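
A minimal sketch of the calculation described above, not the actual Vdsm code. It assumes the counters tick 100 times per second (the usual USER_HZ on Linux), which is why jiffies / interval can be read directly as a percentage of one CPU. The function name `cpu_percent` and the `hz` parameter are hypothetical, introduced only for illustration.

```python
def cpu_percent(prev_jiffies, curr_jiffies, interval_s, ncpus, hz=100):
    """Return CPU usage as a percentage of the capacity of `ncpus` CPUs.

    Assumption: counters are 32-bit and advance `hz` times per second
    of CPU time (hz=100 is an assumed, typical value).
    """
    # The modulo keeps the delta correct if the 32-bit counter wrapped
    # around between the two samples.
    delta = (curr_jiffies - prev_jiffies) % (2 ** 32)
    busy_seconds = delta / hz
    # Fraction of the interval spent busy, spread over ncpus, in percent.
    return 100.0 * busy_seconds / interval_s / ncpus


if __name__ == "__main__":
    # Four fully busy CPUs over a 2 second interval: 4 * 2 * 100 jiffies.
    print(cpu_percent(0, 800, 2.0, ncpus=4))
    # Same load, but not divided by the CPU count.
    print(cpu_percent(0, 800, 2.0, ncpus=1))
```

If the sum over all CPUs is not divided by the CPU count (ncpus=1 above), a process keeping four CPUs busy comes out at 400%, which is close to the value in the attached screenshot.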

Comment 8 Yaniv Lavi 2014-03-24 16:16:24 UTC

(In reply to Dan Kenigsberg from comment #7)
> "jiffies" is a measure of the absolute time spent in kernel mode by the
> relevant qemu process during the measurement interval.
> 
> (jiffies / interval) is the time spent per second. Since most of this time
> is used by vCPUs, it is then divided by _ncpus.
> 
> I do not know qemu well enough, but think that in theory there could be
> sickly cases where this ends up more than 100%: if all vCPUs run amok, AND
> qemu uses an additional host cpu for non-vCPU tasks.

So what is the unit here? A percentage capped at 100% in theory?

Comment 9 Dan Kenigsberg 2014-03-24 18:19:11 UTC

It's a percentage, where 100% means that N physical CPUs exclusively serve your N vCPUs. I described a theoretical scenario where Vdsm could report more than 100%.

(We should stop messing with this, and report absolute values to Engine, which makes much more sense for billing, see bug 1066570)

Comment 10 Yaniv Lavi 2014-03-25 04:35:05 UTC

(In reply to Dan Kenigsberg from comment #9)
> It's percentage, when the 100% is when N physical cpus exclusively serve
> your N vCPUs. I described a theoretical scenario where Vdsm could report
> more than 100%.
> 
> (We should stop messing with this, and report absolute values to Engine,
> which makes much more sense for billing, see bug 1066570)

OK, then this is what we wanted: capped at 100%.
Bdagan, can you tell us how you got these values, so we can check which part of VDSM's calculation and reporting causes more than 100%?




Yaniv

Comment 11 Barak Dagan 2014-03-25 13:05:48 UTC

Yaniv,
When did I report these values? I don't have any sequence generating % > 100.
However, there is a closed BZ about negative mem - https://bugzilla.redhat.com/show_bug.cgi?id=866186.

Comment 12 Yaniv Lavi 2014-03-25 16:14:13 UTC

(In reply to Barak Dagan from comment #11)
> Yaniv,
> When did I report these values? I don't have any sequence generating % >
> 100.
> However, there is a closed BZ about negative mem -
> https://bugzilla.redhat.com/show_bug.cgi?id=866186.

See attachment.



Yaniv

Comment 13 Barak Dagan 2014-03-25 16:23:05 UTC

Don't know how to reproduce - that's not my screenshot.

Comment 14 Yaniv Lavi 2014-03-25 16:30:01 UTC

(In reply to Barak Dagan from comment #13)
> Don't know how to reproduce - that's not my screenshot.

OK, you are correct. This came from rhev-tlv; it's the jenkins-ci VM.
Dan, can you maybe investigate how this happens there? It's at almost 400%, which is a lot more than 100%.



Yaniv

Comment 15 Dan Kenigsberg 2014-04-10 13:06:52 UTC

Average user cpu usage of %398 ?! It's unlikely that it's the theoretical over-100% I suggested above - but it could be - if qemu is really buggy.

We could hide this odd case (in reports, Engine, or vdsm) but I'd rather have this issue reproduced and dug into. I do not see an obvious bug in Vdsm, and the real problem may lie even deeper (in qemu).

Comment 16 Yaniv Lavi 2014-04-10 15:40:38 UTC

(In reply to Dan Kenigsberg from comment #15)
> Average user cpu usage of %398 ?! It's unlikely that it's the theoretical
> over-100% I suggested above - but it could be - if qemu is really buggy.
> 
> We could hide this odd case (in reports, Engine, or vdsm) but I'd rather
> have this issue reproduced and dug into. I do not see an obvious bug in
> Vdsm, and the real problem may lie even deeper (in qemu).

I don't think it's that much of an issue to test. You just need to look at the peak usage time of this VM. Can you connect us to someone from qemu?



Yaniv

Comment 17 Dan Kenigsberg 2014-04-10 17:50:05 UTC

Is this condition easily reproducible? If so, please reproduce, see if `top` reports the same odd values. If it does, it's a qemu bug (and you should bug mst). If top is fine, and `vdsClient -s 0 getAllVmStats` is not - it's a vdsm bug.

Please report details such as kernel and qemu versions, and qemu command line.

Comment 18 Barak 2014-04-15 16:47:21 UTC

Shirly,

Can we have an environment that reproduces this, and see which component is responsible for this bug?
Please check again in RHEV-TLV.

Comment 20 Dima Kuznetsov 2014-04-22 15:06:25 UTC

VDSM calculates values directly from libvirt's getCPUStats() in vdsm/virt/vm.py:

    def _sampleCpu(self):
        cpuStats = self._vm._dom.getCPUStats(True, 0)
        return cpuStats[0]

The first param to getCPUStats tells libvirt whether to return per-CPU stats or their total sum. While each CPU is capped (after some calculations) at 100%, the sum is capped at (#CPU*100)% and can report values larger than 100.

The engine code stores these CPU value calculations in vm_statistics.cpu_user and vm_statistics.cpu_sys which are displayed in the attached picture. The engine itself expects these values to be larger than 100 and holds another column in the DB, vm_statistics.usage_cpu_percent, which it uses to display the CPU% in webadmin and is calculated the following way (VM.java, 1283):

Double percent = (getCpuSys() + getCpuUser()) / vm.getNumOfCpus();
setUsageCpuPercent(percent.intValue());
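
The normalization above can be sketched as follows. This is a Python paraphrase of the quoted Java lines, not the engine code. The function names are hypothetical, and `normalize_per_cpu` only illustrates the kind of per-CPU division that the user and system columns would need in order to stay within 100%; that is an assumption about the fix rather than something quoted in this bug.

```python
def usage_cpu_percent(cpu_sys, cpu_user, num_of_cpus):
    """Mirror of the quoted engine lines: sum, divide by vCPUs, truncate."""
    percent = (cpu_sys + cpu_user) / num_of_cpus
    # Java's Double.intValue() truncates toward zero, as does int() here.
    return int(percent)


def normalize_per_cpu(value, num_of_cpus):
    """Hypothetical helper: scale a summed-over-CPUs value to 0..100."""
    return value / num_of_cpus


if __name__ == "__main__":
    # A 4-vCPU VM reporting 398% user and 2% system, summed over all vCPUs.
    print(usage_cpu_percent(2.0, 398.0, 4))
    print(normalize_per_cpu(398.0, 4))
```

Without the division, the raw cpu_user value for such a VM is what ends up displayed, which matches the almost-400% reading discussed earlier in this bug.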

Comment 21 Shirly Radco 2014-05-11 06:57:37 UTC

Arthur, I fixed the dwh view in the engine db, so future values of 'user_cpu_usage_percent' and 'sys_cpu_usage_percent' should be correct according to the number of CPUs of the VM.


Do you think we should retroactively update the values of 'user_cpu_usage_percent', 'max_user_cpu_usage_percent', 'system_cpu_usage_percent', 'max_system_cpu_usage_percent' in the history db?
 
Please keep in mind these values need to be updated in 5 tables and it might be a heavy transaction (1 samples table, 2 hourly tables, 2 daily tables).

Comment 23 Arthur Berezin 2014-05-19 09:42:46 UTC

Following an IRC chat with Shirly, the update should be applied retroactively.

Comment 26 errata-xmlrpc 2015-02-11 18:14:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0177.html
