Created attachment 876864: screenshot

Description of problem:
Users' Spice Sessions Activity (BR45): the "Avg User CPU Usage %" and "Max User CPU Usage %" columns show values above 100. We need to investigate how these percentages can be higher than 100.
Please check how VDSM reports these and how the engine processes this. Probably over 100% in case more than one full core is utilized. Yaniv
Yaniv, what is the implementation plan here? The displayed measurement (a percentage) cannot be more than 100.
Yaniv, the columns we collect are cpu_sys and cpu_user in the engine table vm_statistics. I need you to check whether they are collected as percentage values capped at 100% or as 100% per core. This check needs to be done on two levels:
1. What units does VDSM report these in?
2. Is there any conversion when the value is inserted into the table?
Can you please check and reply with the corresponding code sections? Yaniv
(vdsm/sampling.py:HostStatsThread.get():430)

    hs0, hs1 = self._samples[0], self._samples[-1]
    interval = hs1.timestamp - hs0.timestamp
    jiffies = (hs1.pidcpu.user - hs0.pidcpu.user) % (2 ** 32)
    stats['cpuUserVdsmd'] = jiffies / interval
    jiffies = (hs1.pidcpu.sys - hs0.pidcpu.sys) % (2 ** 32)
    stats['cpuSysVdsmd'] = jiffies / interval
    jiffies = (hs1.totcpu.user - hs0.totcpu.user) % (2 ** 32)
    stats['cpuUser'] = jiffies / interval / self._ncpus
    jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)
    stats['cpuSys'] = jiffies / interval / self._ncpus
    stats['cpuIdle'] = max(0.0, 100.0 - stats['cpuUser'] - stats['cpuSys'])
    stats['memUsed'] = hs1.memUsed
    stats['anonHugePages'] = hs1.anonHugePages
    stats['cpuLoad'] = hs1.cpuLoad

I don't see any conversion of those values on the engine side. Hope it helps.
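The calculation quoted above can be tried in isolation. This is a minimal sketch, not vdsm's code: `CpuSample` and `cpu_rate` are hypothetical names, and it assumes the counters are 32-bit jiffy counts ticking 100 times per second, so that jiffies per second per CPU reads directly as a percentage.

```python
from collections import namedtuple

# Hypothetical stand-in for one sample: a timestamp in seconds plus
# cumulative user and system jiffy counters.
CpuSample = namedtuple("CpuSample", "timestamp user sys")

# Assumption: the counters wrap at 32 bits, as the "% (2 ** 32)" in the
# quoted code suggests.
COUNTER_MODULUS = 2 ** 32


def cpu_rate(first, last, ncpus):
    """Return (user, sys) jiffies per second per CPU between two samples."""
    interval = last.timestamp - first.timestamp
    if interval <= 0:
        raise ValueError("samples must be in increasing time order")
    # The modulo keeps the delta correct when a counter wrapped around
    # between the two samples.
    user_jiffies = (last.user - first.user) % COUNTER_MODULUS
    sys_jiffies = (last.sys - first.sys) % COUNTER_MODULUS
    return user_jiffies / interval / ncpus, sys_jiffies / interval / ncpus
```

With 100 jiffies per second, `cpu_rate(CpuSample(0.0, 1000, 500), CpuSample(10.0, 3000, 900), 4)` gives 50.0 for user and 10.0 for system. Without the final division by `ncpus`, the same samples would read 200.0 and 40.0.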
Please explain:

    jiffies = (hs1.totcpu.sys - hs0.totcpu.sys) % (2 ** 32)
    stats['cpuSys'] = jiffies / interval / self._ncpus

Does this come out as a percentage capped at 100%, or can it go over 100% (with more than one core)? Yaniv
AFAIU it's not in percent. jiffies is a 32-bit value; it is divided by the interval and then by the number of CPUs. On one of my hosts it's cpuSys = '1.43'. danken, if you can elaborate more on this calculation, both Yanivs can learn from it.
"jiffies" is a measure of the absolute time spent in kernel mode by the relevant qemu process during the measurement interval. (jiffies / interval) is the time spent per second. Since most of this time is used by vCPUs, it is then divided by _ncpus. I do not know qemu well enough, but think that in theory there could be sickly cases where this ends up more than 100%: if all vCPUs run amok, AND qemu uses an additional host cpu for non-vCPU tasks.
(In reply to Dan Kenigsberg from comment #7)
> "jiffies" is a measure of the absolute time spent in kernel mode by the
> relevant qemu process during the measurement interval.
>
> (jiffies / interval) is the time spent per second. Since most of this time
> is used by vCPUs, it is then divided by _ncpus.
>
> I do not know qemu well enough, but think that in theory there could be
> sickly cases where this ends up more than 100%: if all vCPUs run amok, AND
> qemu uses an additional host cpu for non-vCPU tasks.

So what is the unit here? A percentage, capped at 100% in theory?
It's a percentage, where 100% means that N physical CPUs exclusively serve your N vCPUs. I described a theoretical scenario where Vdsm could report more than 100%. (We should stop messing with this, and report absolute values to Engine, which makes much more sense for billing, see bug 1066570)
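The theoretical over-100% case can be put into numbers. The sketch below is illustrative only: `normalized_percent` is a hypothetical helper, and the figures (4 vCPUs, a 10-second interval, 2 extra CPU-seconds of non-vCPU work) are made up for the example.

```python
def normalized_percent(cpu_seconds, interval_seconds, nvcpus):
    """CPU time consumed, as a percentage of nvcpus fully busy CPUs."""
    if interval_seconds <= 0 or nvcpus <= 0:
        raise ValueError("interval_seconds and nvcpus must be positive")
    return 100.0 * cpu_seconds / (interval_seconds * nvcpus)


# All 4 vCPUs busy for the whole 10 s interval: exactly the 100% mark.
vcpus_only = normalized_percent(40.0, 10.0, 4)

# Same load plus 2 CPU-seconds spent on non-vCPU work on another host CPU.
with_overhead = normalized_percent(42.0, 10.0, 4)
```

Here `vcpus_only` is 100.0 and `with_overhead` is 105.0. Overhead of this kind pushes the figure only slightly past 100, not to several hundred percent.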
(In reply to Dan Kenigsberg from comment #9)
> It's percentage, when the 100% is when N physical cpus exclusively serve
> your N vCPUs. I described a theoretical scenario where Vdsm could report
> more than 100%.
>
> (We should stop messing with this, and report absolute values to Engine,
> which makes much more sense for billing, see bug 1066570)

OK, so this is what we wanted: capped at 100%. Bdagan, can you tell us how you got these values, so we can check which part of VDSM's calculation and reporting produces more than 100%? Yaniv
Yaniv, when did I report these values? I don't have any sequence that generates % > 100. However, there is a closed BZ about negative mem: https://bugzilla.redhat.com/show_bug.cgi?id=866186.
(In reply to Barak Dagan from comment #11)
> Yaniv,
> When did I report these values? I don't have any sequence generating % >
> 100.
> However, there is a closed BZ of negative mem -
> https://bugzilla.redhat.com/show_bug.cgi?id=866186.

See attachment. Yaniv
Don't know how to reproduce; that's not my screenshot.
(In reply to Barak Dagan from comment #13)
> Don't know how to reproduce - that's not my screenshot.

OK, you are correct. This came from rhev-tlv; it's the jenkins-ci VM. Dan, can you maybe investigate how this happens there? It's at almost 400%, which is far more than 100%. Yaniv
Average user CPU usage of 398%?! It's unlikely that it's the theoretical over-100% I suggested above, but it could be, if qemu is really buggy. We could hide this odd case (in reports, Engine, or vdsm) but I'd rather have this issue reproduced and dug into. I do not see an obvious bug in Vdsm, and the real problem may lie even deeper (in qemu).
(In reply to Dan Kenigsberg from comment #15)
> Average user cpu usage of %398 ?! It's unlikely that it's the theoretical
> over-100% I suggested above - but it could be - if qemu is really buggy.
>
> We could hide this odd case (in reports, Engine, or vdsm) but I'd rather
> have this issue reproduced and dug into. I do not see an obvious bug in
> Vdsm, and the real problem may lie even deeper (in qemu).

I don't think it's much of an issue to test. You just need to look at this VM during its peak usage time. Can you connect us to someone from qemu? Yaniv
Is this condition easily reproducible? If so, please reproduce, see if `top` reports the same odd values. If it does, it's a qemu bug (and you should bug mst). If top is fine, and `vdsClient -s 0 getAllVmStats` is not - it's a vdsm bug. Please report details such as kernel and qemu versions, and qemu command line.
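For the comparison against `top`, the per-process figure can also be computed by hand from two readings of `/proc/<pid>/stat`. This is a minimal sketch under stated assumptions: the helper names are hypothetical, it relies on the Linux proc(5) layout in which fields 14 and 15 are utime and stime in clock ticks, and it assumes 100 ticks per second unless told otherwise (on a real host, `os.sysconf("SC_CLK_TCK")` gives the actual value).

```python
def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); fields 14 and 15 are at index 11, 12.
    return int(after_comm[11]), int(after_comm[12])


def process_cpu_percent(ticks_before, ticks_after, interval_seconds, clk_tck=100):
    """CPU usage as a percentage of ONE host CPU, not normalized by core count."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    cpu_seconds = (ticks_after - ticks_before) / clk_tck
    return 100.0 * cpu_seconds / interval_seconds
```

A process whose threads keep four host CPUs busy accumulates about 4000 ticks in 10 seconds at 100 ticks per second, so `process_cpu_percent(5300, 9300, 10.0)` gives 400.0. A figure near 400% is therefore not odd in itself for a 4-vCPU guest when the value is not divided by the number of CPUs.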
Shirly, can we get an environment that reproduces this and see which component is responsible for the bug? Please check again in RHEV-TLV.
VDSM calculates values directly from libvirt's getCPUStats() in vdsm/virt/vm.py:

    def _sampleCpu(self):
        cpuStats = self._vm._dom.getCPUStats(True, 0)
        return cpuStats[0]

The first parameter to getCPUStats tells libvirt whether to return per-CPU stats or their total sum. While each CPU is capped (after some calculations) at 100%, the sum is capped at (#CPU*100)% and can report values larger than 100.

The engine code stores these CPU values in vm_statistics.cpu_user and vm_statistics.cpu_sys, which are the values displayed in the attached picture. The engine itself expects that these values can be larger than 100 and holds another column in the DB, vm_statistics.usage_cpu_percent, which it uses to display the CPU% in webadmin and which is calculated the following way (VM.java, 1283):

    Double percent = (getCpuSys() + getCpuUser()) / vm.getNumOfCpus();
    setUsageCpuPercent(percent.intValue());
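The engine-side normalization quoted above can be mirrored in a few lines of Python to see its effect on a reading like the one in the screenshot. This is a sketch, not engine code: `usage_cpu_percent` is a hypothetical name, and the 4-CPU count in the example is an assumption, since the thread does not state how many vCPUs the jenkins-ci VM has.

```python
def usage_cpu_percent(cpu_sys, cpu_user, num_of_cpus):
    """Sum system and user CPU, divide by the CPU count, truncate to int.

    int() truncates toward zero, as Double.intValue() does in the quoted
    Java expression.
    """
    if num_of_cpus <= 0:
        raise ValueError("num_of_cpus must be positive")
    return int((cpu_sys + cpu_user) / num_of_cpus)


# Hypothetical reading: 398% user plus 1.5% system on an assumed 4-CPU VM.
normalized = usage_cpu_percent(1.5, 398.0, 4)
```

With these inputs `normalized` is 99: the raw sum of 399.5 across four CPUs becomes 99.875 per CPU and is truncated.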
Arthur, I fixed the dwh view in the engine db so future values of 'user_cpu_usage_percent' and 'sys_cpu_usage_percent' should be correct according to the number of cpu's of the vm. Do you think we should retroactively update the values of 'user_cpu_usage_percent', 'max_user_cpu_usage_percent', 'system_cpu_usage_percent', 'max_system_cpu_usage_percent' in the history db? Please keep in mind these values need to be updated in 5 tables and it might be a heavy transaction (1 samples table, 2 hourly tables, 2 daily tables).
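The shape of such a retroactive update can be sketched against an in-memory SQLite database. Only the four column names come from the comment above; everything else is an assumption made for illustration: the table names `vm_samples_history` and `vm_configuration`, the `vm_id` and `cpu_count` columns, and the use of SQLite rather than the real history database.

```python
import sqlite3

# Column names as listed in the comment above.
CPU_COLUMNS = (
    "user_cpu_usage_percent",
    "max_user_cpu_usage_percent",
    "system_cpu_usage_percent",
    "max_system_cpu_usage_percent",
)


def normalize_cpu_columns(conn, table, columns=CPU_COLUMNS):
    """Divide each CPU column by the VM's CPU count, in one transaction.

    `table` and `columns` are interpolated into the SQL text, so they must
    be trusted constants, never user input. "vm_configuration", "vm_id" and
    "cpu_count" are hypothetical names.
    """
    cpu_count = (
        "(SELECT cpu_count FROM vm_configuration "
        "WHERE vm_configuration.vm_id = {t}.vm_id)".format(t=table)
    )
    assignments = ", ".join(
        "{c} = {c} / {n}".format(c=col, n=cpu_count) for col in columns
    )
    # Rows without a known CPU count are left untouched rather than set to NULL.
    sql = "UPDATE {t} SET {a} WHERE vm_id IN (SELECT vm_id FROM vm_configuration)"
    with conn:  # commits on success, rolls back on error
        conn.execute(sql.format(t=table, a=assignments))
```

The function would be called once per affected table. It is not idempotent: running it twice divides the values twice, so a real migration would need some way to record which rows have already been converted.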
Following an IRC chat with Shirly: the update should be applied retroactively.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0177.html