Bug 999615

Summary: get_ksmd_cpu_usage returns incorrect results
Product: [Retired] oVirt
Reporter: John Taylor <jtt77777>
Component: mom
Assignee: Adam Litke <alitke>
Status: CLOSED CURRENTRELEASE
QA Contact: Lukas Svaty <lsvaty>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.2
CC: dfediuck, lsvaty, yeylon
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 999973 (view as bug list)
Environment:
Last Closed: 2013-09-23 12:14:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 999973

Description John Taylor 2013-08-21 17:35:52 UTC
Description of problem:
When mom runs on an oVirt-managed host and starts KSM because of mom policy, vdsClient getVdsStats returns an ever-growing value for ksmCpu, which is supposed to be a CPU percentage. The value eventually breaks ovirt-engine-dwhd because it becomes greater than an int.

Version-Release number of selected component (if applicable):
ovirt engine 3.2.2 - 1.1.fc18
vdsm 4.10.3 - 17.fc18
mom  0.3.0 - 1.fc18

How reproducible:

Run an oVirt-managed host under enough memory pressure to start KSM, then run vdsClient -s 0 getVdsStats repeatedly and watch the ksmCpu value increase over time.


Steps to Reproduce:
1.
2.
3.

Actual results:
ksmCpu should be a percentage of CPU, but it grows without bound.

Expected results:


Additional info:

It looks like a bug in mom's HostKSM.py: last_jiff, which is used to calculate the difference in jiffies, is never reset, so the difference keeps growing instead of covering only the last interval.
Setting last_jiff to cur_jiff in get_ksmd_cpu_usage fixes it for me:

[root@vm7 Collectors]# diff -C 4 HostKSM.py~ HostKSM.py
*** HostKSM.py~ 2012-10-05 13:37:16.000000000 -0400
--- HostKSM.py  2013-08-21 13:09:49.782064019 -0400
***************
*** 71,78 ****
--- 71,79 ----
          # wrap-around into account.
          interval_jiffs = (cur_jiff - self.last_jiff) % 2**32
          total_jiffs = os.sysconf('SC_CLK_TCK') * self.interval
          # Calculate percentage of total jiffies during this interval.
+         self.last_jiff = cur_jiff
          return 100 * interval_jiffs / total_jiffs

      def get_shareable_mem(self):
          """

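The effect of the one-line change above can be sketched in isolation. This is a minimal illustration, not the actual collector code: the class name KsmdCpuTracker, the reset_baseline switch, the simulate helper, and the fixed clk_tck of 100 (standing in for os.sysconf('SC_CLK_TCK')) are all assumptions made for the example. Only the three arithmetic lines mirror the diff.

```python
class KsmdCpuTracker:
    """Hypothetical stand-in for the ksmd CPU calculation in HostKSM.py."""

    def __init__(self, interval, clk_tck=100, reset_baseline=True):
        self.interval = interval              # seconds between samples
        self.clk_tck = clk_tck                # assumed jiffies per second
        self.reset_baseline = reset_baseline  # False reproduces the reported bug
        self.last_jiff = 0

    def cpu_usage(self, cur_jiff):
        # Modulo keeps the difference correct across a 32-bit counter wrap.
        interval_jiffs = (cur_jiff - self.last_jiff) % 2**32
        total_jiffs = self.clk_tck * self.interval
        if self.reset_baseline:
            # The fix: measure the next sample against this one.
            self.last_jiff = cur_jiff
        return 100 * interval_jiffs / total_jiffs


def simulate(reset_baseline, samples=5, jiffs_per_interval=50):
    """Feed a steadily increasing jiffy counter and collect the readings."""
    tracker = KsmdCpuTracker(interval=10, reset_baseline=reset_baseline)
    readings = []
    cur_jiff = 0
    for _ in range(samples):
        cur_jiff += jiffs_per_interval  # ksmd uses 50 jiffies per 10 s interval
        readings.append(tracker.cpu_usage(cur_jiff))
    return readings


if __name__ == "__main__":
    print(simulate(reset_baseline=False))  # [5.0, 10.0, 15.0, 20.0, 25.0]
    print(simulate(reset_baseline=True))   # [5.0, 5.0, 5.0, 5.0, 5.0]
```

With a constant load of 50 jiffies per 1000-jiffy interval, the version without the reset reports a value that climbs by 5 on every sample, while the version with the reset stays at 5 percent, which matches the behavior described in this report.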
Comment 1 Adam Litke 2013-08-22 13:11:32 UTC
Thanks for the detailed report.  I agree with your assessment.  Please see http://gerrit.ovirt.org/#/c/18420/ for the suggested fix.

Comment 2 Adam Litke 2013-08-23 13:29:50 UTC
I have built new packages with this fix incorporated.  Can someone confirm that it fixes the problem in the original environment?

https://koji.fedoraproject.org/koji/packageinfo?packageID=12742

mom-0.3.2-5

Comment 3 Lukas Svaty 2013-08-30 18:32:23 UTC
Fixed in mom-0.3.2; see BZ#999973 for 3.3.
Can I set this to VERIFIED, or should I test in 3.2 too?

Comment 4 Lukas Svaty 2013-09-05 10:42:44 UTC
Moving to VERIFIED, as this was fixed and tested on mom-0.3.2.

Comment 5 Itamar Heim 2013-09-23 12:14:48 UTC
Bulk closing, assuming verified bugs are in 3.3.