Bug 999973

Summary: get_ksmd_cpu_usage returns incorrect results
Product: Red Hat Enterprise Virtualization Manager Reporter: Doron Fediuck <dfediuck>
Component: mom Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA QA Contact: Lukas Svaty <lsvaty>
Severity: unspecified Docs Contact: Cheryn Tan <chetan>
Priority: unspecified    
Version: 3.2.0 CC: acathrow, dfediuck, iheim, jtt77777, pstehlik, rlandman, yeylon
Target Milestone: ---   
Target Release: 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sla
Fixed In Version: mom-0.3.2-6.el6ev Doc Type: Bug Fix
Doc Text:
Previously, the HostKSM collector was meant to calculate the number of jiffies used by ksmd since the last collection period. However, the stored reference count was never updated, so the reported value accumulated indefinitely, which could lead to failure of ovirt-engine-dwhd. This issue is fixed by recording the current jiffy count after each collection, so get_ksmd_cpu_usage now returns the correct results.
Story Points: ---
Clone Of: 999615 Environment:
Last Closed: 2014-01-21 15:06:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 999615    
Bug Blocks:    

Description Doron Fediuck 2013-08-22 13:29:47 UTC
Duplicating for RHEV 3.3

+++ This bug was initially created as a clone of Bug #999615 +++

Description of problem:
When mom is used on an oVirt-managed host and runs KSM because of mom policy, vdsClient getVdsStats returns an ever-growing value for ksmCpu. The value is supposed to be a CPU percentage, and it eventually breaks ovirt-engine-dwhd when it becomes larger than an int can hold.

Version-Release number of selected component (if applicable):
ovirt engine 3.2.2 - 1.1.fc18
vdsm 4.10.3 - 17.fc18
mom  0.3.0 - 1.fc18

How reproducible:

Run an oVirt-managed host under memory pressure so that KSM starts. Run vdsClient -s 0 getVdsStats repeatedly and watch the ksmCpu value increase over time.


Steps to Reproduce:
1.
2.
3.

Actual results:
ksmCpu should be a percentage of CPU, but it grows without bound.

Expected results:


Additional info:

It looks like a bug in mom's HostKSM.py: last_jiff, which is used to calculate the difference in jiffies, is never reset.
A change to set last_jiff to cur_jiff in get_ksmd_cpu_usage fixes it for me:

[root@vm7 Collectors]# diff -C 4 HostKSM.py~ HostKSM.py
*** HostKSM.py~ 2012-10-05 13:37:16.000000000 -0400
--- HostKSM.py  2013-08-21 13:09:49.782064019 -0400
***************
*** 71,78 ****
--- 71,79 ----
          # wrap-around into account.
          interval_jiffs = (cur_jiff - self.last_jiff) % 2**32
          total_jiffs = os.sysconf('SC_CLK_TCK') * self.interval
          # Calculate percentage of total jiffies during this interval.
+         self.last_jiff = cur_jiff
          return 100 * interval_jiffs / total_jiffs

      def get_shareable_mem(self):
          """

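To illustrate the effect of the one-line change, here is a minimal sketch rather than mom's actual code. The class KsmCpuSketch, the reset_last_jiff flag, the fixed CLK_TCK value and the sample readings are all hypothetical and used only for illustration. Integer division (//) is used so the sketch behaves the same on Python 3 as the original "/" did on Python 2.

```python
CLK_TCK = 100  # assumed ticks per second; the collector reads os.sysconf('SC_CLK_TCK')


class KsmCpuSketch(object):
    """Hypothetical stand-in for the collector's ksmd CPU accounting."""

    def __init__(self, interval, reset_last_jiff):
        self.interval = interval                # seconds between collections
        self.last_jiff = 0                      # jiffy count at the previous collection
        self.reset_last_jiff = reset_last_jiff  # True models the proposed fix

    def get_ksmd_cpu_usage(self, cur_jiff):
        # Jiffies used since the last collection, allowing for wrap-around.
        interval_jiffs = (cur_jiff - self.last_jiff) % 2**32
        total_jiffs = CLK_TCK * self.interval
        if self.reset_last_jiff:
            # The proposed fix: remember the current count for the next call.
            self.last_jiff = cur_jiff
        # Percentage of the total jiffies available during this interval.
        return 100 * interval_jiffs // total_jiffs


# Cumulative ksmd jiffies at four collections, 10 seconds apart.
# ksmd uses 100 jiffies per interval, i.e. a steady 10% of one CPU.
readings = [100, 200, 300, 400]

without_fix = KsmCpuSketch(interval=10, reset_last_jiff=False)
with_fix = KsmCpuSketch(interval=10, reset_last_jiff=True)

buggy = [without_fix.get_ksmd_cpu_usage(j) for j in readings]
fixed = [with_fix.get_ksmd_cpu_usage(j) for j in readings]

print("without fix:", buggy)
print("with fix:   ", fixed)
```

Under these assumptions, the value without the reset keeps climbing with every collection although ksmd's load is constant, while the value with the reset stays at the same percentage.
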
--- Additional comment from Adam Litke on 2013-08-22 16:11:32 IDT ---

Thanks for the detailed report.  I agree with your assessment.  Please see http://gerrit.ovirt.org/#/c/18420/ for the suggested fix.

Comment 4 errata-xmlrpc 2014-01-21 15:06:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0064.html