Bug 535798 (RHQ-2457) - Per CPU metric collection issues for multi-cpu platforms
Summary: Per CPU metric collection issues for multi-cpu platforms
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: RHQ-2457
Product: RHQ Project
Classification: Other
Component: Plugins
Version: 1.3
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jay Shaughnessy
QA Contact: Corey Welton
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks: JON231
Reported: 2009-10-06 16:33 UTC by Jay Shaughnessy
Modified: 2018-10-27 16:16 UTC (History)

Fixed In Version: 2.4
Clone Of:
Environment:
All platforms Customer case 348308
Last Closed: 2010-08-12 16:57:07 UTC
Embargoed:



Description Jay Shaughnessy 2009-10-06 16:33:00 UTC
In version 2.3 we moved to a SigarProxy singleton for interacting with Sigar. This exacerbates an issue we have in PerCpu metric collection. The fundamental issue is that Sigar has coarse-grained CPU metric collection, gathering all metrics for all CPUs at the same time. Our metric model is fine-grained: we want to be able to collect an individual metric for a specific resource at a resource-metric-specific schedule interval.

The problem here is that for multiple CPUs (by default, the same metrics are enabled with the same schedules) we end up making sequential calls to Sigar's getCpuPercList(). Sigar's implementation is not wrong, but it's not conducive to our approach. Sigar computes the CPU usage percentages from the difference between its previous and current CPU info gathering. For the CPU-0 component things work well: the interval is (at least with the default settings) what was set in the schedule. But for CPU-1..N the CPU cycle diff may be 0, meaning that the two CPU gathers performed by Sigar are identical because they are performed at practically identical times. This results in divide-by-zero NaN values for the percentage metrics. Even if NaN values are not returned, the sample size (the CPU cycle diff) will be tiny and the values are probably not useful. The expectation is that the values reported for all CPUs represent the usage over the interval defined for the metric.
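To illustrate the failure mode, here is a minimal sketch in plain Java. It does not call Sigar; the class and method names are hypothetical, and the counter values are made up for the example. It only shows that a percentage computed from two identical raw samples comes out as 0.0 / 0.0, which is NaN in Java floating-point arithmetic.

```java
// Hypothetical illustration only; these names are not Sigar or RHQ APIs.
final class CpuDeltaDemo {

    /** Fraction of elapsed CPU ticks spent in user mode between two raw samples. */
    static double userFraction(long prevUser, long prevTotal, long curUser, long curTotal) {
        double userDiff = curUser - prevUser;
        double totalDiff = curTotal - prevTotal;
        // With identical samples both diffs are 0.0, and 0.0 / 0.0 is NaN.
        return userDiff / totalDiff;
    }

    public static void main(String[] args) {
        // Samples taken one schedule interval apart: a meaningful delta.
        System.out.println(userFraction(1000, 10000, 1300, 11000));
        // Samples taken back to back: the counters have not moved.
        System.out.println(userFraction(1300, 11000, 1300, 11000));
    }
}
```

The second call models what CPU-1..N would see when the shared baseline was refreshed an instant earlier by the call made for CPU-0.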

Prior to 2.3 each CpuComponent accessed its own Sigar instance. This kept the issue from reaching the level of severity that we are seeing now, although it may be the case that the metrics still did not report on the expected intervals.

Comment 1 Jay Shaughnessy 2009-10-06 16:36:03 UTC
The proposed solution is to not use Sigar's getCpuPercList call. Instead, call getCpuList and cache the gathered raw CPU results as necessary in CpuInformation (per CpuComponent). Then use the cached raw values to perform our own percentage calculations.
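A rough sketch of that idea, in plain Java without the Sigar dependency, might look like the following. The class PerCpuUsage and its method are hypothetical and are not the actual CpuInformation code. The raw user and total counters are assumed to be cumulative values of the kind getCpuList provides for each CPU. Each CPU keeps its own cached baseline, so its percentage covers the time since that CPU was last sampled rather than since the last call made for any CPU.

```java
// Hypothetical sketch; not the actual RHQ CpuInformation implementation.
final class PerCpuUsage {
    // Cached raw counters from the previous collection for this one CPU.
    private long prevUser = -1;
    private long prevTotal = -1;

    /**
     * Records the latest raw counters and returns the user-mode fraction
     * since the previous call, or NaN when there is no usable baseline yet.
     */
    double userFraction(long user, long total) {
        double result = Double.NaN;
        if (prevTotal >= 0 && total > prevTotal) {
            result = (double) (user - prevUser) / (double) (total - prevTotal);
        }
        prevUser = user;
        prevTotal = total;
        return result;
    }

    public static void main(String[] args) {
        PerCpuUsage cpu0 = new PerCpuUsage();
        PerCpuUsage cpu1 = new PerCpuUsage();

        // First collection only establishes the baselines (made-up counters).
        cpu0.userFraction(1000, 10000);
        cpu1.userFraction(500, 5000);

        // One interval later, each CPU is compared against its own baseline.
        System.out.println("cpu0 user: " + cpu0.userFraction(1300, 11000));
        System.out.println("cpu1 user: " + cpu1.userFraction(700, 6000));
    }
}
```

Returning NaN for the first sample is one possible choice for this sketch; a caller could equally skip reporting until a baseline exists.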

Comment 2 Red Hat Bugzilla 2009-11-10 21:04:48 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2457


Comment 3 Jay Shaughnessy 2009-11-13 16:42:57 UTC
This is fixed in the way suggested above. It's in the 1.3.1 branch as well as in (git) head.

Comment 4 Corey Welton 2010-01-26 16:10:12 UTC
Repro steps?

Comment 5 Jay Shaughnessy 2010-01-26 16:38:13 UTC
To see this in 2.3 just import any multi-cpu platform (JON treats a core as a CPU, so multi-core is the same as multi-cpu) and let it collect metrics for a while. When you go to look at the metrics for the individual CPU resources you should notice that all (or possibly all but one) look very flaky: missing values, bad values, etc.

After the fix the values should look much better.

Comment 6 Corey Welton 2010-01-27 20:16:33 UTC
QA Verified.  Values seem to be pretty good, now.

Comment 7 Corey Welton 2010-08-12 16:57:07 UTC
Mass-closure of verified bugs against JON.

