Bug 783879 - [Platform Plug-in] FileSystemComponent returns invalid diskQueue metric due to calling refresh method of FileSystemInfo twice when gathering metrics
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Plugins
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: JON 3.0.1
Assignee: Charles Crouch
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On: 752981
Blocks: jon310-sprint11, rhq44-sprint11
 
Reported: 2012-01-23 04:47 UTC by Charles Crouch
Modified: 2015-02-01 23:27 UTC
CC: 7 users

Fixed In Version: 4.3
Doc Type: Bug Fix
Doc Text:
Clone Of: 752981
Environment:
Last Closed: 2013-09-03 15:17:52 UTC
Embargoed:


Attachments (Terms of Use)
image of the verification attempt (35.33 KB, image/png)
2012-02-03 19:42 UTC, Mike Foley
Screenshot (97.00 KB, image/png)
2012-02-06 11:29 UTC, Sunil Kondkar

Description Charles Crouch 2012-01-23 04:47:16 UTC
+++ This bug was initially created as a clone of Bug #752981 +++

When org.rhq.plugins.platform.FileSystemComponent.getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) is executed, it first creates an instance of FileSystemInfo on line 73 [1] and then calls the FileSystemInfo refresh method on line 74. The call on line 74 is redundant: the FileSystemInfo constructor already calls refresh when the object is created [2].


Although this is minor, it does have a performance impact when many metrics are enabled, collected frequently, and there are multiple file systems on the monitored platform.

It is not certain which of the two calls should remain, but this should be cleaned up.


[1]: http://git.fedorahosted.org/git?p=rhq/rhq.git;a=blob;f=modules/plugins/platform/src/main/java/org/rhq/plugins/platform/FileSystemComponent.java;h=9d68c239f3bd981cfeee41f6b40e9ca5b28c6b4c;hb=HEAD#l73
[2]: http://git.fedorahosted.org/git?p=rhq/rhq.git;a=blob;f=modules/core/native-system/src/main/java/org/rhq/core/system/FileSystemInfo.java;h=2bde8ee772e6e1cb9af65b75dc5f4713c7c38c52;hb=HEAD#l46
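The redundant call can be illustrated with a minimal, self-contained sketch. The class and method names mirror the RHQ sources, but the bodies here are simplified stubs, not the real implementation:

```java
// Sketch of the redundant refresh; FileSystemInfo below is a stub standing
// in for org.rhq.core.system.FileSystemInfo, which refreshes in its constructor.
public class DoubleRefreshSketch {

    static class FileSystemInfo {
        int refreshCount = 0;

        FileSystemInfo() {
            refresh(); // the real constructor also calls refresh()
        }

        void refresh() {
            refreshCount++; // the real method gathers stats via the native library
        }
    }

    public static void main(String[] args) {
        // Mirrors lines 73-74 of FileSystemComponent.getValues():
        FileSystemInfo fileSystemInfo = new FileSystemInfo(); // refresh #1 (constructor)
        fileSystemInfo.refresh();                             // refresh #2 (redundant)
        System.out.println(fileSystemInfo.refreshCount);      // prints 2
    }
}
```

Dropping either the explicit refresh() call or the refresh inside the constructor brings the count back to one gather per collection.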

--- Additional comment from ian.springer on 2011-11-11 10:43:53 EST ---

Nice find.

Fixed in master - commit 68be8c6:

http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=68be8c6

--- Additional comment from loleary on 2011-11-11 10:54:24 EST ---

Upon further investigation, the duplicate executions actually result in invalid data coming back from the native library. The statistics that are gathered are time-based with a time resolution of 1 second, and some of the values are calculated from the last "gather" time. If the refresh method is called twice with a second or less between calls, the disk stats are calculated as:

 0.0 / 0 = NaN
 > 0.0 / 0 = Infinity

Which of the two you get depends on the execution time.
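In Java double arithmetic this is exactly what happens when the elapsed interval between two gathers truncates to zero. A minimal sketch (illustrative only, not RHQ code):

```java
// Sketch of the divide-by-zero behavior described above: when the elapsed
// time between two stat gathers truncates to 0 seconds, the rate calculation
// degenerates to NaN or Infinity depending on the I/O delta.
public class DiskQueueMathSketch {
    public static void main(String[] args) {
        double elapsedSeconds = 0.0; // two refreshes within the 1-second resolution

        double noDelta = 0.0 / elapsedSeconds;   // 0.0 / 0 = NaN
        double someDelta = 5.0 / elapsedSeconds; // > 0.0 / 0 = Infinity

        System.out.println(noDelta);   // prints NaN
        System.out.println(someDelta); // prints Infinity
    }
}
```

Neither NaN nor Infinity is a valid measurement value, which is why the metric shows up blank in the UI.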

Disk Queue test results (delay in ms, out of 100 tests each):

 Delay   NaN results   Infinity results
 0       100           0
 1       100           0
 5       68            32
 10      70            29
 50      64            31
 100     51            39
 500     34            16
 1000    0             0
 1001    0             0
 1010    0             0
 1050    0             0
 1100    0             0
 1500    0             0
 2000    0             0

--- Additional comment from loleary on 2011-11-14 14:25:39 EST ---

Committed to release-3.0.1 as 96b03ecc9c77bd80e72792fd996b3a0ad6592229 - http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=96b03ecc9c77bd80e72792fd996b3a0ad6592229

--- Additional comment from lkrejci on 2011-12-21 15:41:03 EST ---

Larry, would you be able to provide repro steps for this issue?

--- Additional comment from loleary on 2011-12-21 23:43:19 EST ---

1. From a Linux Platform resource, expand and select File System -> /boot
2. Select the Monitor > Tables sub-tab
3. The Last value for Disk Queue should contain a number

Actual Result:
<no value / i.e. it is blank>

Expected Result:
0 or a positive number

Please note that by default this metric is collected once every 20 minutes. So, if you are testing a new installation, it is best to drop the collection schedule for this metric to 1 minute so you do not have to wait as long.

Comment 1 Charles Crouch 2012-01-23 04:49:36 UTC
This bug needs to get ported to the JON3.0.1 branch of RHQ (release/jon3.0.x), it should be tested in that branch by engineering and then pushed to ON-QA

Comment 2 Charles Crouch 2012-01-24 17:40:50 UTC
A fix for this issue went into JON2.4.2

Comment 4 Charles Crouch 2012-01-31 04:38:53 UTC
Switching to using MODIFIED for fixes that are in the appropriate branch but are waiting to get into a build.

Comment 5 Simeon Pinder 2012-02-03 15:03:04 UTC
Moving this to ON_QA as there is now a binary available to test with:
https://brewweb.devel.redhat.com//buildinfo?buildID=197202

Comment 6 Mike Foley 2012-02-03 19:42:11 UTC
Created attachment 559352 [details]
image of the verification attempt

specifically ... i do not see step #3 in the verification steps.  i do not see "disc queue"

1. From a Linux Platform resource, expand and select File System -> /boot
2. Select the Monitor > Tables sub-tab
3. The Last value for Disk Queue should contain a number

Comment 7 Mike Foley 2012-02-03 19:43:42 UTC
marking failed to verify.  specifically ... step #3 ... i do not see disk queue.  i have attached an image as evidence of my observation.

Comment 8 Ian Springer 2012-02-03 19:54:08 UTC
Mike-

The Disk Queue metric is not enabled by default, so you'll need to go to the Monitor>Schedules subtab and enable it. I'd also set its collection interval to a low value, such as 30s, to make testing easier.

Here's the relevant line from the platform plugin's rhq-plugin.xml:

      <metric property="fileSystemUsage.diskQueue" displayName="Disk Queue" 
              description="The number of I/Os currently in progress"/>

If a metric element doesn't have defaultOn="true" and/or displayType="summary", then it is not enabled by default.
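For comparison, a hypothetical version of that descriptor element with the metric enabled by default might look like this (the defaultOn and displayType attributes are the ones named above; the combination shown here is illustrative, not the shipped descriptor):

```xml
<metric property="fileSystemUsage.diskQueue" displayName="Disk Queue"
        defaultOn="true" displayType="summary"
        description="The number of I/Os currently in progress"/>
```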

Comment 9 Sunil Kondkar 2012-02-06 11:28:35 UTC
Verified on 3.0.1.GA RC2 build (Build Number: b2cb23b:859b914)

Enabled the Disk Queue metric and verified that it displays a number in the 'Last' value column. Please refer to the attached screenshot.

Comment 10 Sunil Kondkar 2012-02-06 11:29:20 UTC
Created attachment 559617 [details]
Screenshot

Comment 11 Mike Foley 2012-02-07 19:30:10 UTC
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE

Comment 12 Mike Foley 2012-02-07 19:30:32 UTC
marking VERIFIED BZs to CLOSED/CURRENTRELEASE

Comment 13 Charles Crouch 2012-02-13 02:55:37 UTC
Somehow this issue, which is for the fix to go into JON3.0.1, also had a Target Release of JON2.4.2, which meant it erroneously got set to CLOSED:CURRENTRELEASE. I removed the incorrect Target Release and set it back to VERIFIED; once JON3.0.1 is out it can go back to CLOSED:CURRENTRELEASE.

Comment 15 Heiko W. Rupp 2013-09-03 15:17:52 UTC
Bulk closing of old issues in VERIFIED state.

