Bug 752210 - When MeasurementCollectorRunner logs metric collections is falling behind it should include metric details
When MeasurementCollectorRunner logs metric collections is falling behind it ...
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: Plugin Container (Show other bugs)
3.0.1
Unspecified Unspecified
medium Severity high (vote)
: ---
: JON 3.1.1
Assigned To: Jay Shaughnessy
Mike Foley
:
Depends On:
Blocks: 829433
  Show dependency treegraph
 
Reported: 2011-11-08 15:26 EST by Larry O'Leary
Modified: 2013-09-03 11:16 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 829433 (view as bug list)
Environment:
Last Closed: 2013-09-03 11:16:07 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Larry O'Leary 2011-11-08 15:26:36 EST
When metric collection is falling behind, MeasurementReport org.rhq.core.pc.measurement.MeasurementCollectorRunner.call() logs a message indicating how far it is behind but it does not indicate for which metrics or metric scheduled making it useless for the end-user. 

This message should identify which resource/metrics/schedules are impacted so the user can easily tune the metric collection schedules to address the issue if it is due to overly aggressive collection schedules.
Comment 1 Charles Crouch 2012-03-29 14:45:02 EDT
We should look to see how big a change this would be.
Comment 3 Jay Shaughnessy 2012-06-06 13:27:47 EDT
There does exist already an INFO level log message for a single resource's metric collection if it takes longer than 2 seconds.  This helps identify a problematic resource, or perhaps type.  The Agent log message looks like;

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@b0eed3]) took [45835]ms

It tells you the resource and the amount of time the collection took.  

Now, in debug mode it will append the requested metrics, for example:

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@168c8e7]) took [39203]ms for requests: [ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]]


This is useful for detecting an overall slow metric collection but metric collection can fall behind with or without long individual collections.  It may be that the agent is getting overwhelmed with metric collections, none of which may be individually slow.  If metric collection falls behind such that collection requests are not started within 30s of their scheduled time we already log a debug message like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [49280ms]

Now, in debug mode, we will additionally provide up to the most recent 500 individual metric requests in a subsequent message, like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- The most recent measurement requests prior to this detected delay: 
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:FreeMemory, sched=10049]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:TotalMemory, sched=10051]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:ActiveThreadCount, sched=10047]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]

etc...
Comment 4 Jay Shaughnessy 2012-06-06 14:02:35 EDT
This is fixed in the upstream bug 829433.  Needs cherry picking when branch is available.
Comment 5 Jay Shaughnessy 2012-07-17 14:12:07 EDT
release/jon3.1.x commit: 135ccd4b749bd21e2fdf90f68652da6c6496f8b7

    Add more information in debug mode to help determine what may be holding
    up the agent metric collections.
    
    Cherry pick of Master 54e5a2327ed744e220aadceb9868e8a84e95b322
Comment 6 John Sanda 2012-08-02 14:51:54 EDT
JON 3.1.1 ER1 build is available. Moving to ON_QA.

https://brewweb.devel.redhat.com/buildinfo?buildID=226942
Comment 7 Jay Shaughnessy 2012-08-15 12:19:49 EDT
See upstream bug 829433 for test notes...
Comment 8 Armine Hovsepyan 2012-08-17 03:29:20 EDT
verified!

Thanks Jay.
Comment 9 Heiko W. Rupp 2013-09-03 11:16:07 EDT
Bulk closing of old issues in VERIFIED state.

Note You need to log in before you can comment on or make changes to this bug.