Bug 752210

Summary: When MeasurementCollectorRunner logs metric collections is falling behind it should include metric details
Product: [Other] RHQ Project Reporter: Larry O'Leary <loleary>
Component: Plugin ContainerAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0.1CC: ahovsepy, flo_bugzilla, hrupp, jshaughn
Target Milestone: ---   
Target Release: JON 3.1.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 829433 (view as bug list) Environment:
Last Closed: 2013-09-03 11:16:07 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 829433    

Description Larry O'Leary 2011-11-08 15:26:36 EST
When metric collection is falling behind, MeasurementReport org.rhq.core.pc.measurement.MeasurementCollectorRunner.call() logs a message indicating how far it is behind but it does not indicate for which metrics or metric scheduled making it useless for the end-user. 

This message should identify which resource/metrics/schedules are impacted so the user can easily tune the metric collection schedules to address the issue if it is due to overly aggressive collection schedules.
Comment 1 Charles Crouch 2012-03-29 14:45:02 EDT
We should look to see how big a change this would be.
Comment 3 Jay Shaughnessy 2012-06-06 13:27:47 EDT
There does exist already an INFO level log message for a single resource's metric collection if it takes longer than 2 seconds.  This helps identify a problematic resource, or perhaps type.  The Agent log message looks like;

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@b0eed3]) took [45835]ms

It tells you the resource and the amount of time the collection took.  

Now, in debug mode it will append the requested metrics, for example:

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@168c8e7]) took [39203]ms for requests: [ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]]


This is useful for detecting an overall slow metric collection but metric collection can fall behind with or without long individual collections.  It may be that the agent is getting overwhelmed with metric collections, none of which may be individually slow.  If metric collection falls behind such that collection requests are not started within 30s of their scheduled time we already log a debug message like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [49280ms]

Now, in debug mode, we will additionally provide up to the most recent 500 individual metric requests in a subsequent message, like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- The most recent measurement requests prior to this detected delay: 
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:FreeMemory, sched=10049]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:TotalMemory, sched=10051]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:ActiveThreadCount, sched=10047]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]

etc...
Comment 4 Jay Shaughnessy 2012-06-06 14:02:35 EDT
This is fixed in the upstream bug 829433.  Needs cherry picking when branch is available.
Comment 5 Jay Shaughnessy 2012-07-17 14:12:07 EDT
release/jon3.1.x commit: 135ccd4b749bd21e2fdf90f68652da6c6496f8b7

    Add more information in debug mode to help determine what may be holding
    up the agent metric collections.
    
    Cherry pick of Master 54e5a2327ed744e220aadceb9868e8a84e95b322
Comment 6 John Sanda 2012-08-02 14:51:54 EDT
JON 3.1.1 ER1 build is available. Moving to ON_QA.

https://brewweb.devel.redhat.com/buildinfo?buildID=226942
Comment 7 Jay Shaughnessy 2012-08-15 12:19:49 EDT
See upstream bug 829433 for test notes...
Comment 8 Armine Hovsepyan 2012-08-17 03:29:20 EDT
verified!

Thanks Jay.
Comment 9 Heiko W. Rupp 2013-09-03 11:16:07 EDT
Bulk closing of old issues in VERIFIED state.