752210 – When MeasurementCollectorRunner logs metric collections is falling behind it should include metric details

Bug 752210 - When MeasurementCollectorRunner logs metric collections is falling behind it should include metric details

Summary: When MeasurementCollectorRunner logs metric collections is falling behind it ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Plugin Container
Sub Component:
Version:	3.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	JON 3.1.1
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	829433
TreeView+	depends on / blocked

Reported:	2011-11-08 20:26 UTC by Larry O'Leary
Modified:	2018-11-26 18:22 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	829433 (view as bug list)
Environment:
Last Closed:	2013-09-03 15:16:07 UTC
Embargoed:

Attachments	(Terms of Use)

Description Larry O'Leary 2011-11-08 20:26:36 UTC

When metric collection is falling behind, MeasurementReport org.rhq.core.pc.measurement.MeasurementCollectorRunner.call() logs a message indicating how far it is behind but it does not indicate for which metrics or metric scheduled making it useless for the end-user. 

This message should identify which resource/metrics/schedules are impacted so the user can easily tune the metric collection schedules to address the issue if it is due to overly aggressive collection schedules.

Comment 1 Charles Crouch 2012-03-29 18:45:02 UTC

We should look to see how big a change this would be.

Comment 3 Jay Shaughnessy 2012-06-06 17:27:47 UTC

There does exist already an INFO level log message for a single resource's metric collection if it takes longer than 2 seconds.  This helps identify a problematic resource, or perhaps type.  The Agent log message looks like;

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@b0eed3]) took [45835]ms

It tells you the resource and the amount of time the collection took.  

Now, in debug mode it will append the requested metrics, for example:

INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@168c8e7]) took [39203]ms for requests: [ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]]


This is useful for detecting an overall slow metric collection but metric collection can fall behind with or without long individual collections.  It may be that the agent is getting overwhelmed with metric collections, none of which may be individually slow.  If metric collection falls behind such that collection requests are not started within 30s of their scheduled time we already log a debug message like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [49280ms]

Now, in debug mode, we will additionally provide up to the most recent 500 individual metric requests in a subsequent message, like this:

DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- The most recent measurement requests prior to this detected delay: 
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018]
  Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:FreeMemory, sched=10049]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:TotalMemory, sched=10051]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:ActiveThreadCount, sched=10047]
  Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]

etc...

Comment 4 Jay Shaughnessy 2012-06-06 18:02:35 UTC

This is fixed in the upstream bug 829433.  Needs cherry picking when branch is available.

Comment 5 Jay Shaughnessy 2012-07-17 18:12:07 UTC

release/jon3.1.x commit: 135ccd4b749bd21e2fdf90f68652da6c6496f8b7

    Add more information in debug mode to help determine what may be holding
    up the agent metric collections.
    
    Cherry pick of Master 54e5a2327ed744e220aadceb9868e8a84e95b322

Comment 6 John Sanda 2012-08-02 18:51:54 UTC

JON 3.1.1 ER1 build is available. Moving to ON_QA.

https://brewweb.devel.redhat.com/buildinfo?buildID=226942

Comment 7 Jay Shaughnessy 2012-08-15 16:19:49 UTC

See upstream bug 829433 for test notes...

Comment 8 Armine Hovsepyan 2012-08-17 07:29:20 UTC

verified!

Thanks Jay.

Comment 9 Heiko W. Rupp 2013-09-03 15:16:07 UTC

Bulk closing of old issues in VERIFIED state.

Note You need to log in before you can comment on or make changes to this bug.