When metric collection is falling behind, MeasurementReport org.rhq.core.pc.measurement.MeasurementCollectorRunner.call() logs a message indicating how far it is behind but it does not indicate for which metrics or metric scheduled making it useless for the end-user. This message should identify which resource/metrics/schedules are impacted so the user can easily tune the metric collection schedules to address the issue if it is due to overly aggressive collection schedules.
We should look to see how big a change this would be.
There does exist already an INFO level log message for a single resource's metric collection if it takes longer than 2 seconds. This helps identify a problematic resource, or perhaps type. The Agent log message looks like; INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@b0eed3]) took [45835]ms It tells you the resource and the amount of time the collection took. Now, in debug mode it will append the requested metrics, for example: INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@168c8e7]) took [39203]ms for requests: [ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]] This is useful for detecting an overall slow metric collection but metric collection can fall behind with or without long individual collections. It may be that the agent is getting overwhelmed with metric collections, none of which may be individually slow. If metric collection falls behind such that collection requests are not started within 30s of their scheduled time we already log a debug message like this: DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [49280ms] Now, in debug mode, we will additionally provide up to the most recent 500 individual metric requests in a subsequent message, like this: DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- The most recent measurement requests prior to this detected delay: Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019] Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018] Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017] Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:FreeMemory, sched=10049] Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:TotalMemory, sched=10051] Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:ActiveThreadCount, sched=10047] Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019] etc...
This is fixed in the upstream bug 829433. Needs cherry picking when branch is available.
release/jon3.1.x commit: 135ccd4b749bd21e2fdf90f68652da6c6496f8b7 Add more information in debug mode to help determine what may be holding up the agent metric collections. Cherry pick of Master 54e5a2327ed744e220aadceb9868e8a84e95b322
JON 3.1.1 ER1 build is available. Moving to ON_QA. https://brewweb.devel.redhat.com/buildinfo?buildID=226942
See upstream bug 829433 for test notes...
verified! Thanks Jay.
Bulk closing of old issues in VERIFIED state.