| Summary: | When MeasurementCollectorRunner logs metric collections is falling behind it should include metric details | |||
|---|---|---|---|---|
| Product: | [Other] RHQ Project | Reporter: | Larry O'Leary <loleary> | |
| Component: | Plugin Container | Assignee: | Jay Shaughnessy <jshaughn> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
| Severity: | high | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 3.0.1 | CC: | ahovsepy, flo_bugzilla, hrupp, jshaughn | |
| Target Milestone: | --- | |||
| Target Release: | JON 3.1.1 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 829433 (view as bug list) | Environment: | ||
| Last Closed: | 2013-09-03 15:16:07 UTC | Type: | --- | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 829433 | |||
|
Description
Larry O'Leary
2011-11-08 20:26:36 UTC
We should look to see how big a change this would be.
There does exist already an INFO level log message for a single resource's metric collection if it takes longer than 2 seconds. This helps identify a problematic resource, or perhaps type. The Agent log message looks like;
INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@b0eed3]) took [45835]ms
It tells you the resource and the amount of time the collection took.
Now, in debug mode it will append the requested metrics, for example:
INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- [PERF] Collection of measurements for [Resource[id=10002, uuid=0889a335-a2f4-46dd-92ce-684a9d01c382, type={RHQAgent}RHQ Agent, key=jshaughn RHQ Agent, name=RHQ Agent, parent=jshaughn, version=4.5.0-SNAPSHOT]] (component=[org.rhq.plugins.agent.AgentServerComponent@168c8e7]) took [39203]ms for requests: [ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018], ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]]
This is useful for detecting an overall slow metric collection but metric collection can fall behind with or without long individual collections. It may be that the agent is getting overwhelmed with metric collections, none of which may be individually slow. If metric collection falls behind such that collection requests are not started within 30s of their scheduled time we already log a debug message like this:
DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [49280ms]
Now, in debug mode, we will additionally provide up to the most recent 500 individual metric requests in a subsequent message, like this:
DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- The most recent measurement requests prior to this detected delay:
Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]
Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeReceived, sched=10018]
Wed Jun 06 13:19:00 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AgentServerClockDifference, sched=10017]
Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:FreeMemory, sched=10049]
Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:TotalMemory, sched=10051]
Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10003, name=jboss.system:type=ServerInfo:ActiveThreadCount, sched=10047]
Wed Jun 06 13:18:25 EDT 2012 ScheduledMeasurementInfo[res=10002, name=AverageExecutionTimeSent, sched=10019]
etc...
This is fixed in the upstream bug 829433. Needs cherry picking when branch is available.
release/jon3.1.x commit: 135ccd4b749bd21e2fdf90f68652da6c6496f8b7
Add more information in debug mode to help determine what may be holding
up the agent metric collections.
Cherry pick of Master 54e5a2327ed744e220aadceb9868e8a84e95b322
JON 3.1.1 ER1 build is available. Moving to ON_QA. https://brewweb.devel.redhat.com/buildinfo?buildID=226942 See upstream bug 829433 for test notes... verified! Thanks Jay. Bulk closing of old issues in VERIFIED state. |