Bug 786889 - Agent plug-in container should log warnings and disable metric collection for metrics with invalid collection intervals
Agent plug-in container should log warnings and disable metric collection for...
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Monitoring -- Other (Show other bugs)
JON 3.0.1
All All
unspecified Severity medium
: ER01
: JON 3.2.0
Assigned To: Jay Shaughnessy
Mike Foley
:
Depends On: 865160
Blocks: 816211
  Show dependency treegraph
 
Reported: 2012-02-02 11:52 EST by Larry O'Leary
Modified: 2014-01-02 15:43 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 816211 865160 (view as bug list)
Environment:
Last Closed: 2014-01-02 15:43:09 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 69731 None None None 2012-10-03 15:27:55 EDT

  None (edit)
Description Larry O'Leary 2012-02-02 11:52:42 EST
If a metric collection interval is set to 0 seconds, the result is out of memory conditions for the agent and no metrics being successfully collected.

Version-Release number of selected component (if applicable):
JON 3.0.0
Comment 1 Larry O'Leary 2012-02-02 14:37:48 EST
Steps to reproduce:
#. Install server and agent
#. Start server and agent
#. Import platform, agent, and server into inventory
#. Navigate to the RHQ Server resource and find its resource ID
#. Obtain an inventory report from the agent for the RHQ Server resource
   inventory --id <resource_id of RHQ server> --xml --export /tmp/inventory.xml
#. In the report, locate the schedule with name jboss.system:type=ServerInfo:ActiveThreadCount and note its schedule-id
#. Select the RHQ Server's Monitoring tab
#. Select the Schedules sub-tab
#. Select the Active Thread Count metric and set its collection interval to 1 minute
#. Restart the server and agent to force the new collection interval to take affect
#. Confirm the Active Thread Count metric is being collected every minute
#. Navigate to the Test SQL page: http://localhost:7080/coregui/#Test/ServerAccess/SQL
#. Execute the following SQL:
   UPDATE rhq_measurement_sched SET coll_interval=0 WHERE id=<schedule-id-from-above>;
   Number of Rows Affected should be 1
#. Restart server and agent (agent should be started in debug mode)


Actual Result:
Agent log is filled with hundreds of thousands of: DEBUG [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Measurement collection is falling behind... Missed requested time by [211309ms]
No metrics are being collected for any of the scheduled metrics

Expected Result:
No Measurement collection is falling behind... Missed requested time by [211309ms] log messages in agent debug log
ActiveThreadCount metric should not be collected and a warning should appear in agent log stating that the collection interval is invalid so <resource id> metric <metric id> will not be collected
Other metrics continue to be collected as per their schedule.
Comment 2 Charles Crouch 2012-02-06 11:38:55 EST
The agent should support 0 or -ve collection intervals.
Comment 3 Charles Crouch 2012-02-06 11:40:55 EST
We need to spend sometime investigating how this could have gotten set to 0, without going into the DB
Comment 4 Larry O'Leary 2012-02-06 11:46:48 EST
I attempted to reproduce the issue which would have caused 0 without success. Additional information from the user who reported this seems to point to a possible inventory sync issue between server and agent. The reason I say this is the user reported the collection interval seemed to be correct from the server side but the export of inventory from the agent revealed the 0 interval. The user then executed the inventory sync command from the agent prompt and the agent side inventory (export agent-side inventory report) shows the schedule fixed itself.
Comment 5 Jay Shaughnessy 2012-04-09 09:46:50 EDT
I've also come up empty with respect to recreating this scenario. Although
the issue states that the DB values were valid, and the situation can
self-correct with a sync, I started by looking at whether we could somehow
store a bad value and propagate it to an agent. Although very unlikely it
seemed maybe possible. The following commit resulted to protect against it:

commit 6ff5d43b1b864bfecd7a352f7c72bc81bff56642

    Related to Bug 786889 which deals with invalid (low) schedule intervals
    reaching the agent.  Although it seems unlikely that the bad interval
    was stored in the db, there exist ways to store a bad value and it can
    in fact reach the agent.  Don't let that happen by fixing the
    interval verification in the SLSB code and blocking all values < the
    minimum collection interval.


Server-side we do some things with metric schedule setting that is a bit
hairy, whether it be template/group/resource, we do use negative, 0 and
positive values to indicate various enable/disable scenarios. Various attempts
to get these bad values to the agent didn't pan out, I don't see a way to 
make that happen.

So it seems the issue must somewhere be agent side but again, I don't see
it, so instead I plan to catch/report/correct a bad value being set.  At least
a prevention and copious logging may help us track this down.
Comment 6 Jay Shaughnessy 2012-04-09 15:06:59 EDT
commit aa65805371fd2f1901cc46aeff55e6866ad54eb7
commit f234f3ac5f157341cd2535342ecacf97f66cef1e (a tweak to the original commit)

Catch any attempt to set a metric collection interval to an invalid value.
If found, set it to 20 minutes and report the problem in the log as an
ERROR, with a lot of supporting info to help track down the cause.
Comment 7 Jay Shaughnessy 2012-04-17 10:40:18 EDT
Back to ON_DEV, I recreated on 3.0.0.  Investigating.

To reproduce I used AS4 WAR type and updated two out-of-box disabled metrics
simultaneously to enabled, not changing the interval.

I doubt several of the details above are relevant, most likely there is a
general issue with template metric manipulation.
Comment 8 Jay Shaughnessy 2012-04-25 14:50:22 EDT
The fixes for the root cause, the template-level metric enable/disable issue,
are in and tied to other bzs for the various releases.

The code changes here are still relevant, as they can protect against
bad metric collection intervals in general.  But given the other fixes
I don't expect this code to trigger unless there is a future
regression.

If forced in a testing scenario, ensure the 20 minute interval is used and
that the log shows the appropriate ERROR.
Comment 10 Filip Brychta 2012-05-29 11:10:48 EDT
1- followed described scenario
2- following messages found in server log:
2012-05-29 15:34:40,075 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 0, SQLState: 22012
2012-05-29 15:34:40,075 ERROR [org.hibernate.util.JDBCExceptionReporter] ERROR: division by zero
2012-05-29 15:34:40,131 ERROR [org.rhq.enterprise.server.measurement.MeasurementScheduleManagerBean] MeasurementSchedule data was corrupt: automatically updated 1 resources and 1 to correct the issue; agents were notified

After this, collection interval was reset to 30s but metric was not collected.

3- wait 20min -> metric is being collected with 30s interval

Not sure if this is desired behaviour. Jay wrote in his comment that 20 minute interval should be used.
Comment 11 Charles Crouch 2012-10-30 18:01:54 EDT
Jay 
Can you comment on https://bugzilla.redhat.com/show_bug.cgi?id=786889#c10 and indicate if anything is going awry here.
Comment 12 Jay Shaughnessy 2012-11-15 11:17:49 EST
The messages in comment 10 are related to a completely different codepath and error handling issue.  It is working as expected and is several years old. If interested, see commit 6d7d22a403a784390e11c730820f736ccc6e37c9 by Joe, but it's not related.

I think this issue can be closed and I'm moving it to ON_QA.
Comment 13 Larry O'Leary 2013-09-06 10:32:06 EDT
As this is MODIFIED or ON_QA, setting milestone to ER1.

Note You need to log in before you can comment on or make changes to this bug.