534281 – (RHQ-1092) be able to configure the availability "quiet time" before backfilling

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	RHQ-1092
Product:	RHQ Project
Classification:	Other
Component:	Monitoring
Sub Component:
Version:	unspecified
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	John Mazzitelli
QA Contact:	Corey Welton
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:	RHQ-1098
Blocks:

Reported:	2008-11-09 00:30 UTC by John Mazzitelli
Modified:	2010-02-16 21:07 UTC (History)
CC List:	0 users
Fixed In Version:	1.2
Clone Of:
Environment:
Last Closed:
Embargoed:

Description John Mazzitelli 2008-11-09 00:30:00 UTC

When we have very large environments, it may be that we cannot process avail reports fast enough before our checkForSuspectAgents job backfills the agent.  I see this every now and again - where resources go red then green and ping pong back and forth between up and down.

We should be able to configure the "quiet time" that we allow before backfilling.  Currently, it's hardcoded to 2 minutes and the agent sends the avail messages every 1 minute.  The agent is configurable - set "rhq.agent.plugins.availability-scan.period-secs" to have it send avail reports faster or slower than 1 minute.  We currently have a hack to configure the server (set a prop in rhq-server.properties - see below).  We should either consider putting a value in RHQ_SYSTEM_CONFIG so it's the same across the cloud (and changeable via the Admin UI page) or we could put some kind of smarts in the server so it could say something like, "I'm getting clobbered with a lot of agent messages/inventory reports/etc - I'll let agents slide another 2 minutes on avail reports - so I won't backfill unless I don't hear from an agent in 4 minutes".  The server could then readjust later when it catches up, back to the 2-minute backfill quiet time.
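A minimal sketch of the "smarts in the server" idea, assuming the server can cheaply count how many agent messages are still waiting to be processed. The class name, the backlog counter and the threshold are hypothetical and are not part of RHQ; only the 2-minute base and the extra 2 minutes come from the text above.

```java
/** Hypothetical helper illustrating an adaptive quiet time; not RHQ code. */
final class AdaptiveQuietTime {
    /** The current hardcoded quiet time: 2 minutes, in milliseconds. */
    static final long BASE_QUIET_TIME_MILLIS = 2 * 60 * 1000L;

    /** Extra slack granted to agents while the server is overloaded: another 2 minutes. */
    static final long EXTRA_WHEN_BUSY_MILLIS = 2 * 60 * 1000L;

    private AdaptiveQuietTime() {
    }

    /**
     * Returns the quiet time to allow before backfilling.
     *
     * @param pendingAgentMessages assumed metric: agent messages received but not yet processed
     * @param busyThreshold        assumed tuning knob: backlog size above which the server counts as busy
     */
    static long currentQuietTimeMillis(int pendingAgentMessages, int busyThreshold) {
        boolean busy = pendingAgentMessages > busyThreshold;
        // While busy, let agents slide; once the backlog drains, fall back to the base value.
        return busy ? BASE_QUIET_TIME_MILLIS + EXTRA_WHEN_BUSY_MILLIS : BASE_QUIET_TIME_MILLIS;
    }
}
```

Because the value is recomputed on every check, the quiet time drops back to 2 minutes by itself as soon as the backlog falls below the threshold.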

In our AgentManagerBean we have:

    @SuppressWarnings("unchecked")
    public void checkForSuspectAgents() {
        if (log.isDebugEnabled())
            log.debug("Checking to see if there are agents that we suspect are down...");

        // TODO [mazz]: make this configurable via SystemManager bean
        // default quiet time: 2 minutes, in milliseconds
        long maximumQuietTimeAllowed = 1000L * 60 * 2;
        try {
            // hidden override: the value is used as-is, so it must be in milliseconds
            String propStr = System.getProperty("rhq.server.agent-max-quiet-time-allowed");
            if (propStr != null) {
                maximumQuietTimeAllowed = Long.parseLong(propStr);
            }
        } catch (Exception e) {
            // ignore a malformed value and keep the 2-minute default
        }
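For comparison, a slightly stricter way to read such an override could look like the sketch below. It is only an illustration: the class and method names are hypothetical, and rejecting non-positive values is an assumption that the excerpt above does not make.

```java
/** Hypothetical helper for parsing a quiet-time override; not RHQ code. */
final class QuietTimeProperty {
    private QuietTimeProperty() {
    }

    /**
     * Parses a quiet time given in milliseconds.
     * Falls back to defaultMillis when the value is missing, malformed or not positive.
     */
    static long parse(String raw, long defaultMillis) {
        if (raw == null) {
            return defaultMillis;
        }
        try {
            long value = Long.parseLong(raw.trim());
            // Assumption: a zero or negative quiet time is never intended.
            return value > 0 ? value : defaultMillis;
        } catch (NumberFormatException e) {
            return defaultMillis;
        }
    }
}
```

It would be called as `QuietTimeProperty.parse(System.getProperty("rhq.server.agent-max-quiet-time-allowed"), 1000L * 60 * 2)`.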

Comment 1 John Mazzitelli 2008-11-09 03:01:38 UTC

An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill.

Perhaps we can look in our DB for ANY activity from the agent right before we backfill.  If we see that we have already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or any other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet.  In this case, we abort the backfill.

So it's:

1) checkForSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing then:
2) check to see if the agent has sent us any message in the previous 2m interval (like inventory report, measurement report, operation result, etc).  If we DID get such a message from the agent, abort and do not backfill.  Otherwise:
3) continue with the normal backfill processing

So step 2) would be new.
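The three steps can be sketched as a single decision function. This is only a sketch under assumptions: it takes the relevant timestamps as plain arguments instead of querying the DB, and the class name, method name and the NEVER marker are hypothetical.

```java
/** Hypothetical helper illustrating the proposed backfill decision; not RHQ code. */
final class BackfillDecision {
    /** Marker for "no such message has ever been seen from this agent". */
    static final long NEVER = -1L;

    private BackfillDecision() {
    }

    /**
     * @param nowMillis              current time
     * @param lastAvailReportMillis  time of the agent's last avail report, or NEVER
     * @param lastOtherMessageMillis time of the agent's last other message
     *                               (inventory report, measurement report, operation result, ...), or NEVER
     * @param quietTimeMillis        quiet time allowed before backfilling
     * @return true if the agent should be backfilled
     */
    static boolean shouldBackfill(long nowMillis, long lastAvailReportMillis,
        long lastOtherMessageMillis, long quietTimeMillis) {
        // Step 1: a recent avail report means the agent is not suspect.
        if (isRecent(nowMillis, lastAvailReportMillis, quietTimeMillis)) {
            return false;
        }
        // Step 2 (new): any other recent agent message means the agent is up; abort the backfill.
        if (isRecent(nowMillis, lastOtherMessageMillis, quietTimeMillis)) {
            return false;
        }
        // Step 3: nothing heard within the quiet time, continue with normal backfill processing.
        return true;
    }

    private static boolean isRecent(long nowMillis, long timestampMillis, long windowMillis) {
        return timestampMillis != NEVER && nowMillis - timestampMillis <= windowMillis;
    }
}
```

With a 2-minute quiet time, an agent whose last avail report is 5 minutes old but which delivered a measurement report 1 minute ago would not be backfilled.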

Comment 2 John Mazzitelli 2008-11-11 17:25:48 UTC

Making this critical - we need to at least explore the possibility of bumping up the quiet time interval and the avail report interval.

Comment 3 John Mazzitelli 2008-11-12 07:23:58 UTC

The Admin > Server Config page now allows you to specify the agent max quiet time allowed setting, which is what our check-suspect-agent job will use.  Therefore, this setting takes effect across the cloud.  We no longer support that hidden system property override.

Comment 4 Corey Welton 2008-12-23 03:17:11 UTC

QA Verified.

Comment 5 Red Hat Bugzilla 2009-11-10 20:23:45 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1092
This bug relates to RHQ-1303

Comment 6 wes hayutin 2010-02-16 21:07:47 UTC

Mass move to component= Monitoring
