Bug 534281 (RHQ-1092)

Summary: be able to configure the availability "quiet time" before backfilling
Product: [Other] RHQ Project Reporter: John Mazzitelli <mazz>
Component: Monitoring Assignee: John Mazzitelli <mazz>
Status: CLOSED NEXTRELEASE QA Contact: Corey Welton <cwelton>
Severity: medium Docs Contact:
Priority: high    
Version: unspecified Keywords: Improvement
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
URL: http://jira.rhq-project.org/browse/RHQ-1092
Whiteboard:
Fixed In Version: 1.2 Doc Type: Enhancement
Bug Depends On: 534286    
Bug Blocks:    

Description John Mazzitelli 2008-11-08 19:30:00 EST
When we have very large environments, it may be that we cannot process avail reports fast enough before our checkForSuspectAgents job backfills the agent.  I see this every now and again: resources go red, then green, and ping-pong back and forth between up and down.

We should be able to configure the "quiet time" that we allow before backfilling.  Currently, it's hardcoded to 2 minutes and the agent sends the avail messages every 1 minute.  The agent is configurable - set "rhq.agent.plugins.availability-scan.period-secs" to have it send avail reports faster or slower than 1 minute.  We currently have a hack to configure the server (set a prop in rhq-server.properties - see below).  We should either consider putting a value in RHQ_SYSTEM_CONFIG so it's the same across the cloud (and changeable via the Admin UI page) or we could put some kind of smarts in the server so it could say something like, "I'm getting clobbered with a lot of agent messages/inventory reports/etc - I'll let agents slide another 2 minutes on avail reports - so I won't backfill unless I don't hear from an agent in 4 minutes".  The server could then readjust later when it catches up, back to the 2-minute backfill quiet time.
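
To make the "smarts in the server" idea concrete, here is a rough sketch of how the quiet time could stretch from 2 to 4 minutes while the server is overloaded. Everything in it is hypothetical: the class name, the pending-message counter and the busy threshold are assumptions for illustration, not existing RHQ code.

```java
/** Hypothetical helper, not part of RHQ: stretches the quiet time while the server is busy. */
final class AdaptiveQuietTime {
    // The 2-minute default quiet time described in this report.
    static final long BASE_QUIET_TIME_MS = 2 * 60 * 1000L;
    // "Let agents slide another 2 minutes" while the server is catching up.
    static final long EXTRA_WHEN_BUSY_MS = 2 * 60 * 1000L;

    // Assumption: the number of pending agent messages above which the server counts as overloaded.
    private final int busyThreshold;

    AdaptiveQuietTime(int busyThreshold) {
        this.busyThreshold = busyThreshold;
    }

    /** Returns 4 minutes while overloaded, otherwise the normal 2 minutes. */
    long currentQuietTimeMs(int pendingAgentMessages) {
        return pendingAgentMessages > busyThreshold
            ? BASE_QUIET_TIME_MS + EXTRA_WHEN_BUSY_MS
            : BASE_QUIET_TIME_MS;
    }

    /** True if the agent has been quiet longer than the currently allowed quiet time. */
    boolean shouldBackfill(long lastAvailReportMs, long nowMs, int pendingAgentMessages) {
        return nowMs - lastAvailReportMs > currentQuietTimeMs(pendingAgentMessages);
    }
}
```

With a threshold of 100 pending messages, an agent that has been silent for 3 minutes would be backfilled on an idle server but left alone on a server with 500 messages queued. Because the quiet time is recomputed on every check, it drops back to 2 minutes as soon as the backlog clears.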

In our AgentManagerBean we have:

    @SuppressWarnings("unchecked")
    public void checkForSuspectAgents() {
        if (log.isDebugEnabled())
            log.debug("Checking to see if there are agents that we suspect are down...");

        // TODO [mazz]: make this configurable via SystemManager bean
        long maximumQuietTimeAllowed = 1000L * 60 * 2;
        try {
            String propStr = System.getProperty("rhq.server.agent-max-quiet-time-allowed");
            if (propStr != null) {
                maximumQuietTimeAllowed = Long.parseLong(propStr);
            }
        } catch (Exception e) {
            // ignore a malformed value and keep the 2-minute default
        }
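
Note that the snippet above assigns the property straight to maximumQuietTimeAllowed, so the value is read as milliseconds, and an unparsable value is silently ignored. A minimal standalone sketch of the same fallback behaviour follows; the class and method names are made up for illustration, and rejecting non-positive values is an extra assumption that the original code does not make.

```java
/** Hypothetical helper, not part of RHQ: resolves the quiet time with a safe default. */
final class QuietTimeSetting {
    // Property name taken from the snippet in this report.
    static final String PROP = "rhq.server.agent-max-quiet-time-allowed";
    // Same default as the hardcoded value: 2 minutes, in milliseconds.
    static final long DEFAULT_MS = 1000L * 60 * 2;

    /** Parses a raw value in milliseconds; falls back to the default when it is missing or invalid. */
    static long parse(String propStr) {
        if (propStr == null) {
            return DEFAULT_MS;
        }
        try {
            long value = Long.parseLong(propStr.trim());
            // Assumption: a zero or negative quiet time makes no sense, so treat it as invalid.
            return value > 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }

    /** Reads the system property and applies the same fallback. */
    static long fromSystemProperty() {
        return parse(System.getProperty(PROP));
    }
}
```

For example, a value of 240000 would give agents 4 minutes of quiet time before backfilling.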
Comment 1 John Mazzitelli 2008-11-08 22:01:38 EST
An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill.

Perhaps we can look in our DB for ANY activity from the agent right before we backfill.  If we see that we already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or any other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet.  In this case, we abort the backfill.

So it's:

1) checkForSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing then:
2) check to see if the agent has sent us any message in the previous 2-minute interval (like inventory report, measurement report, operation result, etc).  If we DID get such a message from the agent, abort and do not backfill.  Otherwise:
3) continue with the normal backfill processing
3) continue with the normal backfill processing

So step 2) would be new.
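
The three steps above could be sketched roughly as follows. This is only an illustration of the decision logic: the class, the method and the two timestamp parameters are hypothetical, and in the real server the timestamps would have to come from DB lookups that are not shown here.

```java
/** Hypothetical sketch of the proposed check, not existing RHQ code. */
final class BackfillDecision {
    // The 2-minute quiet time from this report, in milliseconds.
    static final long QUIET_TIME_MS = 2 * 60 * 1000L;

    /**
     * @param nowMs              current time
     * @param lastAvailReportMs  time of the agent's last avail report, or null if none was seen
     * @param lastOtherMessageMs time of the agent's last other message (inventory report,
     *                           measurement report, operation result, ...), or null if none was seen
     * @return true if the agent should be backfilled
     */
    static boolean shouldBackfill(long nowMs, Long lastAvailReportMs, Long lastOtherMessageMs) {
        // 1) A recent avail report means the agent is not suspect.
        if (isRecent(nowMs, lastAvailReportMs)) {
            return false;
        }
        // 2) New step: any other recent agent message means the agent is up, so abort the backfill.
        if (isRecent(nowMs, lastOtherMessageMs)) {
            return false;
        }
        // 3) Nothing heard within the quiet time: continue with the normal backfill.
        return true;
    }

    private static boolean isRecent(long nowMs, Long timestampMs) {
        return timestampMs != null && nowMs - timestampMs <= QUIET_TIME_MS;
    }
}
```

The extra lookup in step 2 only runs for agents that already failed step 1, so it should add little load in the common case where avail reports arrive on time.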
Comment 2 John Mazzitelli 2008-11-11 12:25:48 EST
Making this critical - we need to at least explore the possibility of bumping up the quiet time interval and the avail report interval.
Comment 3 John Mazzitelli 2008-11-12 02:23:58 EST
Admin > Server Config page now allows you to specify the agent max quiet time allowed setting, which is what our check-suspect-agent job will use.  Therefore, this setting takes effect across the cloud.  We no longer support that hidden system property override.
Comment 4 Corey Welton 2008-12-22 22:17:11 EST
QA Verified.
Comment 5 Red Hat Bugzilla 2009-11-10 15:23:45 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1092
This bug relates to RHQ-1303
Comment 6 wes hayutin 2010-02-16 16:07:47 EST
Mass move to component= Monitoring