In very large environments, the server may not be able to process avail reports fast enough before our checkForSuspectAgents job backfills the agent. I see this every now and again: resources go red, then green, and ping-pong back and forth between up and down.

We should be able to configure the "quiet time" that we allow before backfilling. Currently, it's hardcoded to 2 minutes, and the agent sends avail reports every 1 minute. The agent is configurable: set "rhq.agent.plugins.availability-scan.period-secs" to have it send avail reports faster or slower than once a minute. We currently have a hack to configure the server (set a prop in rhq-server.properties; see below).

We should either consider putting a value in RHQ_SYSTEM_CONFIG so it's the same across the cloud (and changeable via the Admin UI page), or we could put some kind of smarts in the server so it could say something like, "I'm getting clobbered with a lot of agent messages/inventory reports/etc. - I'll let agents slide another 2 minutes on avail reports - so I won't backfill unless I don't hear from an agent in 4 minutes". The server could then readjust later when it catches up, back to the 2-minute backfill quiet time.

In our AgentManagerBean we have:

    @SuppressWarnings("unchecked")
    public void checkForSuspectAgents() {
        if (log.isDebugEnabled())
            log.debug("Checking to see if there are agents that we suspect are down...");

        // TODO [mazz]: make this configurable via SystemManager bean
        long maximumQuietTimeAllowed = 1000L * 60 * 2;
        try {
            String propStr = System.getProperty("rhq.server.agent-max-quiet-time-allowed");
            if (propStr != null) {
                maximumQuietTimeAllowed = Long.parseLong(propStr);
            }
        } catch (Exception e) {
        }
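To make the two ideas in this comment concrete, here is a minimal sketch of how the quiet time could be resolved from the property and then stretched while the server is overloaded. Only the property name and the 2-minute default come from the snippet above. The class name, the method names, and the serverOverloaded flag are hypothetical and are not RHQ code, and how the server would detect that it is overloaded is left open.

```java
// Sketch only: class, method, and parameter names are hypothetical, not RHQ code.
final class QuietTimeSketch {
    // Property name taken from the AgentManagerBean snippet; the value is in milliseconds.
    static final String QUIET_TIME_PROP = "rhq.server.agent-max-quiet-time-allowed";
    // The hardcoded 2-minute default from the snippet.
    static final long DEFAULT_QUIET_TIME_MS = 1000L * 60 * 2;

    // Parse the property value, falling back to the default when it is
    // missing, not a number, or not positive.
    static long resolveQuietTimeMillis(String propStr) {
        if (propStr == null) {
            return DEFAULT_QUIET_TIME_MS;
        }
        try {
            long parsed = Long.parseLong(propStr.trim());
            return parsed > 0 ? parsed : DEFAULT_QUIET_TIME_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_QUIET_TIME_MS;
        }
    }

    // Let agents slide by extraMs while the server is overloaded,
    // and return to the base quiet time once it has caught up.
    static long effectiveQuietTimeMillis(long baseMs, long extraMs, boolean serverOverloaded) {
        return serverOverloaded ? baseMs + extraMs : baseMs;
    }

    public static void main(String[] args) {
        long base = resolveQuietTimeMillis(System.getProperty(QUIET_TIME_PROP));
        System.out.println("base quiet time (ms): " + base);
        System.out.println("quiet time while overloaded (ms): "
            + effectiveQuietTimeMillis(base, DEFAULT_QUIET_TIME_MS, true));
    }
}
```

Unlike the empty catch block in the snippet, this falls back to the default explicitly, so a bad property value cannot silently produce a zero or negative quiet time.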
An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill. Perhaps we can look in our DB for ANY activity from the agent right before we backfill. If we see that we have already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or any other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet. In this case, we abort the backfill. So it's:

1) checkForSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing, then:
2) check to see if the agent has sent us any message in the previous 2-minute interval (like an inventory report, measurement report, operation result, etc.). If we DID get such a message from the agent, abort and do not backfill. Otherwise:
3) continue with the normal backfill processing.

So step 2) would be new.
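The three steps above could be sketched roughly as follows. This is only an illustration of the decision logic: the class and method names are hypothetical, and the two timestamps are assumed to have been looked up already (the DB query for the latest agent-originating message is not shown).

```java
// Sketch only: names are hypothetical, not RHQ code.
final class BackfillDecisionSketch {
    /**
     * Decide whether an agent should be backfilled.
     *
     * @param nowMs              current time in epoch millis
     * @param quietTimeMs        maximum quiet time allowed, in millis
     * @param lastAvailReportMs  time of the last avail report from the agent
     * @param lastOtherMessageMs time of the last other agent-originating message
     *                           (inventory report, measurement report, operation result, ...)
     */
    static boolean shouldBackfill(long nowMs, long quietTimeMs,
                                  long lastAvailReportMs, long lastOtherMessageMs) {
        // Step 1: a recent avail report means the agent is not suspect at all.
        if (nowMs - lastAvailReportMs <= quietTimeMs) {
            return false;
        }
        // Step 2 (new): any other recent message means the agent is up and
        // just has not gotten its avail report through yet, so abort.
        if (nowMs - lastOtherMessageMs <= quietTimeMs) {
            return false;
        }
        // Step 3: nothing heard from the agent within the quiet time, so backfill.
        return true;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long quiet = 1000L * 60 * 2;
        // Avail report is 3 minutes old, but a measurement report arrived 30 seconds ago.
        System.out.println(shouldBackfill(now, quiet, now - 180_000L, now - 30_000L));
    }
}
```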
Making this critical - we need to at least explore the possibility of bumping up the quiet time interval and the avail report interval.
The Admin > Server Config page now allows you to specify the agent max quiet time allowed setting, which is what our check-suspect-agent job will use. Therefore, this setting takes effect across the cloud. We no longer support that hidden system property override.
QA Verified.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1092
This bug relates to RHQ-1303
Mass move to component= Monitoring