Currently, the agent sends its availability reports every 60 seconds and the server expects to hear from the agent within 2 minutes.
I think we want to lengthen these times to something like 90 seconds and 4 minutes. Note that the 90 seconds (on the agent side) is configurable, and we should be able to configure the 4 minutes on the server side as well.
This change will a) cause less traffic to hit the server (going from 60 to 90 seconds reduces the number of avail reports to be processed by about a third) and b) mean we only backfill agents when they have been silent for 4 minutes, giving the agent more time to get an avail report processed on the server side. Backfilling is expensive if the agent is UP, so we only want to backfill when we are sure the agent is down.
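A quick sanity check on the traffic numbers, as a minimal sketch (the helper name is hypothetical, not anything in the codebase): moving from a 60-second to a 90-second interval works out to about a third fewer avail reports per agent; halving the count would need a 120-second interval.

```python
def reports_per_hour(interval_seconds: int) -> float:
    """Avail reports a single agent sends per hour at the given interval."""
    return 3600 / interval_seconds


current = reports_per_hour(60)    # reports/hour at the current 60s interval
proposed = reports_per_hour(90)   # reports/hour at the proposed 90s interval
reduction = 1 - proposed / current  # fraction of reports no longer sent

print(f"{current:.0f} -> {proposed:.0f} reports/hour, {reduction:.0%} fewer")
```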
Perhaps before we backfill, we should have the server try to ping the agent and if the ping succeeds, we shouldn't backfill. Just another test we could do to avoid backfilling when possible.
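Roughly, the ping guard could look like the sketch below. The `ping` callable is an assumption for illustration; it stands in for whatever server-to-agent check is actually available, and a ping that errors out is treated the same as no response.

```python
from typing import Callable


def should_backfill(agent_name: str, ping: Callable[[str], bool]) -> bool:
    """Return True only if the agent also fails to answer a ping.

    `ping` is a hypothetical callable that returns True when the agent
    responds.
    """
    try:
        agent_is_reachable = ping(agent_name)
    except Exception:
        # A ping attempt that blows up counts as "no response".
        agent_is_reachable = False
    return not agent_is_reachable
```

For example, `should_backfill("agent-1", lambda name: True)` returns `False`, so a reachable agent is never backfilled.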
We need to investigate what the proper interval should be.
An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill.
Perhaps we can look in our DB for ANY activity from the agent right before we backfill. If we see that we have already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or any other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet. In this case, we abort the backfill.
1) checkSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing then:
2) check to see if the agent has sent us any message in the previous 2m interval (like inventory report, measurement report, operation result, etc). If we DID get such a message from the agent, abort and do not backfill. Otherwise:
3) continue with the normal backfill processing
So step 2) would be new.
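The three steps above can be sketched as follows. This is only an illustration: the `AgentActivity` type, the function names and the return strings are made up here, and the timestamps are passed in directly, leaving open how they would be looked up.

```python
from dataclasses import dataclass
from typing import Optional

QUIET_TIME_SECONDS = 120  # the 2-minute window from the steps above


@dataclass
class AgentActivity:
    # Epoch seconds of the last avail report; None if never seen.
    last_avail_report: Optional[float]
    # Epoch seconds of the last other agent-originating message
    # (inventory report, measurement report, operation result, ...).
    last_other_message: Optional[float]


def _seen_within(timestamp: Optional[float], now: float, window: float) -> bool:
    return timestamp is not None and now - timestamp <= window


def decide_backfill(activity: AgentActivity, now: float,
                    quiet_time: float = QUIET_TIME_SECONDS) -> str:
    # Step 1: a recent avail report means the agent is not suspect.
    if _seen_within(activity.last_avail_report, now, quiet_time):
        return "not_suspect"
    # Step 2 (new): any other recent message means the agent is up,
    # so abort and do not backfill.
    if _seen_within(activity.last_other_message, now, quiet_time):
        return "abort_backfill"
    # Step 3: nothing heard at all, continue with the normal backfill.
    return "backfill"
```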
Thoughts re: making additional queries before backfilling to see if we heard from the agent: we don't want to make more queries. That kind of defeats the purpose, since we are trying to reduce the load on the database.
We could just change the quiet time. This would mean the agent still sends avail reports every 1 minute (so we are still able to alert within a minute of when resources go down), but the longer quiet period lets us delay the backfill, giving us more time to process avail reports if we need it. This would delay alerting, but only in the case when the agent as a whole goes down.
We could increase the backfill quiet time default to 3 or 4 minutes.
let's wait to see what information falls out of charles' testing. this might be pushed to future.
we decided against changing the interval. i'm sure we'll revisit this in the future :) but for now, we will leave this as is.
as expected, we are revisiting.
I think we will make the defaults 5 minutes for agent avail reporting and 15 minutes for the quiet period.
agent avail reporting is every 5 minutes.
server allows a quiet time of 15 minutes.
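To put the new defaults side by side, here is a small sketch. The constant and function names are hypothetical and are not the actual agent or server setting names. It shows how many whole report intervals fit inside the quiet time before an agent becomes a backfill candidate.

```python
AGENT_AVAIL_REPORT_INTERVAL_SECONDS = 5 * 60  # agent side: report every 5 minutes
SERVER_QUIET_TIME_SECONDS = 15 * 60           # server side: allow 15 minutes of silence


def intervals_within_quiet_time(report_interval: int, quiet_time: int) -> int:
    """Whole report intervals that fit inside the server's quiet time."""
    if report_interval <= 0:
        raise ValueError("report_interval must be positive")
    return quiet_time // report_interval


def is_suspect(seconds_since_last_report: float,
               quiet_time: int = SERVER_QUIET_TIME_SECONDS) -> bool:
    """True once the agent has been silent for longer than the quiet time."""
    return seconds_since_last_report > quiet_time


new_defaults = intervals_within_quiet_time(
    AGENT_AVAIL_REPORT_INTERVAL_SECONDS, SERVER_QUIET_TIME_SECONDS)
old_defaults = intervals_within_quiet_time(60, 120)
```

With the new defaults three report intervals fit inside the quiet time, compared with two under the old 60-second / 2-minute settings.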
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1098