Bug 534286 (RHQ-1098) - make availability report interval longer
Summary: make availability report interval longer
Alias: RHQ-1098
Product: RHQ Project
Classification: Other
Component: Agent
Version: unspecified
Hardware: All
OS: All
medium vote
Target Milestone: ---
Assignee: John Mazzitelli
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Depends On:
Blocks: RHQ-1092 741450
Reported: 2008-11-10 17:00 UTC by John Mazzitelli
Modified: 2011-09-26 20:47 UTC (History)

Fixed In Version: 1.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed:


System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 741450 0 medium CLOSED RFE: Improve Availability Handling (Tracker) 2021-02-22 00:41:40 UTC

Internal Links: 741450

Description John Mazzitelli 2008-11-10 17:00:00 UTC
Currently, the agent sends its availability reports every 60 seconds and the server expects to hear from the agent within 2 minutes.

I think we want to lengthen these times to something like 90 seconds and 4 minutes.  Note the 90 seconds (on agent side) is configurable and we should be able to configure that 4 minutes on server side.

This change will a) cause less traffic to hit the server (in fact, going from 60 to 90 seconds reduces the number of avail reports to be processed by about a third) and b) we only backfill agents when they have been silent for 4 minutes, giving the agent more time to get an avail report processed on the server side.  Backfilling is expensive if the agent is UP, so we only want to backfill when we are sure the agent is down.

Perhaps before we backfill, we should have the server try to ping the agent and if the ping succeeds, we shouldn't backfill.  Just another test we could do to avoid backfilling when possible.
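
A minimal sketch of that ping-before-backfill idea, in Python for brevity. The names `backfill_if_unreachable`, `ping`, and `backfill` are hypothetical and are not taken from the RHQ code base; both callables are passed in, so nothing here assumes how the server actually reaches an agent.

```python
from typing import Callable


def backfill_if_unreachable(
    agent_name: str,
    ping: Callable[[str], bool],
    backfill: Callable[[str], None],
) -> bool:
    """Backfill the agent only if a ping fails. Returns True if we backfilled.

    `ping` and `backfill` are hypothetical hooks supplied by the caller.
    """
    try:
        reachable = bool(ping(agent_name))
    except Exception:
        # A ping that raises is treated the same as a ping that fails.
        reachable = False

    if reachable:
        # The agent answered, so it is up; skip the expensive backfill.
        return False

    backfill(agent_name)
    return True
```

The trade-off is one extra network round trip per suspect agent in exchange for skipping a backfill when the agent is merely slow to report.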

Comment 1 John Mazzitelli 2008-11-11 17:26:15 UTC
we need to investigate what the proper interval should be.

Comment 2 John Mazzitelli 2008-11-12 07:21:11 UTC
An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill.

Perhaps we can look in our DB for ANY activity from the agent right before we backfill. If we see that we already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet. In this case, we abort the backfill.

So it's:

1) checkSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing then:
2) check to see if the agent has sent us any message in the previous 2m interval (like inventory report, measurement report, operation result, etc). If we DID get such a message from the agent, abort and do not backfill. Otherwise:
3) continue with the normal backfill processing

So step 2) would be new. 
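
The three steps above can be sketched roughly as follows. This is an illustration in Python, not the actual checkSuspectAgents code: the `AgentActivity` record, its field names, and `should_backfill` are assumptions made for the example, and timestamps are plain seconds.

```python
from dataclasses import dataclass
from typing import Optional

QUIET_SECONDS = 120  # the 2-minute quiet time described in this comment


@dataclass
class AgentActivity:
    """Hypothetical record of when the server last heard from an agent."""
    last_avail_report: Optional[float]   # timestamp of last avail report
    last_other_message: Optional[float]  # inventory, measurement, operation result, etc.


def should_backfill(agent: AgentActivity, now: float,
                    quiet: float = QUIET_SECONDS) -> bool:
    def recent(ts: Optional[float]) -> bool:
        return ts is not None and now - ts <= quiet

    # Step 1: an avail report within the quiet time means the agent is fine.
    if recent(agent.last_avail_report):
        return False
    # Step 2 (the new one): any other agent-originating message also counts.
    if recent(agent.last_other_message):
        return False
    # Step 3: nothing heard at all, continue with the normal backfill.
    return True
```

For example, an agent whose last avail report is 5 minutes old but which delivered a measurement report 30 seconds ago would not be backfilled.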

Comment 3 John Mazzitelli 2009-01-20 16:18:04 UTC
thoughts re: making additional queries before backfilling to see if we heard from the agent: we don't want to make more queries, since that kinda defeats the purpose of reducing the load on the database.

We could just change the quiet time - this would mean the agent still sends avail reports every 1 minute (thus we are still able to alert within a minute of when resources go down), but the increased quiet period lets us delay the backfill, giving us more time to process avail reports if we need it.  This would delay alerting, but only in the case when the agent as a whole goes down.

We could increase the backfill quiet time default to 3 or 4 minutes.

Comment 4 Joseph Marques 2009-02-04 22:10:56 UTC
let's wait to see what information falls out of charles' testing.  this might be pushed to future.

Comment 5 John Mazzitelli 2009-06-08 13:30:15 UTC
we decided against changing the interval. i'm sure we'll revisit this in the future :) but for now, we will leave this as is.

Comment 6 John Mazzitelli 2009-08-12 21:38:15 UTC
as expected, we are revisiting.

I think we will make the defaults 5 minutes for agent avail reporting and 15 minutes for the quiet period.

Comment 7 John Mazzitelli 2009-08-12 22:04:57 UTC
agent avail reporting is every 5 minutes.
server allows a quiet time of 15 minutes.
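
As a rough check on what these defaults mean compared with the original 60 second / 2 minute settings, the arithmetic can be written out as below. The constant and function names are made up for this example and are not RHQ configuration keys.

```python
# Values taken from this bug; the names are illustrative only.
OLD_REPORT_INTERVAL_S = 60        # agent avail report every 60 seconds
OLD_QUIET_TIME_S = 2 * 60         # server quiet time of 2 minutes
NEW_REPORT_INTERVAL_S = 5 * 60    # agent avail report every 5 minutes
NEW_QUIET_TIME_S = 15 * 60        # server quiet time of 15 minutes


def reports_per_hour(interval_s: int) -> float:
    """Avail reports one agent sends per hour at the given interval."""
    return 3600 / interval_s


def missed_reports_before_backfill(interval_s: int, quiet_s: int) -> int:
    """Whole report intervals that fit inside the quiet time."""
    return quiet_s // interval_s


old_rate = reports_per_hour(OLD_REPORT_INTERVAL_S)
new_rate = reports_per_hour(NEW_REPORT_INTERVAL_S)
reduction = 1 - new_rate / old_rate  # fraction of avail reports no longer sent

old_missed = missed_reports_before_backfill(OLD_REPORT_INTERVAL_S, OLD_QUIET_TIME_S)
new_missed = missed_reports_before_backfill(NEW_REPORT_INTERVAL_S, NEW_QUIET_TIME_S)
```

Under these numbers each agent goes from 60 reports an hour to 12, and the server tolerates three missed report intervals instead of two before backfilling. The cost is that resource-level availability changes can take up to 5 minutes to reach the server.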

Comment 8 Red Hat Bugzilla 2009-11-10 20:23:53 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1098
