Red Hat Bugzilla – Bug 534725
how to handle known/to-be-expected outages?
Last modified: 2014-05-02 17:00:10 EDT
Suppose you have a managed resource that needs to be recycled periodically (once a day, a couple of times a week, etc.).
The recycle happens quickly - but sometimes at night.
Since the recycle happens quickly (typically in less than a minute), you might not want to emit an alert. An alert would typically trigger an email page to on-duty support personnel at home, which we don't want because it's a known outage that does not require help.
This is potentially talking about altering the availability checks themselves.
Availability checks occur on the agent, inside of the plugin that is monitoring the managed resource.
If you are monitoring JBossAS instances, the jboss-as plugin performs availability checks by connecting to the JBossAS MBeanServer and making sure it's still up.
If you are monitoring a PostgreSQL DB (as another example) the postgres plugin will perform a db connection to test that the DB is alive.
In all cases, the availability timeout is 5 seconds. That is, when the plugin container asks each plugin to do its availability check, the plugin container gives the plugin 5 seconds to complete it; if the plugin container does not hear back from the plugin either way, it assumes "down". Read this JIRA, which is the issue that introduced the 5 seconds: http://jira.rhq-project.org/browse/RHQ-14
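For anyone following along, here is a rough sketch of the "no answer within the timeout means down" behavior described above. This is not the plugin container's actual code; the class name `TimedAvailabilityCheck`, the `Availability` enum, and the `check` method are made up for illustration, and the timeout is passed in as a parameter only so the idea is easy to try out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not the RHQ plugin container implementation.
final class TimedAvailabilityCheck {
    enum Availability { UP, DOWN }

    /**
     * Runs the plugin's check on a worker thread and waits at most
     * timeoutMillis for an answer. A timeout or a failed check counts as DOWN.
     */
    static Availability check(Callable<Availability> pluginCheck, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Availability> future = executor.submit(pluginCheck);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The plugin did not answer either way in time: assume "down".
            future.cancel(true);
            return Availability.DOWN;
        } catch (ExecutionException e) {
            // The check itself threw (e.g. connection refused): fail fast as "down".
            return Availability.DOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Availability.DOWN;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a 5000 ms timeout this mirrors the behavior described above: a resource that refuses the connection reports down almost immediately, while a hung resource reports down only after the full timeout.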
Currently there is no way to configure that timeout. It is already tracked in JIRA, but nothing has been implemented yet: http://jira.rhq-project.org/browse/RHQ-551
I'm not sure that being able to configure this timeout is 100% what we need, though. If the resource is really down, the availability check will complete within the timeout (the plugin will fail fast and say, "this is down"). At that point, the availability report makes its way to the server and the server emits an alert. There is no retry mechanism: the availability checks are designed to be very fast, one-and-done "is this up or down?" checks. By default they are performed every 1 minute inside the agent, so in effect THAT is the retry mechanism.
Once the availability makes its way to the server, the only way to suppress the alert is using the dampening features of the alert subsystem.
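To make the dampening idea concrete, here is a minimal sketch of one possible rule: only alert after N consecutive "down" reports. With checks arriving roughly once a minute, a threshold of 2 would let a sub-minute recycle pass without paging anyone. This is not RHQ's dampening code; the class `ConsecutiveDownDampener` and its `record` method are hypothetical names used only for illustration.

```java
// Hypothetical illustration only; not the RHQ alert subsystem.
final class ConsecutiveDownDampener {
    private final int threshold;   // number of consecutive DOWN reports required
    private int consecutiveDown;   // DOWN reports seen since the last UP

    ConsecutiveDownDampener(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one availability report.
     * Returns true only on the report that reaches the threshold, so the
     * alert fires once per outage rather than on every later DOWN report.
     */
    boolean record(boolean up) {
        if (up) {
            consecutiveDown = 0;
            return false;
        }
        consecutiveDown++;
        return consecutiveDown == threshold;
    }
}
```

The trade-off is that a real outage is reported later, by roughly (threshold - 1) check intervals.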
Same problem with a slightly different take: is there any way we can automatically disable an alert for a period of time?
At this time, alerts can only be disabled manually - you have to go in and disable the alert yourself, then manually re-enable it when you want it back.
This kinda indicates the need for supporting scheduled disabled time periods for alerts (e.g. for scheduled maintenance downtimes).
So, for example, we could say, "this resource is expected to be down from 1:00am to 1:15am, ignore all alerts from it during that timeframe".
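A quick sketch of what such a check could look like, assuming a single daily window defined by a start and end time. Nothing like this exists in the product at the time of writing; the class `MaintenanceWindow` and its methods are hypothetical.

```java
import java.time.LocalTime;

// Hypothetical illustration only; no such feature exists in the alert subsystem.
final class MaintenanceWindow {
    private final LocalTime start; // inclusive
    private final LocalTime end;   // exclusive

    MaintenanceWindow(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    /** True if the given time of day falls inside the window. */
    boolean contains(LocalTime time) {
        if (!start.isAfter(end)) {
            // Same-day window, e.g. 01:00 to 01:15 (empty if start equals end).
            return !time.isBefore(start) && time.isBefore(end);
        }
        // Window wraps past midnight, e.g. 23:50 to 00:10.
        return !time.isBefore(start) || time.isBefore(end);
    }

    /** A DOWN report raises an alert only when it arrives outside the window. */
    boolean shouldAlert(boolean resourceUp, LocalTime reportTime) {
        return !resourceUp && !contains(reportTime);
    }
}
```

For the example above, `new MaintenanceWindow(LocalTime.of(1, 0), LocalTime.of(1, 15))` would ignore a DOWN report at 01:05 but still alert on one at 02:00.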
This assumes the outages occur at the times specified. If the outages are random, this won't help.
This might be a candidate for a (custom) remote API app that can be executed before and after a known outage, as part of the scheduled outage task.
oooo... you mean integrate with a future CLI / WebServices interface? mmmmmm... interesting. A good use case for a remote API.
This should be part of a much larger schedules / on duty / .... system.
See also the discussions about Administratively DOWN (there is at least one JIRA about this). I think I also have a chat log about this between me, Joe and Greg.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1494
This bug is related to RHQ-620
mass add of key word FutureFeature to help track
This is handled with the new DISABLED resource state/availability.