Bug 534725 - (RHQ-1494) how to handle known/to-be-expected outages? [NEEDINFO]
Product: RHQ Project
Classification: Other
Component: No Component
Hardware: All  OS: All
Priority: medium  Severity: medium
Assigned To: RHQ Project Maintainer
Keywords: FutureFeature, Improvement
Depends On:
Reported: 2009-02-06 15:51 EST by John Mazzitelli
Modified: 2014-05-02 17:00 EDT

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-05-02 17:00:10 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
cwelton: needinfo? (jmarques)

Description John Mazzitelli 2009-02-06 15:51:00 EST
Suppose you have a managed resource that needs to be recycled periodically (once a day, a couple of times a week, etc.).

The recycle happens quickly - but sometimes at night.

Since the recycle happens quickly (typically in less than a minute), you might not want to emit an alert. An alert would typically trigger an email page to on-duty support personnel at home, which we don't want to do because it's a known outage that does not require help.
Comment 1 John Mazzitelli 2009-02-06 15:53:20 EST
This is potentially talking about altering the availability checks themselves.

Availability checks occur on the agent, inside of the plugin that is monitoring the managed resource.

If you are monitoring JBossAS instances, the jboss-as plugin performs availability checks by connecting to the JBossAS MBeanServer and making sure it's still up.

If you are monitoring a PostgreSQL DB (as another example), the postgres plugin opens a DB connection to test that the DB is alive.

In all cases, the availability timeout is 5 seconds. That is, when the plugin container asks each plugin to do its availability check, the plugin container gives the plugin 5 seconds to complete it - if the plugin container does not hear back from the plugin either way, it assumes "down". Read this JIRA - this is the issue that introduced the 5 seconds: http://jira.rhq-project.org/browse/RHQ-14
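
To make the timeout behavior concrete, here is a minimal Python sketch of the idea (it is not the actual plugin container code, which is Java). The names `check_with_timeout`, `AVAIL_TIMEOUT_SECONDS` and `make_slow_check` are made up for illustration; the only things taken from this comment are the 5 second limit and the "no answer means down" rule.

```python
import concurrent.futures
import time

AVAIL_TIMEOUT_SECONDS = 5.0  # the fixed timeout described in this comment


def check_with_timeout(avail_check, timeout=AVAIL_TIMEOUT_SECONDS):
    """Run one availability check and return "UP" or "DOWN".

    A check that raises, returns a falsy value, or does not answer
    within `timeout` seconds is reported as "DOWN".
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(avail_check)
    try:
        return "UP" if future.result(timeout=timeout) else "DOWN"
    except concurrent.futures.TimeoutError:
        return "DOWN"  # no answer in time: assume down
    except Exception:
        return "DOWN"  # the check itself failed: also down
    finally:
        # Do not block on a hung check; let its thread finish on its own.
        executor.shutdown(wait=False)


def make_slow_check(seconds):
    """Hypothetical helper: build a check that takes `seconds` to answer UP."""
    def slow_check():
        time.sleep(seconds)
        return True
    return slow_check
```

Note that a resource that is really down usually fails fast, so it takes the "check failed" path long before the timeout matters.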

Currently, there is no way to configure that timeout. A request for it is already in JIRA, but nothing has been implemented yet: http://jira.rhq-project.org/browse/RHQ-551

I'm not sure that being able to configure this timeout is 100% of what we need, though. If the resource is really down, the availability check will complete well within the timeout (the plugin will fail fast and say, "this is down"). At that point, the availability report makes its way to the server and the server will emit an alert. There is no retry mechanism - the availability checks are designed to be very fast, one-and-done "is this up or down?" checks. By default, the availability checks are performed every 1 minute inside the agent - so in effect, THAT is the retry mechanism.

Once the availability report makes its way to the server, the only way to suppress the alert is to use the dampening features of the alert subsystem.
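
As a rough illustration of how dampening could hide a short recycle, here is a Python sketch that only fires once a number of consecutive "DOWN" reports have been seen. This is an assumption about one possible dampening rule, not the alert subsystem's real implementation; the class name `ConsecutiveDownDampener` and its `threshold` parameter are hypothetical.

```python
class ConsecutiveDownDampener:
    """Fire an alert only after `threshold` consecutive DOWN reports."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive_down = 0

    def report(self, status):
        """Record one availability report ("UP" or "DOWN").

        Returns True exactly once per outage, at the moment the
        run of DOWN reports reaches the threshold.
        """
        if status == "DOWN":
            self.consecutive_down += 1
        else:
            self.consecutive_down = 0  # any UP report resets the run
        return self.consecutive_down == self.threshold


dampener = ConsecutiveDownDampener(threshold=2)
reports = ["UP", "DOWN", "UP", "DOWN", "DOWN", "DOWN"]
fired = [dampener.report(status) for status in reports]
```

With reports arriving once a minute and a threshold of 2, a recycle that takes less than a minute produces at most one DOWN report and so never fires, while a longer outage fires on its second DOWN report.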
Comment 2 John Mazzitelli 2009-02-06 15:54:26 EST
Same problem with a slightly different take: is there any way we can automatically disable an alert for a period of time?

At this time, alerts can only be disabled manually - you have to go in and disable the alert yourself, then manually re-enable it when you want it back.
Comment 3 John Mazzitelli 2009-02-06 16:01:21 EST
This kinda indicates the need for supporting scheduled disabled time periods for alerts (e.g. for scheduled maintenance downtimes).

So, for example, we could say, "this resource is expected to be down from 1:00am to 1:15am, ignore all alerts from it during that timeframe".

This assumes the outages occur at the times specified. If the outages are random, this won't help.
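
A scheduled window like the one in the example could be checked along these lines. This is only a Python sketch of the proposal; nothing like it exists in the product at this point, and `in_window`, `should_alert` and the `(start, end)` window tuples are hypothetical names and shapes.

```python
from datetime import datetime, time


def in_window(now, start, end):
    """Return True if the time of day of `now` falls in [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 23:50 to 00:10.
    return t >= start or t < end


def should_alert(status, now, windows):
    """Alert on DOWN unless `now` is inside a scheduled maintenance window."""
    if status != "DOWN":
        return False
    return not any(in_window(now, start, end) for start, end in windows)


# "this resource is expected to be down from 1:00am to 1:15am"
maintenance_windows = [(time(1, 0), time(1, 15))]
```

A DOWN report at 1:05am is ignored with this schedule, while one at 1:20am still alerts, which is the behavior you want if the recycle overruns its window.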
Comment 4 Jay Shaughnessy 2009-02-06 16:06:37 EST
This might be a candidate for a (custom) remote API app that can be executed before and after a known outage, as part of the scheduled outage task.
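
For illustration, the before/after calls could be wrapped so that alerts always get re-enabled, even if the outage task fails. This Python sketch assumes a remote client with `disable_alerts` and `enable_alerts` methods; no such remote API exists yet, so both method names and the `RecordingClient` stand-in are hypothetical.

```python
from contextlib import contextmanager


class RecordingClient:
    """Hypothetical stand-in for a remote API client; it only records calls."""

    def __init__(self):
        self.calls = []

    def disable_alerts(self, resource_id):
        self.calls.append(("disable", resource_id))

    def enable_alerts(self, resource_id):
        self.calls.append(("enable", resource_id))


@contextmanager
def alerts_suppressed(client, resource_id):
    """Disable alerts before a known outage and re-enable them afterwards."""
    client.disable_alerts(resource_id)
    try:
        yield
    finally:
        # Runs even if the outage task raises, so alerts are never left off.
        client.enable_alerts(resource_id)


client = RecordingClient()
with alerts_suppressed(client, resource_id=10001):
    pass  # the scheduled recycle of the resource would run here
```

The scheduled outage task would then put its recycle step inside the `with` block.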
Comment 5 John Mazzitelli 2009-02-06 16:07:48 EST
oooo... you mean integrate with a future CLI / WebServices interface? mmmmmm... interesting. A good use case for a remote API.
Comment 6 Heiko W. Rupp 2009-02-06 16:18:09 EST
This should be part of a much larger schedules / on duty / .... system.

See also the discussions about Administratively DOWN (there is at least one jira about this) and I think I also have a chat log about this with me, Joe and Greg.
Comment 7 Red Hat Bugzilla 2009-11-10 15:34:18 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1494
This bug is related to RHQ-620
Comment 8 wes hayutin 2010-02-16 12:10:14 EST
mass add of key word FutureFeature to help track
Comment 9 Jay Shaughnessy 2014-05-02 17:00:10 EDT
This is handled with the new DISABLED resource state/availability.