Bug 534725 (RHQ-1494) - how to handle known/to-be-expected outages?
Summary: how to handle known/to-be-expected outages?
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: RHQ-1494
Product: RHQ Project
Classification: Other
Component: No Component
Version: unspecified
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: RHQ Project Maintainer
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:
Reported: 2009-02-06 20:51 UTC by John Mazzitelli
Modified: 2023-09-14 01:18 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-05-02 21:00:10 UTC
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 534334 0 low CLOSED Add option to clear (delete) Dashboard "Recent Alerts", "Recent Operations", "Scheduled Operations", "Problem Resources"... 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 741450 0 medium CLOSED RFE: Improve Availability Handling (Tracker) 2021-02-22 00:41:40 UTC

Internal Links: 535340 741450

Description John Mazzitelli 2009-02-06 20:51:00 UTC
Suppose you have a managed resource that needs to be recycled periodically (once a day, a couple of times a week, etc.).

The recycle happens quickly - but sometimes at night.

Since the recycle happens quickly (typically less than a minute), you might not want to emit an alert (which would typically trigger an email page to on-duty support personnel at home - which we don't want to do because it's a known outage that does not require help).

Comment 1 John Mazzitelli 2009-02-06 20:53:20 UTC
This is potentially talking about altering the availability checks themselves.

Availability checks occur on the agent, inside of the plugin that is monitoring the managed resource.

If you are monitoring JBossAS instances, the jboss-as plugin performs availability checks by connecting to the JBossAS MBeanServer and making sure it's still up.

If you are monitoring a PostgreSQL DB (as another example) the postgres plugin will perform a db connection to test that the DB is alive.

In all cases, the availability timeout is 5 seconds (that is, when the plugin container asks each plugin to perform its avail check, the plugin container gives the plugin 5 seconds to do so - if the plugin container does not hear back from the plugin either way, it assumes "down"). Read this JIRA - this is the issue that introduces the 5 seconds: http://jira.rhq-project.org/browse/RHQ-14
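
The timeout rule described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin container's actual code: the function name `check_with_timeout` and the "UP"/"DOWN" strings are made up for the example, and treating an exception as "down" is an assumption on my part.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

AVAIL_TIMEOUT_SECONDS = 5.0  # the 5 second limit mentioned in this comment


def check_with_timeout(avail_check, timeout=AVAIL_TIMEOUT_SECONDS):
    """Run avail_check() in a worker thread; no answer within `timeout` means "DOWN"."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(avail_check)
    try:
        return "UP" if future.result(timeout=timeout) else "DOWN"
    except FutureTimeout:
        # The check did not report back either way in time: assume down.
        return "DOWN"
    except Exception:
        # Assumption for this sketch: a check that blows up also counts as down.
        return "DOWN"
    finally:
        # Do not block on a hung check; its thread is left to finish on its own.
        executor.shutdown(wait=False)
```

Note that a resource that is really down usually answers well inside the limit, so the timeout only matters for checks that hang.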

Right now there is no way to configure that timeout - it is already tracked in JIRA, but nothing has been implemented yet: http://jira.rhq-project.org/browse/RHQ-551

I'm not sure that being able to configure this timeout is going to be 100% what we need, though. If the resource is really down, the availability check will complete within the timeout (the plugin will fail fast and say, "this is down"). At that point, the availability report makes its way to the server and the server will emit an alert. There is no retry mechanism - the availability checks are designed to be very fast, one-and-done "is this up or down?" checks. The availability checks are performed, by default, every 1 minute inside the agent - so in effect, THAT is the retry mechanism.

Once the availability makes its way to the server, the only way to suppress the alert is using the dampening features of the alert subsystem.
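
To make the dampening idea concrete, here is a minimal sketch of one possible rule: only alert once a given number of DOWN reports arrive back to back. The function name and the default of 2 are assumptions for illustration; this does not reproduce the alert subsystem's actual dampening options.

```python
def should_alert(reports, consecutive_down=2):
    """Return True once `consecutive_down` DOWN reports arrive back to back."""
    streak = 0
    for status in reports:
        # Any non-DOWN report resets the streak.
        streak = streak + 1 if status == "DOWN" else 0
        if streak >= consecutive_down:
            return True
    return False
```

With checks running once a minute, a recycle that lasts less than a minute would normally show up as a single DOWN report, which a rule like this would swallow.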

Comment 2 John Mazzitelli 2009-02-06 20:54:26 UTC
Same problem with a slightly different take: is there any way we can disable an alert for a period of time automatically?

At this time, alerts can only be disabled manually - you have to go in and disable the alert yourself, then manually re-enable it when you want it back.


Comment 3 John Mazzitelli 2009-02-06 21:01:21 UTC
This kinda indicates the need for supporting scheduled disabled time periods for alerts (e.g. for scheduled maintenance downtimes).

So, for example, we could say, "this resource is expected to be down from 1:00am to 1:15am, ignore all alerts from it during that timeframe".

This assumes the outages occur at the times specified. If the outages are random, this won't help.
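
A scheduled window like the 1:00am to 1:15am example could be checked along these lines. This is a sketch only; `in_maintenance_window` is a hypothetical helper, and the choice to exclude the end time itself is an assumption.

```python
from datetime import datetime, time


def in_maintenance_window(moment, start=time(1, 0), end=time(1, 15)):
    """Return True if `moment` falls inside the daily window [start, end)."""
    t = moment.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 23:30 to 00:30.
    return t >= start or t < end
```

An alert raised at a time for which this returns True would be ignored; anything outside the window is handled as usual.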

Comment 4 Jay Shaughnessy 2009-02-06 21:06:37 UTC
This might be a candidate for a (custom) remote API app that can be executed before and after a known outage, as part of the scheduled outage task.
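
Roughly, such a wrapper around the outage task might look like the sketch below. Everything here is hypothetical: no remote API existed at the time of this comment, so `FakeAlertClient` and `set_alerts_enabled` are invented stand-ins, not real RHQ calls.

```python
from contextlib import contextmanager


class FakeAlertClient:
    """Hypothetical stand-in for a remote API client; not a real RHQ class."""

    def __init__(self):
        self.enabled = {}

    def set_alerts_enabled(self, resource_id, enabled):
        self.enabled[resource_id] = enabled


@contextmanager
def alerts_suppressed(client, resource_id):
    """Disable alerts before the outage and re-enable them afterwards."""
    client.set_alerts_enabled(resource_id, False)
    try:
        yield
    finally:
        # Re-enable even if the outage task itself fails.
        client.set_alerts_enabled(resource_id, True)
```

The scheduled task would run the recycle inside `with alerts_suppressed(client, resource_id):`, so alerts come back on even when the recycle raises an error.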

Comment 5 John Mazzitelli 2009-02-06 21:07:48 UTC
oooo... you mean integrate with a future CLI / WebServices interface? mmmmmm... interesting. A good use case for a remote API.

Comment 6 Heiko W. Rupp 2009-02-06 21:18:09 UTC
This should be part of a much larger schedules / on duty / .... system.

See also the discussions about Administratively DOWN (there is at least one jira about this) and I think I also have a chat log about this with me, Joe and Greg.

Comment 7 Red Hat Bugzilla 2009-11-10 20:34:18 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1494
This bug is related to RHQ-620


Comment 8 wes hayutin 2010-02-16 17:10:14 UTC
mass add of key word FutureFeature to help track

Comment 9 Jay Shaughnessy 2014-05-02 21:00:10 UTC
This is handled with the new DISABLED resource state/availability.
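
As a rough illustration of why a DISABLED state covers this case: if "went down" alerts only fire on a transition into DOWN, a resource marked DISABLED for the duration of the recycle never triggers one. The enum and function below are a sketch under that assumption, not the server's actual alert logic.

```python
from enum import Enum


class Availability(Enum):
    UP = "UP"
    DOWN = "DOWN"
    DISABLED = "DISABLED"


def down_alert_fires(previous, current):
    """Fire only on a transition into DOWN; DISABLED never fires."""
    return current is Availability.DOWN and previous is not Availability.DOWN
```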

Comment 10 Red Hat Bugzilla 2023-09-14 01:18:40 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

