Bug 1020956 - Adding recovery alert to existing alert causes alert to not fire anymore
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Monitoring - Alerts
Version: JON 3.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: JON 3.2.1
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-10-18 15:19 UTC by Mike Thompson
Modified: 2014-01-07 21:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-07 21:12:34 UTC
Type: Bug



Description Mike Thompson 2013-10-18 15:19:37 UTC
Description of problem:
The addition of a recovery alert element to an alert causes the alert not to fire. Removing this recovery alert allows the alert to fire again.


Version-Release number of selected component (if applicable):
3.1.2 with hotfix-03


How reproducible: Always


Steps to Reproduce:
1. Import and manage an EAP 5 server
2. Be sure to start the EAP 5 server in a terminal window with the script: bin/run.sh -c default
3. Enable Log Events for the server
4. Create 2 Alerts

Dummy Alert:
 -name: Dummy Alert
 -disabled: true

Down Recovery Alert
 -name: Down Recovery Alert
 -condition: availability duration: stays down for 1 minute
 [-notifications: resource operation: start this resource] optional

5. Do a ctrl-c to gracefully stop the EAP 5 server (not a hard stop with kill -9). We want the shutdown logs to trigger the alert, and it does after a minute or two.
6. Now start the server again with run.sh -c default
7. Change the 'Down Recovery Alert' to add a recovery alert to enable Dummy Alert
8. Do a ctrl-c to gracefully stop the EAP 5 server.
9. Observe that the alert never fires
10. Optionally, take off the recovery alert and it will work again.



Actual results:
Alert doesn't fire now


Expected results:
Alert should fire with the addition of a recovery alert on it.


Additional info:

Comment 1 Jay Shaughnessy 2013-10-22 17:40:35 UTC
I'm not sure if there is really an issue here or not. Perhaps. But it is also possible that this is a misunderstanding of recovery alerting and/or alert duration conditions. Moreover, there have been some fixes put in place since 3.1.2 that could feasibly affect this behavior. So, I'd suggest it be tested against RHQ 4.9 or JON 3.2 before any further investigation.

Remember that when using recovery alerts the idea is that you have two alert defs that are mutually exclusively doing condition matching.  The problem alert def should be initially enabled and if fired will be disabled and the recovery alert def will be active.  If the recovery alert def fires it will then re-enable the problem alert def and go back to sleep until it is needed again.
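The hand-off described above can be sketched as a small state machine. This is only an illustrative toy model, not RHQ or JON code; the class name RecoveryPair and its methods are hypothetical and exist only to show the enable/disable cycle.

```python
class RecoveryPair:
    """Toy model of a problem alert def and the recovery alert def tied to it."""

    def __init__(self):
        self.problem_enabled = True    # problem def starts out enabled
        self.recovery_active = False   # recovery def sleeps until the problem def fires
        self.fired = []                # names of the defs that fired, in order

    def problem_condition_met(self):
        if not self.problem_enabled:
            return False               # a disabled def does no condition matching
        self.fired.append("problem")
        self.problem_enabled = False   # disabled automatically after firing
        self.recovery_active = True    # the recovery def takes over
        return True

    def recovery_condition_met(self):
        if not self.recovery_active:
            return False               # still asleep, nothing to recover yet
        self.fired.append("recovery")
        self.recovery_active = False   # goes back to sleep
        self.problem_enabled = True    # re-enables the problem def
        return True


pair = RecoveryPair()
pair.recovery_condition_met()   # ignored: the recovery def is not active yet
pair.problem_condition_met()    # fires, disables itself, wakes the recovery def
pair.problem_condition_met()    # ignored: the problem def is disabled
pair.recovery_condition_met()   # fires and re-enables the problem def
```

At any moment only one of the two definitions is matching conditions, which is the "mutually exclusive" behavior the comment refers to.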

Availability change conditions match only when the relevant change of availability is detected.  Availability duration conditions match only when the relevant change in availability is detected, and then the same availability type is set after the duration period expires. In essence it's a "goes down and stays down for X minutes" condition (if using Down avail, for example).
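As a rough illustration of the "goes down and stays down for X minutes" semantics, the sketch below evaluates a list of availability changes. It is not how the server implements the condition; the function name duration_match_time and the (timestamp, availability) input format are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Change = Tuple[int, str]  # (timestamp in seconds, "UP" or "DOWN")


def duration_match_time(changes: List[Change], target: str, duration_s: int) -> Optional[int]:
    """Return the time at which the duration condition matches, or None.

    `changes` is assumed to be sorted by time and to contain only real
    availability changes, so consecutive entries always differ.
    """
    for i, (t, availability) in enumerate(changes):
        if availability != target:
            continue  # not the change this condition is waiting for
        deadline = t + duration_s
        # Any later change before the deadline means the resource left `target`.
        interrupted = any(t2 < deadline for t2, _ in changes[i + 1:])
        if not interrupted:
            return deadline
    return None
```

For example, a resource that goes DOWN at t=0 and comes back UP at t=30 never satisfies a 60 second "stays down" condition, while a later DOWN that is not followed by another change does.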

So, in the scenario above, I would expect that if the logEvent alert def was created and enabled, and the recovery alert def was defined, then at the time of the ctrl-c the logEvent alert def should fire, be disabled automatically and the recovery alert def should enable.

But, it's quite possible that before the recovery alert def is ready to condition match (especially in 3.1.2, this was sped up in 3.2) that the down availability has already been reported.  In this situation the "goes down" portion of the avail duration condition will not match. Therefore the recovery alert def will not fire until perhaps the server cycled again completely.
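The timing problem described above can be illustrated with a toy watcher that only sees availability changes reported after it starts condition matching. This is a sketch under that assumption, not the actual server logic; DownDurationWatcher and its methods are hypothetical names.

```python
class DownDurationWatcher:
    """Toy watcher for 'goes DOWN and stays DOWN for duration_s seconds'."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.active_from = None   # when the def started condition matching
        self.down_since = None    # time of the DOWN change it observed, if any

    def activate(self, now):
        self.active_from = now
        self.down_since = None

    def on_change(self, now, availability):
        if self.active_from is None or now < self.active_from:
            return  # change reported before the def was watching: missed
        self.down_since = now if availability == "DOWN" else None

    def matches(self, now):
        return self.down_since is not None and now - self.down_since >= self.duration_s


watcher = DownDurationWatcher(duration_s=60)
watcher.on_change(10, "DOWN")   # DOWN is reported first...
watcher.activate(15)            # ...and only then does the def become active
missed = watcher.matches(500)   # the "goes down" change was never seen

watcher.on_change(600, "UP")    # the server cycles completely
watcher.on_change(630, "DOWN")
matched_later = watcher.matches(690)
```

In this model the watcher stays silent for the first outage no matter how long it lasts, and only matches once a full up/down cycle happens while it is active.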

Last comment:  I'm not exactly sure why there would be a log event alert def that seems to be looking for a shutdown event, and then a recovery alert def for goes down for x minutes.  That seems like two defs for basically the same thing.  Typically the recovery alert def would be a "goes up" condition.

Comment 3 Larry O'Leary 2014-01-07 21:12:34 UTC
I tested using the steps described by "Steps to Reproduce" in comment 0 but was unable to see any wrong behavior. 

It isn't clear what the log events have to do with this, as the steps don't seem to describe their use. However, setting up two alert definitions and then later making one recover the other seems to work just fine. One note though: I see that in the test steps the alert to be recovered (Dummy Alert) is disabled to start. Perhaps an enable step was missing from the list, but if not, then the behavior would essentially be expected. In other words, the Dummy Alert would never be evaluated because it was disabled, and the recovery alert itself (Down Recovery Alert) would never be evaluated because the alert condition of the Dummy Alert never occurred and so never made Down Recovery Alert eligible for evaluation.

If I am missing anything and closing this in error, please provide more information as it relates to JBoss ON 3.2.

