Description of problem:
Adding a recovery alert element to an alert causes the alert not to fire. Removing the recovery alert allows the alert to fire again.

Version-Release number of selected component (if applicable):
3.1.2 with hotfix-03

How reproducible:
Always

Steps to Reproduce:
1. Import and manage an EAP 5 server.
2. Be sure to start the EAP 5 server in a terminal window with the script: bin/run.sh -c default
3. Enable Log Events for the server.
4. Create 2 alerts:
   Dummy Alert
     - name: Dummy Alert
     - disabled: true
   Down Recovery Alert
     - name: Down Recovery Alert
     - condition: availability duration: stays down for 1 minute
     - [notifications: resource operation: start this resource] (optional)
5. Do a ctrl-c to gracefully stop the EAP 5 server (not a hard stop with kill -9). We want the shutdown logs to trigger the alert, and it does after a minute or two.
6. Start the server again with run.sh -c default
7. Change the 'Down Recovery Alert' to add a recovery alert that enables Dummy Alert.
8. Do a ctrl-c to gracefully stop the EAP 5 server.
9. Observe that the alert never fires.
10. Optionally, take off the recovery alert and it will work again.

Actual results:
The alert no longer fires.

Expected results:
The alert should fire with a recovery alert added to it.

Additional info:
I'm not sure whether there is really an issue here. Perhaps there is, but it could also be a misunderstanding of recovery alerting and/or alert duration conditions. Moreover, some fixes put in place since 3.1.2 could feasibly affect this behavior, so I'd suggest testing against RHQ 4.9 or JON 3.2 before any further investigation.

Remember that when using recovery alerts, the idea is that you have two alert defs whose condition matching is mutually exclusive. The problem alert def should be initially enabled; if it fires, it is disabled and the recovery alert def becomes active. If the recovery alert def fires, it re-enables the problem alert def and goes back to sleep until it is needed again.

Availability change conditions match only when the relevant change of availability is detected. Availability duration conditions match only when the relevant change in availability is detected and the same availability type is still set after the duration period expires. In essence it's a "goes down and stays down for X minutes" condition (if using Down avail, for example).

So, in the scenario above, I would expect that if the logEvent alert def was created and enabled, and the recovery alert def was defined, then at the time of the ctrl-c the logEvent alert def should fire and be disabled automatically, and the recovery alert def should be enabled. But it's quite possible that the down availability has already been reported before the recovery alert def is ready to condition match (especially in 3.1.2; this was sped up in 3.2). In that situation the "goes down" portion of the avail duration condition will not match, so the recovery alert def will not fire until, perhaps, the server has cycled again completely.

Last comment: I'm not exactly sure why there would be a log event alert def that seems to be looking for a shutdown event, and then a recovery alert def for "goes down for X minutes". That seems like two defs for basically the same thing. Typically the recovery alert def would be a "goes up" condition.
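
To make the timing argument above concrete, here is a minimal sketch of a "goes down and stays down for X" condition. It is a simplified model written for illustration, not RHQ's actual implementation: the class `DurationCondition`, the helper `fires`, and the use of integer seconds are all assumptions. The point it shows is that if the alert def is not yet active when the change to DOWN is reported, the condition never arms and so never matches, however long the resource stays down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationCondition:
    """Hypothetical model of 'goes <avail_type> and stays so for <duration> seconds'."""
    avail_type: str
    duration: int
    armed_at: Optional[int] = None  # time at which the matching change was seen

    def on_change(self, now: int, new_avail: str, active: bool) -> None:
        # The change is only noticed if the alert def is active at that moment.
        if active and new_avail == self.avail_type:
            self.armed_at = now
        else:
            self.armed_at = None

    def matches(self, now: int, current_avail: str) -> bool:
        # Match only if armed, still in the same state, and the duration has elapsed.
        return (
            self.armed_at is not None
            and current_avail == self.avail_type
            and now - self.armed_at >= self.duration
        )


def fires(def_active_from: int, went_down_at: int, check_at: int, duration: int = 60) -> bool:
    """Would a 'stays down' condition match at check_at? (illustrative helper)"""
    cond = DurationCondition("DOWN", duration)
    cond.on_change(went_down_at, "DOWN", active=went_down_at >= def_active_from)
    return cond.matches(check_at, "DOWN")
```

For example, `fires(def_active_from=0, went_down_at=10, check_at=70)` is true, while `fires(def_active_from=15, went_down_at=10, check_at=500)` is false because the def became active only after the down availability had already been reported.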
I tested using the steps described under "Steps to Reproduce" in comment 0 but was unable to see any wrong behavior. It isn't clear what the log events have to do with this, as the steps don't seem to describe their use. However, setting up two alert definitions and then later making one recover the other seems to work just fine.

One note, though: in the test steps, the alert to be recovered (Dummy Alert) is disabled to start. Perhaps an enable step was missing from the list, but if not, then the behavior would essentially be expected. In other words, the Dummy Alert would never be evaluated because it was disabled, and the recovery alert itself (Down Recovery Alert) would never be evaluated because the alert condition of the Dummy Alert never occurred and so never made Down Recovery Alert eligible for evaluation.

If I am missing anything and closing this in error, please provide more information as it relates to JBoss ON 3.2.