Red Hat Bugzilla – Bug 878224
Updated alert defs may not fire in an HA environment
Last modified: 2013-09-03 10:43:16 EDT
This is a longstanding but subtle problem that may be becoming more prevalent now that Availability Duration alerting makes availability recovery alert pairings more useful.
In an HA (high availability/multi-server) environment, updating an alert definition did not update certain condition types on every server. The affected condition types were:
- Availability Duration
- Resource Operation Execution
- Resource Configuration Execution
Relevant updates involved any condition changes, the condition policy (all/any), alert definition enable and disable, and possibly others. This implicitly affects recovery alerting, which disables and enables alert definitions, when those alert definitions contain any of the condition types listed above.
The condition caches are properly updated on the HA server node that handles the alert def update, but not on the others. So, the problem only occurs when subsequent condition matches are evaluated on a server that was not properly updated.
In short, stale alert definitions are possible, so an alert may fire when it should not, or fail to fire when it should.
Here is a fairly simple example that reproduces the problem:
1) Create an HA env like:
- Agent A connected
- RHQ Server resource imported
- some webapp (e.g. ROOT.war, jconsole.war), call it WAR A
- GUI A connected
- Agent B connected
- GUI B connected
2) Using GUI A, create a GOES DOWN availability alert on WAR A
- set it to Disable when fired
3) Wait 30s and then execute the Stop operation on WAR A (any GUI)
- You should see the alert fire and the alert def disable.
- In the Server A log you should see something like:
INFO [CacheConsistencyManagerBean] ServerA took ms to reload global cache
4) Execute the Start operation on WAR A (any GUI)
5) Using GUI B enable the alert definition. Wait 30s.
- In the Server B log you should see something like:
INFO [CacheConsistencyManagerBean] ServerB took ms to reload global cache
- You will not see this message in the Server A log.
6) Execute the Stop operation on WAR A (any GUI)
- You will see the avail change to DOWN
- You will not see an alert fire
- The alert def will not disable
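The steps above can be walked through with a toy model. This is only a sketch, not RHQ code: the class and function names (AlertDef, Server, enable_via_gui, reproduce) are hypothetical, each server's condition cache is reduced to a single "enabled" flag, and it assumes WAR A's availability reports are always evaluated on Server A because Agent A is connected there.

```python
from dataclasses import dataclass


@dataclass
class AlertDef:
    """Hypothetical stand-in for the alert definition row shared by all servers."""
    enabled: bool = True
    disable_when_fired: bool = True


@dataclass
class Server:
    """Hypothetical HA server holding its own in-memory copy of the condition cache."""
    name: str
    cached_enabled: bool = True
    dirty: bool = False

    def reload_if_dirty(self, db):
        # Stand-in for the periodic cache check behind the "Wait 30s" steps.
        if self.dirty:
            self.cached_enabled = db.enabled
            self.dirty = False

    def on_goes_down(self, db):
        # The alert can fire only if this server's cached copy is enabled.
        if not self.cached_enabled:
            return False
        if db.disable_when_fired:
            db.enabled = False
            self.dirty = True  # modeled bug: only the handling server is flagged
        return True


def enable_via_gui(db, handling_server):
    """Re-enable the definition through the GUI attached to handling_server."""
    db.enabled = True
    handling_server.dirty = True  # modeled bug: other servers are not flagged


def reproduce():
    db = AlertDef()
    server_a, server_b = Server("ServerA"), Server("ServerB")
    servers = [server_a, server_b]

    fired_first = server_a.on_goes_down(db)   # step 3: alert fires, def disables
    for s in servers:
        s.reload_if_dirty(db)                 # only Server A reloads

    enable_via_gui(db, server_b)              # step 5: enabled through GUI B
    for s in servers:
        s.reload_if_dirty(db)                 # only Server B reloads

    fired_second = server_a.on_goes_down(db)  # step 6: Server A is stale
    return fired_first, fired_second, db.enabled, server_a.cached_enabled
```

In this model the definition ends up enabled in the database while Server A still holds the disabled copy, so the second DOWN event produces no alert and the definition is not disabled again.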
Author: Jay Shaughnessy <firstname.lastname@example.org>
Date: Mon Nov 19 17:49:07 2012 -0500
When setting the server status dirty to notify the need for global condition cache refresh, update *all* servers. The global condition cache is supposed to be replicated across HA servers. Otherwise, different servers will have different condition sets generating unexpected results.
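The idea in this commit message can be sketched as follows. This is an illustration under assumptions, not the actual change: ServerRow, cache_dirty and the function names are hypothetical, and the real server status handling in RHQ may look quite different.

```python
from dataclasses import dataclass


@dataclass
class ServerRow:
    """Hypothetical record for one HA server and its 'cache needs reload' flag."""
    name: str
    cache_dirty: bool = False


def mark_dirty_handling_server_only(servers, handling_name):
    # Behavior described in the bug: only the server that handled
    # the alert definition update is told to reload its cache.
    for server in servers:
        if server.name == handling_name:
            server.cache_dirty = True


def mark_dirty_all_servers(servers):
    # Behavior described in the fix: every server is told to reload,
    # so all copies of the global condition cache stay in step.
    for server in servers:
        server.cache_dirty = True


def servers_that_will_reload(servers):
    return [server.name for server in servers if server.cache_dirty]


def new_cloud():
    return [ServerRow("ServerA"), ServerRow("ServerB")]


before = new_cloud()
mark_dirty_handling_server_only(before, "ServerB")

after = new_cloud()
mark_dirty_all_servers(after)
```

Here servers_that_will_reload(before) contains only ServerB, while servers_that_will_reload(after) contains both servers. The trade-off is an extra cache reload on servers that would otherwise have been skipped.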
Bulk closing of issues in old RHQ releases that are in production for a while now.
Please open a new issue when running into an issue.