878246 – Updated alert defs may not fire in an HA environment

Summary: Updated alert defs may not fire in an HA environment

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	High Availability
Sub Component:
Version:	JON 3.1.1
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	JON 3.1.2
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:	878224
Blocks:

Reported:	2012-11-19 22:49 UTC by Larry O'Leary
Modified:	2018-12-01 17:50 UTC
CC List:	3 users
Fixed In Version:
Clone Of:	878224
Environment:
Last Closed:	2013-09-11 11:04:41 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	69800	0	None	None	None	Never

Description Larry O'Leary 2012-11-19 22:49:33 UTC

+++ This bug was initially created as a clone of upstream Bug #878224 +++

This is a longstanding but subtle problem that may become more prevalent now that Availability Duration alerting makes availability recovery alert pairings more useful.

In an HA (high availability/multi-server) environment, when an alert definition is updated, certain condition types are not updated on every server.  The affected condition types are:
 - Availability
 - Availability Duration
 - Resource Operation Execution
 - Resource Configuration Execution

Relevant updates include any condition changes, changes to the condition policy (all/any), enabling or disabling the alert definition, and possibly others.  This implicitly affects recovery alerting, which disables and enables alert definitions, whenever those alert definitions contain any of the condition types listed above.

The condition caches are properly updated on the HA server node that processes the alert def update, but not on the others. So, the problem only occurs when subsequent condition matches would have occurred on the servers that were not properly updated.

In short, stale alert definitions are possible: an alert may fire when it should not, or fail to fire when it should.
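
The staleness described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Server` class, the `dirty` flag, and the function names are hypothetical stand-ins and are not taken from the RHQ/JON code base. The model assumes that each server reloads its condition cache only when its own dirty flag is set, and that an update flags only the server that handled it.

```python
# Toy model only: all names here are hypothetical, not RHQ/JON APIs.
class Server:
    def __init__(self, name):
        self.name = name
        self.dirty = True   # True means "reload the condition cache on the next check"
        self.cache = {}     # alert definition id -> enabled flag, as this server sees it

    def reload_if_dirty(self, database):
        # Copy the persisted state into the local cache, then clear the flag.
        if self.dirty:
            self.cache = dict(database)
            self.dirty = False


def update_definition(database, handling_server, def_id, enabled):
    # Persist the change, but flag only the server that handled the update.
    database[def_id] = enabled
    handling_server.dirty = True


def periodic_reload(servers, database):
    for server in servers:
        server.reload_if_dirty(database)


database = {"war-a-goes-down": True}
server_a, server_b = Server("ServerA"), Server("ServerB")
servers = [server_a, server_b]
periodic_reload(servers, database)      # initial load: both caches see "enabled"

# The alert fires on ServerA, which disables the definition.
update_definition(database, server_a, "war-a-goes-down", False)
periodic_reload(servers, database)

# The definition is re-enabled through ServerB.
update_definition(database, server_b, "war-a-goes-down", True)
periodic_reload(servers, database)

print(server_a.cache, server_b.cache, database)
```

In this model ServerA still holds the disabled definition after the last reload, so an availability change reported through ServerA would not match it, even though the database and ServerB both show the definition as enabled.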

--- Additional comment from Jay Shaughnessy on 2012-11-19 17:12:14 EST ---


Here is a fairly simple example that reproduces the problem:

1) Create an HA env like:

Server A
 - Agent A connected
   - RHQ Server resource imported
     - some webapp (e.g. ROOT.war, jconsole.war), call it WAR A
 - GUI A connected
Server B
 - Agent B connected
 - GUI B connected

2) Using GUI A, create a GOES DOWN availability alert on WAR A

   - set it to Disable when fired

3) Wait 30s and then execute the Stop operation on WAR A (any GUI)

   - You should see the alert fire and the alert def disable.
   - In the Server A log you should see something like:

   INFO [CacheConsistencyManagerBean] ServerA took [28]ms to reload global cache

4) Execute the Start operation on WAR A (any GUI)

5) Using GUI B enable the alert definition. Wait 30s.

   - In the Server B log you should see something like:

   INFO [CacheConsistencyManagerBean] ServerB took [28]ms to reload global cache

   - You will not see this message in the Server A log.

6) Execute the Stop operation on WAR A (any GUI)

   - You will see the avail change to DOWN
   - You will not see an alert fire
   - The alert def will not disable
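
When working through the steps above, it may help to check mechanically which servers logged a global cache reload. The following is a minimal sketch that assumes the log line has exactly the shape quoted in steps 3 and 5; the function name and the sample lines are made up for illustration, and the second sample line is invented filler rather than real server output.

```python
import re

# Matches lines shaped like the one quoted in the reproduction steps, e.g.
# "INFO [CacheConsistencyManagerBean] ServerA took [28]ms to reload global cache"
RELOAD_LINE = re.compile(
    r"INFO\s+\[CacheConsistencyManagerBean\]\s+(\S+) took \[(\d+)\]ms to reload global cache"
)


def servers_that_reloaded(log_lines):
    """Return (server name, milliseconds) for every reload message found."""
    found = []
    for line in log_lines:
        match = RELOAD_LINE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


# Illustrative input only; the second line is invented filler.
sample = [
    "INFO [CacheConsistencyManagerBean] ServerB took [28]ms to reload global cache",
    "INFO [SomeOtherBean] unrelated message",
]
reloaded = servers_that_reloaded(sample)
print(reloaded)
```

Running this over the lines collected from each server's log after step 5 should list ServerB but not ServerA if the problem is present.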

Comment 1 Jay Shaughnessy 2012-12-05 21:25:39 UTC

commit b79e1c1ce301f4f65ad62b32a14be40988a4e090
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Dec 5 16:23:50 2012 -0500

When setting the server status dirty to notify the need for global condition cache
refresh, update *all* servers.  The global condition cache is supposed to be
replicated across HA servers.  Otherwise, different servers will have different
condition sets generating unexpected results.

    Cherry pick of master 8ab939690aefbb6316aca6336c41804f728d290e
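
The idea in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual change: the function name, the `notify_all` parameter, and the dictionary of per-server flags are hypothetical and only contrast flagging the handling server with flagging every server.

```python
# Hypothetical names; a sketch of the idea in the commit message, not the RHQ code.
def flag_for_cache_reload(dirty_flags, handling_server, notify_all):
    """Return a copy of the per-server dirty flags after an alert definition update."""
    updated = dict(dirty_flags)
    for name in updated:
        # notify_all=False models the old behavior, notify_all=True the described fix.
        if notify_all or name == handling_server:
            updated[name] = True
    return updated


flags = {"ServerA": False, "ServerB": False}
before_fix = flag_for_cache_reload(flags, "ServerB", notify_all=False)
after_fix = flag_for_cache_reload(flags, "ServerB", notify_all=True)
print(before_fix, after_fix)
```

With every server flagged, each one reloads the global condition cache on its next check, so all HA nodes converge on the same condition set.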

Comment 2 Simeon Pinder 2012-12-10 12:40:45 UTC

Moving to ON_QA as available for test in 3.1.2.ER4 or greater: https://brewweb.devel.redhat.com//buildinfo?buildID=246861

Comment 3 Filip Brychta 2012-12-14 16:58:38 UTC

Verified on 3.1.2.ER4
