1029553 – Recovery alerts involving availability may not fire in HA environment

Summary: Recovery alerts involving availability may not fire in HA environment

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Alerts, Core Server
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1030108

Reported:	2013-11-12 15:24 UTC by Jay Shaughnessy
Modified:	2014-04-23 12:30 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Clones:	1030108
Environment:
Last Closed:	2014-04-23 12:30:09 UTC
Embargoed:

Description Jay Shaughnessy 2013-11-12 15:24:27 UTC

This is an offshoot of Bug 1003132 Comment 15 (copied here):

...This is still more of an issue in an HA environment than previously stated. 

Here's the thing: consider an HA env with servers S1 and S2. Agent A monitors resource R and is talking to S1. R has an avail duration alert def for StaysDown5Minutes. R goes down. Agent A reports R down to S1, and S1 schedules a quartz job to fire in 5 minutes. 5 minutes later the quartz job fires and R is still down. So the duration alert fires, the global cache on the server handling the quartz job is updated immediately, and the other server is updated within 30 seconds.

But which server handled the quartz job? For some reason the assumption has been that it would be S1. When this was originally designed it didn't matter which server processed the quartz job doing the avail duration check, because cache reloads were always handled by the 30s recurring jobs executing on each server. But the work for this issue, where we updated the global cache immediately, I think made a bad assumption: that the update would happen on the same server that scheduled the job (S1 in the case above).

That assumption is false. The job could run on S1 or S2, as Quartz gives no guarantee which clustered scheduler will pick up the job. So, even with no failover, the server servicing the relevant agent may not get the fast cache reload.
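To make the timing concrete, here is a minimal sketch of the scenario as a toy model in Python. It is not RHQ or Quartz code. The `Server` class, the reload offsets, and the function names are all hypothetical; only the 30s recurring reload and the "firing server reloads immediately" behavior come from the description above.

```python
from dataclasses import dataclass

RELOAD_INTERVAL_S = 30  # interval of the recurring cache-reload job, per the description


@dataclass
class Server:
    name: str
    reload_offset_s: int  # hypothetical phase of this server's recurring reload job

    def next_periodic_reload(self, now):
        """Return the first recurring reload tick at or after `now`."""
        since_last_tick = (now - self.reload_offset_s) % RELOAD_INTERVAL_S
        if since_last_tick == 0:
            return now
        return now + (RELOAD_INTERVAL_S - since_last_tick)


def cache_reload_times(firing, servers, now):
    """The server that ran the job reloads at once; the others wait for their next tick."""
    return {s.name: now if s is firing else s.next_periodic_reload(now) for s in servers}


def agent_server_lag(firing, agent_server, servers, now):
    """Seconds until the server the agent talks to has the recovery condition cached."""
    return cache_reload_times(firing, servers, now)[agent_server.name] - now


s1 = Server("S1", reload_offset_s=10)  # Agent A talks to S1
s2 = Server("S2", reload_offset_s=25)
fire_time = 300  # the 5-minute duration check fires

print(agent_server_lag(s1, s1, [s1, s2], fire_time))  # 0: job happened to run on S1
print(agent_server_lag(s2, s1, [s1, s2], fire_time))  # 10: job ran on S2, S1 waits for its tick
```

In this model the lag seen by Agent A's server depends only on which server happened to pick up the job, which is the point of the paragraph above.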

Comment 1 Jay Shaughnessy 2013-11-12 15:35:30 UTC

I think the only solution here is to convert the avail duration checking from a clustered quartz trigger to an EJB Timer.

The free version of Quartz does not provide the ability to designate a specific server for executing a trigger in a clustered scheduler (this is the TC Where feature that exists in the pay version). So, the only obvious way to ensure that the avail duration check is executed on the scheduling server, S1 in the example above, is an EJB Timer. The assumption here is that the relevant agent will still be talking to S1, and therefore that is the server needing the immediate global cache reload.

There are two weaknesses:

First, in a failover situation we will still be subject to the <= 30s cache refresh window. But this is no different than before, and we've done things to make this a very unlikely occurrence, including the standard delayed failover and now deferred notification handling (which means faster setting of the dirty cache flag).

Second, if S1 goes down we'll lose the avail duration alert completely because the EJB Timer will be lost. This is unfortunate but unlikely, and I think that, in general, some loss of monitoring may be expected when a monitoring server is lost. I think the benefit of proper alerting in this scenario, nearly all the time, outweighs the downside.
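The trade-off can be sketched with another toy model in Python. This is not the EJB Timer API; `LocalTimerService` and `avail_duration_check` are hypothetical names, and the model only assumes what is stated above: the timer runs on the server that scheduled it, and pending timers are lost if that server goes down.

```python
class LocalTimerService:
    """Toy stand-in for a per-server timer service (hypothetical, not the EJB API)."""

    def __init__(self, server_name):
        self.server_name = server_name
        self.alive = True
        self._timers = []  # list of (due_time, callback) pairs

    def schedule(self, due_time, callback):
        """Register a callback to run on this server at `due_time`."""
        self._timers.append((due_time, callback))

    def crash(self):
        """Losing the server loses its pending timers."""
        self.alive = False
        self._timers.clear()

    def advance_to(self, now):
        """Run every timer due at or before `now`; return the callbacks' results."""
        if not self.alive:
            return []
        due = sorted((t for t in self._timers if t[0] <= now), key=lambda t: t[0])
        self._timers = [t for t in self._timers if t[0] > now]
        return [callback(self.server_name) for _, callback in due]


def avail_duration_check(server_name):
    # Hypothetical check; reports where it ran.
    return f"avail duration check ran on {server_name}"


s1 = LocalTimerService("S1")
s1.schedule(300, avail_duration_check)
print(s1.advance_to(300))  # ['avail duration check ran on S1']

lost = LocalTimerService("S1")
lost.schedule(300, avail_duration_check)
lost.crash()
print(lost.advance_to(300))  # []
```

The first case is the normal path, where the check always runs on the scheduling server. The second is the weakness described above, where the check never runs at all.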

Comment 2 Jay Shaughnessy 2013-11-12 18:44:12 UTC

master commit 14dca980d0b5794b5c13aebe5add5b61fd068a06
Author: Jay Shaughnessy <jshaughn>
Date:   Tue Nov 12 13:33:32 2013 -0500

 Convert avail duration condition checking from quartz job to EJB Timer to
 ensure that the job executes on the HA server initiating the job.  This
 server presumably still serves the relevant agent and therefore is the
 server which should have its global cache updated immediately upon firing
 of the alert, in order to quickly get recovery alert conditions into the
 global cache.

 The only downside is that loss of the server (in an HA env) will mean loss
 of the avail duration check, and potential alert.  In a non-HA env the issue
 is moot because there would be nowhere to execute the quartz job either.

Comment 3 Heiko W. Rupp 2014-04-23 12:30:09 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
