Bug 1029553

Summary: Recovery alerts involving availability may not fire in HA environment
Product: [Other] RHQ Project
Reporter: Jay Shaughnessy <jshaughn>
Component: Alerts, Core Server
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: high
Version: 4.9
CC: hrupp
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1030108
Last Closed: 2014-04-23 12:30:09 UTC
Type: Bug
Bug Blocks: 1030108

Description Jay Shaughnessy 2013-11-12 15:24:27 UTC
This is an offshoot of Bug 1003132 Comment 15 (copied here):

...This is still more of an issue in an HA environment than previously stated. 

Here's the thing: consider an HA env with servers S1 and S2. Agent A monitors resource R and is talking to S1. R has an avail duration alert def for StaysDown5Minutes. R goes down. Agent A reports R down to S1, and S1 schedules a quartz job to fire in 5 minutes. 5 minutes later the quartz job fires and R is still down. So the duration alert fires, the global cache on the server handling the quartz job is updated immediately, and the other server is updated within 30 seconds.

But which server handled the quartz job? For some reason the assumption has been that it would be S1. When this was originally designed it didn't matter which server processed the quartz job doing the avail duration check, because cache reloads were always handled by the 30s recurring jobs executing on each server. But the work for this issue, where we updated the global cache immediately, I think made a bad assumption: that the update would happen on, in the case above, S1, the same server that scheduled the job.

That is false. It could be S1 or S2, as Quartz gives no guarantee which clustered scheduler will pick up the job. So even with no failover, the server servicing the relevant agent may not get the fast cache reload.
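The effect can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, field, and method names below are hypothetical and are not RHQ or Quartz classes, and the cache is reduced to a single boolean per server. It shows that the immediate reload lands on whichever server executed the job, which is not necessarily the server the agent talks to.

```java
/** Toy model of the HA scenario; all names are hypothetical, not RHQ or Quartz classes. */
final class ClusteredJobSketch {

    /** One HA server with its own copy of the alert condition cache. */
    static final class Server {
        final String name;
        // True once the recovery alert condition has been loaded into this server's cache.
        boolean recoveryConditionCached = false;

        Server(String name) {
            this.name = name;
        }

        void reloadCacheNow() {
            recoveryConditionCached = true;
        }
    }

    /**
     * Fires the duration check on executingServer and reports whether the server
     * the agent talks to can already evaluate the recovery condition.
     */
    static boolean agentServerReadyAfterFire(Server agentServer, Server executingServer) {
        // The immediate reload happens only on the server that ran the job.
        executingServer.reloadCacheNow();
        return agentServer.recoveryConditionCached;
    }

    public static void main(String[] args) {
        // Case 1: the clustered scheduler happens to run the job on S1, the agent's server.
        Server s1 = new Server("S1");
        System.out.println("job on S1, agent on S1, ready: " + agentServerReadyAfterFire(s1, s1));

        // Case 2: the clustered scheduler runs the job on S2 instead.
        Server agentS1 = new Server("S1");
        Server s2 = new Server("S2");
        System.out.println("job on S2, agent on S1, ready: " + agentServerReadyAfterFire(agentS1, s2));
    }
}
```

In the second case S1 would have to wait for its own recurring cache reload (up to 30 seconds) before it could match a recovery condition.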

Comment 1 Jay Shaughnessy 2013-11-12 15:35:30 UTC
I think the only solution here is to convert the avail duration checking from a clustered quartz trigger to an EJB Timer.

The free version of Quartz does not provide the ability to designate a specific server for executing a trigger in a clustered scheduler (this is the TC Where feature that exists in the pay version). So the only obvious way to ensure that the avail duration check is executed on the scheduling server, S1 in the example above, is an EJB Timer. The assumption here is that the relevant agent will still be talking to S1, and therefore that S1 is the server needing the immediate global cache reload.
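A minimal sketch of the server-local idea follows. It does not use the EJB Timer API; a plain JDK ScheduledExecutorService stands in for the timer so the example runs without a container, and every class and method name is hypothetical. The point it illustrates is only that a timer owned by the scheduling server fires on that same server, so the alert and the immediate cache reload happen where the agent is connected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Stand-in for a server-local timer; hypothetical names, not RHQ or EJB API. */
final class LocalDurationCheckSketch {

    static final class Server {
        // Each server owns its timer, so a scheduled check can only run here.
        private final ScheduledExecutorService localTimer =
                Executors.newSingleThreadScheduledExecutor();
        volatile boolean alertFired = false;
        volatile boolean cacheReloaded = false;

        /** Schedules the avail duration check on this server; the latch is released after it runs. */
        CountDownLatch scheduleDurationCheck(long delayMillis, BooleanSupplier stillDown) {
            CountDownLatch done = new CountDownLatch(1);
            localTimer.schedule(() -> {
                if (stillDown.getAsBoolean()) {
                    alertFired = true;    // duration condition satisfied
                    cacheReloaded = true; // immediate reload on the scheduling server
                }
                done.countDown();
            }, delayMillis, TimeUnit.MILLISECONDS);
            return done;
        }

        void shutdown() {
            localTimer.shutdownNow();
        }
    }

    /** Runs one check on a fresh server; true if the alert fired and that server's cache was reloaded. */
    static boolean firesAndReloadsLocally(long delayMillis, boolean resourceStillDown) {
        Server s1 = new Server();
        try {
            CountDownLatch done = s1.scheduleDurationCheck(delayMillis, () -> resourceStillDown);
            boolean ran = done.await(5, TimeUnit.SECONDS);
            return ran && s1.alertFired && s1.cacheReloaded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            s1.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("R still down: " + firesAndReloadsLocally(50, true));
        System.out.println("R recovered:  " + firesAndReloadsLocally(50, false));
    }
}
```

The delay is in milliseconds here only to keep the demo quick; the scenario above uses 5 minutes.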

There are two weaknesses:

First, in a failover situation we will still be subject to the <= 30s cache refresh window. But this is no different than before, and we've done things to make this a very unlikely occurrence, including the standard delayed failover and now deferred notification handling (which means faster setting of the dirty cache flag).

Second, if S1 goes down we'll lose the avail duration alert completely because the EJB Timer will be lost. This is unfortunate but unlikely, and I think in general some loss of monitoring when losing a monitoring server may be expected. The benefit of proper alerting in this scenario, nearly all the time, I think outweighs the downside.
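The second weakness can be sketched the same way. Again this is an illustration under assumptions, not the EJB Timer API: a JDK ScheduledExecutorService plays the role of S1's local timer, and the names are hypothetical. Stopping the "server" before the delay elapses drops the pending check, so the alert never fires.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Models losing the scheduling server before its local check fires; hypothetical names. */
final class LostTimerSketch {

    /** Schedules one check on a server-local timer, stops the server, and returns the number of dropped checks. */
    static int droppedChecksOnServerLoss(AtomicBoolean alertFired) {
        ScheduledExecutorService s1Timer = Executors.newSingleThreadScheduledExecutor();
        // The StaysDown5Minutes check, pending on S1 only.
        s1Timer.schedule(() -> alertFired.set(true), 5, TimeUnit.MINUTES);
        // S1 goes down: pending local work is discarded and no other server knows about it.
        List<Runnable> dropped = s1Timer.shutdownNow();
        return dropped.size();
    }

    public static void main(String[] args) {
        AtomicBoolean alertFired = new AtomicBoolean(false);
        int dropped = droppedChecksOnServerLoss(alertFired);
        System.out.println("dropped checks: " + dropped + ", alert fired: " + alertFired.get());
    }
}
```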

Comment 2 Jay Shaughnessy 2013-11-12 18:44:12 UTC
master commit 14dca980d0b5794b5c13aebe5add5b61fd068a06
Author: Jay Shaughnessy <jshaughn>
Date:   Tue Nov 12 13:33:32 2013 -0500

 Convert avail duration condition checking from quartz job to EJB Timer to
 ensure that the job executes on the HA server initiating the job.  This
 server presumably still serves the relevant agent and therefore is the
 server which should have its global cache updated immediately upon firing
 of the alert, in order to quickly get recovery alert conditions into the
 global cache.

 The only downside is that loss of the server (in an HA env) will mean loss
 of the avail duration check, and potential alert.  In a non-HA env the issue
 is moot because there would be nowhere to execute the quartz job as well.

Comment 3 Heiko W. Rupp 2014-04-23 12:30:09 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.