Bug 1030108 - Recovery alerts involving availability may not fire in HA environment
Summary: Recovery alerts involving availability may not fire in HA environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server, Monitoring - Alerts
Version: JON 3.1.2
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ER07
: JON 3.2.0
Assignee: Jay Shaughnessy
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On: 1029553
Blocks: 1012435 1030110
TreeView+ depends on / blocked
 
Reported: 2013-11-13 22:58 UTC by Larry O'Leary
Modified: 2018-12-03 20:40 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1029553
: 1030110 (view as bug list)
Environment:
Last Closed: 2014-01-02 20:34:16 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
recoveryAlert (145.69 KB, image/png)
2013-12-04 15:33 UTC, Armine Hovsepyan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 535273 0 None None None Never

Description Larry O'Leary 2013-11-13 22:58:47 UTC
This issue is pretty serious in how it relates to availability duration alerts combined with recovery when applied to an HA environment. The combination will be unpredictable alert state and missing recovery alerts.

+++ This bug was initially created as a clone of Bug #1029553 +++

This is an offshoot of Bug 1003132 Comment 15 (copied here):

...This is still more of an issue in an HA environment than previously stated. 

Here's the thing,  consider an HA env with servers S1 and S2.  Agent A monitors resource R and is talking to S1.  R has an avail duration alert def for StaysDown5Minutes.  R goes down.  Agent A reports R down to S1 and S1 schedules a quartz job to fire in 5 minutes.  5 minutes later the quartz job fires, R is still down. So, the duration alert fires, the global cache on the server handling the quartz job is immediately updated and the other server is updated within 30 seconds.

But which server handled the quartz job?  For some reason the assumption has been that it would be S1.  When this was originally designed it didn't matter which server processed the quartz job doing the avail duration check.  Because cache reloads were handled always by the 30s recurring jobs executing on each server. But the work for this issue, where we updated the global cache immediately, I think made a bad assumption that it would be updating on, in the case above, S1 - the same server that scheduled the job.

That is false.  It could be S1 or S2 as Quartz gives no guarantee which clustered scheduler will pick up the job.  So, even with no failover, the server servicing the relevant agent may not get the fast cache reload.

--- Additional comment from Jay Shaughnessy on 2013-11-12 10:35:30 EST ---


I think the only solution here is to convert the avail duration checking from a clustered quartz trigger to an EJB Timer.

The free version of Quartz does not provide for the ability to designate a specific server for executing a trigger in a clustered scheduler(this is the TC Where feature that exists in the pay version).  So, the only obvious way to ensure that the avail duration check is executed on the scheduling server, S1 in the example above, is an EJB Timer.  The assumption here is that the relevant agent will still be talking to S1, and therefore that is the server needing the immediate global cache reload. 

There are two weaknesses:

In a failover situation we will still be subject to the <= 30s cache refresh window.  But this is not different than before, and we've done things to make this a very unlikely occurrence, including the standard delayed failover and now deferred notification handling (which means faster setting of the dirty cache flag).

Second, if S1 goes down we'll lose the avail duration alert completely because the EJB Timer will be lost.  This is unfortunate but unlikely, and I think in general a loss of some monitoring when losing a monitoring server, may be expected.  The benefit of proper alerting in this scenario, nearly all the time, I think outweighs the downside.

--- Additional comment from Jay Shaughnessy on 2013-11-12 13:44:12 EST ---


master commit 14dca980d0b5794b5c13aebe5add5b61fd068a06
Author: Jay Shaughnessy <jshaughn>
Date:   Tue Nov 12 13:33:32 2013 -0500

 Convert avail duration condition checking from quartz job to EJB Timer to
 ensure that the the job executes on the HA server initiating the job.  This
 server presumably still serves the relevant agent and therefore is the
 server which should have its global cache updated immediately upon firing
 of the alert, in order to quickly get recovery alert conditions into the
 global cache.

 The only downside is that loss of the server (in an HA env) will mean loss
 of the avail duration check, and potential alert.  In a non-HA env the issue
 is moot because there would be no where to execute the quartz job as well.

Comment 1 Jay Shaughnessy 2013-11-14 16:05:36 UTC
Test Case:
Testing this is more implicit than explicit.  You want an HA environment (at least two servers) and the to ensure proper execution of avail duration alerting with notifications, and associated avail recovery alerting.  

So you want two alert defs for an EAP resource:

AD-1:  "Stays Down"
  -- Stays DOWN for 5 minutes
  -- disable on firing = true
  -- set up some notifications, like e-mail and/or operations (not a restart)

AD-2: "Stays Down - Recovery"
  -- Goes UP
  -- recovers "Stays Down"

1) bring down the EAP
2) after 5 minutes, bring up the EAP

The alert should fire, the notifications should execute and the recovery alert should fire.

Note that this is not the exact replication for the customer problem, which is hard to reproduce, happening only inconsistently in a high volume environment. See Larry for more details.  But this test should at least ensure that there are no regressions.


release/jon3.2.x commit a2320de3b155a1f2bf69b516dd0e61a1b02d73c0
Auhor: Jay Shaughnessy <jshaughn>
Date:   Thu Nov 14 10:54:09 2013 -0500

 Convert avail duration condition checking from quartz job to EJB Timer to
 ensure that the the job executes on the HA server initiating the job.  This
 server presumably still serves the relevant agent and therefore is the
 server which should have its global cache updated immediately upon firing
 of the alert, in order to quickly get recovery alert conditions into the
 global cache.

 The only downside is that loss of the server (in an HA env) will mean loss
 of the avail duration check, and potential alert.  In a non-HA env the issue
 is moot because there would be no where to execute the quartz job as well.

 Cherry-Pick Master: 14dca980d0b5794b5c13aebe5add5b61fd068a06

Comment 2 Simeon Pinder 2013-11-19 15:48:41 UTC
Moving to ON_QA as available for testing with new brew build.

Comment 3 Larry O'Leary 2013-11-21 15:30:27 UTC
Please note that the following steps are what was used to reproduce this issue in 3.1.2. However, as indicated in comment #1 they really won't be as applicable to 3.2+ as the job scheduler has been changed from Quartz to an EJB timer. We should still test the scenario but do not expect to see the same behavior (log messages) in 3.2+ as the availability duration job should always run on the same server that scheduled it. The following instructions are just for historical purposes to capture the reason for this fix.

1.  Install and start EAP 6 standalone server.
2.  Install a two server (HA) JON 3.1.2 system.
4.  Start server-02 of JON HA system and wait for it to come up.
5.  Start server-01 of JON HA system and wait for it to come up.
    Be sure server 2 is started first so Quartz is running there.
6.  Start agent in foreground.
7.  Import EAP 6 standalone server into inventory and configure connection settings.
8.  Create new _stays down for 2 minutes_ alert definition for EAP resource:

    *Name*: `Alert - Profile Down`
    *Condition*:
        *Fire alert when*:          _ANY_
        *Condition Type*:           _Availability Duration_
        *Availability Duration*:    _Stays Down_
        *Duration*:                 `2` _minutes_
    *Recovery*:
        *Disable When Fired*:   _Yes_

9.  Create new _recovery_ alert definition for EAP default resource:

    *Name*: `Recovery - Profile Down`
    *Condition*:
        *Fire alert when*:  _ANY_
        *Condition Type*:   _Availability Change_
        *Availability*:     _Goes up_
    *Recovery*:
        *Recovery Alert*:   _Alert - Profile Down_

10. From outside of JBoss ON, shutdown EAP server.
11. From agent prompt, execute avail -f
12. Verify EAP availability shows DOWN.
13. Wait approximately 1 minute and 45 seconds.
14. Start EAP server from outside of JON.
15. Wait approximately 15 seconds
16. From agent prompt, execute avail -f



After shutdown of EAP and availability report showing EAP is down (step 11), server-01's log with show a message similar to:

    2013-11-20 18:20:03,792 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] Agent [Agent BZ1025566 on jboss.example.com]: processing availability report of size: 107
    2013-11-20 18:20:04,213 DEBUG [org.rhq.enterprise.server.alert.engine.model.AvailabilityDurationCacheElement] Scheduling availability duration job for [Nov 20, 2013 6:22:04 PM]

Pay special attention to the time displayed in the "Scheduling availability duration job for" message. Approximately 10 to 15 seconds before this time, you will want to start EAP back up (step 14).

About 4 seconds before the execution of the availability duration job, you will invoke a full availability scan on the agent (step 16). This will result in a new availability report being sent to server-01.

    2013-11-20 18:22:10,887 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] Agent [Agent BZ1025566 on jboss.example.com]: processing availability report of size: 107

For the timing to work out:
-   the availability report must happen after the availability duration job is scheduled to execute
-   the availability duration job must execute before the cache is reloaded on server-01
-   the availability report must trigger or happen before the cache reload on server-01

In the test scenario:
-   The agent is connected to server-01
-   Quartz is running on server-02
-   At 18:20:03,792 server-01 marked EAP as DOWN
-   At 18:20:04,213 server-01 scheduled a job on server-02 (Quartz) to check the availability duration at 18:22:04 (2 minutes)
-   At 18:22:04,263 server-02 executed the availability duration job as scheduled and detected that EAP was still DOWN and fired an alert.
-   At 18:22:04,562 server-02 re-loaded its cache as a result of the alert firing
-   At 18:22:10,887 server-01 marked EAP as UP
-   At 18:22:12,857 server-01 re-loaded its cache as a result of the availability change

Because server-01 didn't have an opportunity to reload its cache between server-02 firing the availability duration alert and EAP being marked as UP, server-01 did not know the availability duration alert had been fired and the recovery alert required evaluation.

server-01:
    2013-11-20 18:20:03,792 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] Agent [Agent BZ1025566 on jboss.example.com]: processing availability report of size: 107
    2013-11-20 18:20:04,213 DEBUG [org.rhq.enterprise.server.alert.engine.model.AvailabilityDurationCacheElement] Scheduling availability duration job for [Nov 20, 2013 6:22:04 PM]
    2013-11-20 18:20:04,250 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Check Availability[size=107] - AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=0, age=42ms ]
    2013-11-20 18:20:04,251 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] mergeAvailabilityReport: AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=0, age=43ms ]
    2013-11-20 18:20:12,829 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] Global cache does not need reloading
    2013-11-20 18:20:12,832 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] No agent caches need reloading
    ...
    2013-11-20 18:22:10,887 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] Agent [Agent BZ1025566 on jboss.example.com]: processing availability report of size: 107
    2013-11-20 18:22:11,320 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Check Availability[size=107] - AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=0, age=0ms ]
    2013-11-20 18:22:11,321 DEBUG [org.rhq.enterprise.server.measurement.AvailabilityManagerBean] mergeAvailabilityReport: AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=0, age=1ms ]
    2013-11-20 18:22:12,857 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Start reloading global cache
    2013-11-20 18:22:12,858 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loading Global Condition Cache...
    2013-11-20 18:22:12,866 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 1 Alert Condition Composites of type 'Resource Availability'
    2013-11-20 18:22:12,872 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Control Action'
    2013-11-20 18:22:12,878 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Resource Configuration Property Value Change'
    2013-11-20 18:22:12,882 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Resource Availability for Duration'
    2013-11-20 18:22:12,883 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded Global Condition Cache
    2013-11-20 18:22:12,883 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Finished reloading global cache
    2013-11-20 18:22:12,883 INFO  [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] server-01 took [26]ms to reload global cache
    2013-11-20 18:22:12,886 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] Agent[id=10001] is stale 
    2013-11-20 18:22:12,886 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Start reloading cache for agent[id=10001]
    2013-11-20 18:22:12,887 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loading Alert Condition Caches for agent[id=10001]...
    2013-11-20 18:22:12,894 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Measurement Threshold'
    2013-11-20 18:22:12,904 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Measurement Baseline'
    2013-11-20 18:22:12,909 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Measurement Value Change'
    2013-11-20 18:22:12,924 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Measurement Trait'
    2013-11-20 18:22:12,930 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Log Event'
    2013-11-20 18:22:12,935 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Drift Detected'
    2013-11-20 18:22:12,942 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Measurement Value Range'
    2013-11-20 18:22:12,942 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded Alert Condition Caches for agent[id=10001]
    2013-11-20 18:22:12,942 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Finished reloading cache for agent[id=10001]
    2013-11-20 18:22:12,942 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Reloaded agent[id=10001] cache

server-02:
    2013-11-20 18:22:04,263 DEBUG [org.rhq.enterprise.server.alert.engine.model.AvailabilityDurationCacheElement] comparing DOWN against AvailabilityDurationCacheElement[ alertConditionTriggerId=10001, alertConditionOperator=AVAIL_DURATION_DOWN(120), alertConditionValue=UNKNOWN ] match
    2013-11-20 18:22:04,263 DEBUG [org.rhq.enterprise.server.alert.engine.model.AvailabilityDurationCacheElement] element updated: AvailabilityDurationCacheElement[ alertConditionTriggerId=10001, alertConditionOperator=AVAIL_DURATION_DOWN(120), alertConditionValue=DOWN ]
    2013-11-20 18:22:04,364 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Check AvailabilityDuration[size=1] - AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=1, age=101ms ]
    2013-11-20 18:22:04,365 DEBUG [org.rhq.enterprise.server.scheduler.jobs.AlertAvailabilityDurationJob] AlertAvailabilityDurationJob: AlertConditionCacheStats[ created=0, updated=0, deleted=0, matched=1, age=102ms ]
    2013-11-20 18:22:04,374 DEBUG [org.rhq.enterprise.server.alert.engine.jms.AlertConditionConsumerBean] Received message: ActiveAlertConditionMessage[ value= DOWN, AbstractAlertConditionMessage[ alertConditionId=10001, timestamp=1384993324263 ] ]
    2013-11-20 18:22:04,545 DEBUG [org.rhq.enterprise.server.alert.engine.jms.AlertConditionConsumerBean] Checking for cache reload due to alert firing
    2013-11-20 18:22:04,562 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Start reloading global cache
    2013-11-20 18:22:04,563 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loading Global Condition Cache...
    2013-11-20 18:22:04,573 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 1 Alert Condition Composites of type 'Resource Availability'
    2013-11-20 18:22:04,594 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Control Action'
    2013-11-20 18:22:04,603 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Resource Configuration Property Value Change'
    2013-11-20 18:22:04,611 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded 0 Alert Condition Composites of type 'Resource Availability for Duration'
    2013-11-20 18:22:04,611 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Loaded Global Condition Cache
    2013-11-20 18:22:04,611 DEBUG [org.rhq.enterprise.server.alert.engine.internal.AlertConditionCacheCoordinator] Finished reloading global cache
    2013-11-20 18:22:04,612 INFO  [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] server-02 took [50]ms to reload global cache
    2013-11-20 18:22:04,615 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] No agent caches need reloading
    2013-11-20 18:22:07,898 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] Global cache does not need reloading
    2013-11-20 18:22:07,902 DEBUG [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] No agent caches need reloading

Comment 4 Simeon Pinder 2013-11-22 05:14:07 UTC
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 5 Armine Hovsepyan 2013-12-04 15:32:00 UTC
verified in HA env using CR1
screen-shot attached

Comment 6 Armine Hovsepyan 2013-12-04 15:33:22 UTC
Created attachment 832708 [details]
recoveryAlert


Note You need to log in before you can comment on or make changes to this bug.