Bug 1030111 - Recovery alert cache refresh needs to happen prior to alert notification processing
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Monitoring - Alerts
Version: JON 3.1.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ER07
Target Release: JON 3.2.0
Assigned To: Jay Shaughnessy
QA Contact: Mike Foley
Depends On: 1028487
Blocks: 1012435 1030114
Reported: 2013-11-13 18:06 EST by Larry O'Leary
Modified: 2014-01-02 15:35 EST
CC List: 1 user

Doc Type: Bug Fix
Clone Of: 1028487
Clones: 1030114
Last Closed: 2014-01-02 15:35:02 EST
Type: Bug


Attachments
alert.png (161.77 KB, image/png)
2013-12-06 11:09 EST, Armine Hovsepyan
server.log.png (355.49 KB, image/png)
2013-12-06 11:10 EST, Armine Hovsepyan


External Trackers
Red Hat Knowledge Base (Solution) 465423

Description Larry O'Leary 2013-11-13 18:06:29 EST
In a user's environment we see a 3-4 second delay between when the alert fires and when the cache reload is requested, caused by the time it takes to execute the alert notifications in between.

Therefore, to make alerting more reliable, this issue needs to be fixed in JBoss ON as well.

+++ This bug was initially created as a clone of Bug #1028487 +++

Currently, alert notifications are processed in the same transaction as the alert creation, recovery alert activation, etc. Processing notifications inside this transaction causes a few problems:

1) It extends an already complicated transaction, potentially holding locks longer and delaying both the alert commit and the commit of the global cache refresh flags (which other HA servers need to see before they can refresh), etc.

2) Notifications can initiate recovery actions, such as executing resource operations or invoking CLI scripts. These must not run before we have updated the global alert condition cache, because that update is what starts condition matching on newly activated recovery alert definitions. Otherwise the resource may actually recover before the recovery alert is ready to detect it.


The alert notification processing should happen outside of the alert creation transaction and after cache refresh.
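Problem 2 is an ordering race, and a tiny simulation makes it concrete. The following is a minimal sketch under stated assumptions, not JBoss ON code: the class and method names are hypothetical, a plain in-memory set stands in for the global alert condition cache, and a notification-triggered recovery is assumed to be evaluated immediately against whatever the cache holds at that moment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of why notifications must run after the cache refresh.
public class RecoveryOrderingSketch {
    // Stand-in for the global alert condition cache: ids of definitions being matched.
    static final Set<Integer> conditionCache = new HashSet<>();
    static final List<String> firedAlerts = new ArrayList<>();
    static final int RECOVERY_DEF_ID = 2; // hypothetical recovery alert definition id

    // The resource recovers; only a recovery definition already in the cache can fire.
    static void onResourceRecovered() {
        if (conditionCache.contains(RECOVERY_DEF_ID)) {
            firedAlerts.add("recovery alert");
        }
    }

    // A notification that performs a recovery action (e.g. a restart operation).
    static void processNotifications() {
        onResourceRecovered();
    }

    // Loads the newly activated recovery definition into the cache.
    static void refreshCache() {
        conditionCache.add(RECOVERY_DEF_ID);
    }

    static List<String> run(boolean notifyBeforeRefresh) {
        conditionCache.clear();
        firedAlerts.clear();
        firedAlerts.add("problem alert"); // alert created, recovery definition activated
        if (notifyBeforeRefresh) {        // old ordering
            processNotifications();
            refreshCache();
        } else {                          // ordering requested in this bug
            refreshCache();
            processNotifications();
        }
        return new ArrayList<>(firedAlerts);
    }

    public static void main(String[] args) {
        System.out.println("old order: " + run(true));
        System.out.println("new order: " + run(false));
    }
}
```

With the old ordering only the problem alert is recorded, because the recovery happens before its definition is in the cache; with the new ordering the recovery alert fires as well.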

--- Additional comment from Jay Shaughnessy on 2013-11-08 10:57:06 EST ---


master commit 28becd282f8cd3ed4327a56ea0c08f8431845dba
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Fri Nov 8 10:42:46 2013 -0500

Restructure SLSB methods (locals only, no remote changes) to process alert
notifications later in the workflow, after the alert is committed and after
we have a chance to update the condition caches (for more reliable recovery
alerting).  Needed to be able to pass back the new alert through the call chain.
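"Pass back the new alert through the call chain" describes a change in method shape: the inner, transactional step returns the alert it created instead of sending notifications itself, so an outer caller can refresh the condition caches first and notify afterwards. The sketch below is hypothetical; its method names are not the actual SLSB methods touched by the commit, and transaction boundaries are modeled only as log entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of returning the alert so the caller controls ordering.
public class AlertCallChainSketch {
    record Alert(int id, int definitionId) {}

    static final List<String> events = new ArrayList<>();
    private static int nextId = 1;

    // Stands in for the transactional step: creates and commits the alert,
    // then hands it back rather than processing notifications inside the transaction.
    static Alert createAndCommitAlert(int definitionId) {
        Alert alert = new Alert(nextId++, definitionId);
        events.add("commit alert " + alert.id());
        return alert;
    }

    static void refreshConditionCaches() {
        events.add("refresh caches");
    }

    static void sendNotifications(Alert alert) {
        events.add("notify alert " + alert.id());
    }

    // Outer caller, running after the creation transaction has committed.
    static Alert fireAlert(int definitionId) {
        Alert alert = createAndCommitAlert(definitionId);
        refreshConditionCaches();
        sendNotifications(alert);
        return alert;
    }

    public static void main(String[] args) {
        fireAlert(10);
        System.out.println(events); // [commit alert 1, refresh caches, notify alert 1]
    }
}
```

Because the alert is returned rather than consumed internally, the notification step no longer has to live inside the creation transaction.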
Comment 1 Jay Shaughnessy 2013-11-14 11:12:43 EST
Test Case:
The test case for Bug 1030108 is also effective for testing the code paths for this bug.  Additionally, ensure every type of notification is successfully executed.

release/jon3.2.x commit e6320059f3bc44dcc9b5f4b3ca348460a1556e2f
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Thu Nov 14 11:11:55 2013 -0500

 Restructure SLSB methods (locals only, no remote changes) to process alert
 notifications later in the workflow, after the alert is committed and after
 we have a chance to update the condition caches (for more reliable recovery
 alerting).  Needed to be able to pass back the new alert through the call
 chain.

 Cherry-Pick Master: 28becd282f8cd3ed4327a56ea0c08f8431845dba
Comment 2 Simeon Pinder 2013-11-19 10:49:08 EST
Moving to ON_QA as available for testing with new brew build.
Comment 3 Simeon Pinder 2013-11-22 00:14:28 EST
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
Comment 4 Armine Hovsepyan 2013-12-06 11:09:18 EST
Created attachment 833674
alert.png
Comment 5 Armine Hovsepyan 2013-12-06 11:10:10 EST
Created attachment 833675
server.log.png
Comment 6 Armine Hovsepyan 2013-12-06 11:11:44 EST
Verified in 3.2 GA: the cache reload is called after the alert fires and before the recovery alert.
