Bug 1028636

Summary: Alerts get triggered even when dampening rules have not been satisfied due to missing InactiveAlertConditionMessage
Product: [Other] RHQ Project
Component: Alerts
Reporter: Lukas Krejci <lkrejci>
Assignee: Lukas Krejci <lkrejci>
QA Contact: Mike Foley <mfoley>
CC: hrupp, lkrejci, loleary
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: 4.10
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Clone Of: 1025491
Bug Depends On: 1025491
Last Closed: 2014-04-23 12:30:44 UTC

Description Lukas Krejci 2013-11-08 23:03:37 UTC
+++ This bug was initially created as a clone of Bug #1025491 +++

Description of problem:
When an alert definition exists that uses a dampening rule that requires the condition to match three consecutive times, the alert is sometimes fired even though the required count has not been reached.

For example, take the following alert definition:
    
    *   *Name*: `Dampening Alert`
    *   *Condition Type*: _Measurement Absolute Value Threshold_
    *   *Metric*: _Pattern 3 Metric_
    *   *Comparator*: _= (Equal to)_
    *   *Metric Value*: `0`
    *   *Dampening*: _Consecutive_
    *   *Occurrences*: `3`

The expectation is that Pattern 3 Metric must equal 0 three consecutive times before the alert fires. However, the alert is triggered when the following pattern is seen:

    0.05000050000500005
    0.049999333342222105
    0
    0.04999966666888887
    0.04999966666888887
    0
    0.049997666775550474
    0.050003166867234924
    0

This seems very similar to Bug 698320 (Dampening on consecutive count does not work for equals comparision). However, that bug was already fixed in JBoss ON 3.1.1, and local testing with the test case identified in bug 698320 confirms that the issue is no longer present in 3.1.1, so I do not believe this is the same issue.

Instead, in this case it seems that the unmatched condition is not triggering the InactiveAlertConditionMessage. It is not clear why this message would not be generated or handled.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

Additional info:
During my unsuccessful attempt to reproduce this issue, a few things stood out as odd when comparing my test results with what was provided in the original case:

 - Log seems to indicate that the received condition message is always ActiveAlertConditionMessage
 
   In local testing it appears that this should be InactiveAlertConditionMessage after the previous non-matching condition occurred. This seems to point to the fact that when the non-matching condition is evaluated, the InactiveAlertConditionMessage is never received/sent.
   
 - Missing log messages stating: Deleted 2 stale AlertDampeningEvents for AlertDefinition[id=10001]
 
   It is not clear to me what these messages refer to but in my test case, I always see this when the condition is evaluated to true.

--- Additional comment from Lukas Krejci on 2013-11-07 08:19:38 EST ---

The logs lead me to believe the following has happened:

1) The alert condition receives data every 20 minutes.
2) Once an hour, there's a data purge job that includes the baselines calculation.
3) The baseline calculation job marks all the agents as needing their alert condition caches reloaded.
4) An HA heartbeat (running every 30s) detects the need to reload the agent alert condition caches.
5) After a cache reload, the "counting" needed for dampening is broken, because we lose the state of the cache elements (their "active" state). I.e. the dampening state still has the count > 0, because it didn't receive the inactivity message, but the corresponding cache element is not active anymore and so will never actually send the inactivity message again, unless the condition is satisfied and then unsatisfied within the 1 hour cycle between the cache reloads.
6) As a consequence of 5), once the count reaches the given dampening constraint (3), we will fire the alert every time the condition is satisfied. This is because the condition is evaluated every 20 minutes and is satisfied every 3rd time (i.e. once in an hour, AFTER the caches have been cleaned).

A time series to explain it more visually:

(dampening count = 2)
(active = true)
|
|
value = 0, condition satisfied -> dampening count = 3
**** alert fires ****
|
|
(data purge -> baseline calc -> cache reload -> active = false)
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value = 0, condition satisfied -> active = true, dampening count == 3
**** alert fires ****
|
|
(data purge -> baseline calc -> cache reload -> active = false)
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value = 0, condition satisfied -> active = true, dampening count == 3
**** alert fires ****
|
|
...
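The time series above can be reproduced with a small simulation. This is a simplified model, not the RHQ implementation: `BooleanCacheElement`, `run`, the `RELOAD` marker, and the string message names are hypothetical, and the dampening count is capped at the required number of occurrences to mirror the "dampening count == 3" shown in the series.

```python
from dataclasses import dataclass

RELOAD = "reload"  # hypothetical marker for a cache reload between samples


@dataclass
class BooleanCacheElement:
    active: bool = False  # a freshly loaded element starts out inactive

    def process(self, matched):
        """Return 'ACTIVE', 'INACTIVE', or None when no message is sent."""
        if matched:
            self.active = True
            return "ACTIVE"
        if self.active:
            self.active = False
            return "INACTIVE"
        return None  # already inactive: nothing is sent


def run(timeline, occurrences=3, count=0, active=False):
    """Return (steps at which the alert fired, final dampening count)."""
    element = BooleanCacheElement(active)
    fired_at = []
    for step, event in enumerate(timeline):
        if event == RELOAD:
            element = BooleanCacheElement()  # the "active" state is lost
            continue
        message = element.process(event == 0)
        if message == "ACTIVE":
            count = min(count + 1, occurrences)
            if count == occurrences:
                fired_at.append(step)
        elif message == "INACTIVE":
            count = 0  # only an inactivity message resets the count
    return fired_at, count


with_reloads = [0, RELOAD, 0.05, 0.05, 0, RELOAD, 0.05, 0.05, 0]
without_reloads = [0, 0.05, 0.05, 0, 0.05, 0.05, 0]

# Start where the time series starts: dampening count = 2, active = true.
print(run(with_reloads, count=2, active=True))     # ([0, 4, 8], 3)
print(run(without_reloads, count=2, active=True))  # ([0], 1)
```

With reloads, the inactivity message is never sent, the count stays at 3, and every matching sample fires the alert. Without reloads, the first non-matching sample resets the count and only the initial alert fires.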

Now for the ways to fix this. I think all we need is for the state of the individual cache elements to survive the cache reload. In that case, the counting doesn't get broken and we should dampen as expected.

This could be quite easy (i.e. just copy the states from the old cache before we make the new reloaded cache active), but this would still break when the agent failed over to another server in HA: the new server would load the cache but would have nowhere to transfer the state of the cache elements from, because the state was held at the old server.

So the proper solution here is to store the state of the cache elements in the database. The cache elements correspond to individual cache conditions of the alert definitions so one straightforward place to store the activity state would be along with the alert conditions.
 
Another way to fix this would be to drop the notion of the activity state of the cache elements (and thus make the cache truly stateless) but that would mean that the inactivity message would be sent every time a stateful condition would not match (which would be, by the nature and purpose of alert conditions, the vast majority of times).

--- Additional comment from Lukas Krejci on 2013-11-07 08:58:59 EST ---

Oh, another solution:

Make the "active" property of the cache elements a tri-state: unknown, true, false. The default value would be unknown and the inactivity message would be sent if the condition was not matched and the activity was either true or unknown.

This would mean we wouldn't have to transfer the state on cache reloads and we also wouldn't have to store it in the database. The failover of the agents to another server would possibly work, too.
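A rough sketch of the tri-state idea, under the same simplified model of the hourly reload cycle. The names `Activity`, `TriStateCacheElement`, `run`, and `RELOAD` are hypothetical and not taken from the RHQ code base; only the three states and the rule "send the inactivity message when the condition does not match and the state is true or unknown" come from the comment above.

```python
from enum import Enum

RELOAD = "reload"  # hypothetical marker for a cache reload between samples


class Activity(Enum):
    UNKNOWN = 0   # default after a (re)load
    ACTIVE = 1
    INACTIVE = 2


class TriStateCacheElement:
    def __init__(self, state=Activity.UNKNOWN):
        self.state = state

    def process(self, matched):
        """Return 'ACTIVE', 'INACTIVE', or None when no message is sent."""
        if matched:
            self.state = Activity.ACTIVE
            return "ACTIVE"
        if self.state in (Activity.ACTIVE, Activity.UNKNOWN):
            self.state = Activity.INACTIVE
            return "INACTIVE"  # also sent when the prior state is unknown
        return None


def run(timeline, occurrences=3, count=0, state=Activity.UNKNOWN):
    """Return (steps at which the alert fired, final dampening count)."""
    element = TriStateCacheElement(state)
    fired_at = []
    for step, event in enumerate(timeline):
        if event == RELOAD:
            element = TriStateCacheElement()  # reload resets to UNKNOWN
            continue
        message = element.process(event == 0)
        if message == "ACTIVE":
            count = min(count + 1, occurrences)
            if count == occurrences:
                fired_at.append(step)
        elif message == "INACTIVE":
            count = 0
    return fired_at, count


timeline = [0, RELOAD, 0.05, 0.05, 0, RELOAD, 0.05, 0.05, 0]

# Same starting point as the time series: count = 2, element active.
print(run(timeline, count=2, state=Activity.ACTIVE))  # ([0], 1)
```

In this model the first non-matching sample after each reload still produces an inactivity message, so the dampening count is reset and the alert does not fire again on isolated matches.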

I need to investigate this further.

Comment 1 Lukas Krejci 2013-11-08 23:08:46 UTC
commit e3a41e4a8ce892a3e314db7991223f668f00fe2e
Author: Lukas Krejci <lkrejci>
Date:   Sat Nov 9 00:06:31 2013 +0100

    [BZ 1028636] Dampening broken by agent alert condition cache reloads
    
    Reloading the cache lost the "activity" state of conditions. This caused
    dampening to never receive inactivity messages and thus misbehave.
    
    Changing the cache elements to default to "active" state solves this at the
    cost of sending 1 redundant inactivity message after a cache reload. This
    is harmless and has no effect on the dampening resolution.
    
    The actual implementation changes the activity boolean to a tristate enum
    of UNKNOWN, ACTIVE and INACTIVE, where UNKNOWN and ACTIVE have the same
    semantics but are logically distinct.

Comment 2 Heiko W. Rupp 2014-04-23 12:30:44 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.