Bug 1025491

Summary:

Alerts get triggered even when dampening rules have not been satisfied due to missing InactiveAlertConditionMessage

Product:

[JBoss] JBoss Operations Network

Reporter:

Larry O'Leary <loleary>

Component:

Monitoring - Alerts

Assignee:

Lukas Krejci <lkrejci>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Mike Foley <mfoley>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

JON 3.1.1

CC:

lkrejci

Target Milestone:

ER07

Target Release:

JON 3.2.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

1028636, 1029656

Environment:

Last Closed:

2014-01-02 20:43:16 UTC

Type:

Bug

Bug Depends On:

Bug Blocks:

1012435, 1028636, 1029656

Attachments:

Description	Flags
Log excerpt from original case where this issue is occurring	none

Description Larry O'Leary 2013-10-31 19:40:06 UTC

Created attachment 818042
Log excerpt from original case where this issue is occurring

Description of problem:
When an alert definition exists that uses a dampening rule that requires the condition to match three consecutive times, the alert is sometimes fired even though the required count has not been reached.

For example, take the following alert definition:
    
    *   *Name*: `Dampening Alert`
    *   *Condition Type*: _Measurement Absolute Value Threshold_
    *   *Metric*: _Pattern 3 Metric_
    *   *Comparator*: _= (Equal to)_
    *   *Metric Value*: `0`
    *   *Dampening*: _Consecutive_
    *   *Occurrences*: `3`

The expectation is that Pattern 3 Metric must be equal to 0, 3 consecutive times. However, the alert is triggered when the following pattern is seen:

    0.05000050000500005
    0.049999333342222105
    0
    0.04999966666888887
    0.04999966666888887
    0
    0.049997666775550474
    0.050003166867234924
    0
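
To make the expectation concrete, here is a minimal sketch of how a "consecutive" dampening rule is expected to count. It is not RHQ code: the function name `consecutive_fire_points` and the choice to restart counting after an alert fires are assumptions made for illustration. Fed the values listed above, it reports no firing at all, because every 0 is followed by a non-zero value.

```python
def consecutive_fire_points(values, target=0, occurrences=3):
    """Return the indexes at which a 'consecutive' dampening rule would fire."""
    fired = []
    run = 0  # length of the current run of matching values
    for i, value in enumerate(values):
        if value == target:
            run += 1
            if run >= occurrences:
                fired.append(i)
                run = 0  # assumption: counting starts over after an alert fires
        else:
            run = 0  # any non-matching value breaks the run
    return fired


reported = [
    0.05000050000500005, 0.049999333342222105, 0,
    0.04999966666888887, 0.04999966666888887, 0,
    0.049997666775550474, 0.050003166867234924, 0,
]
print(consecutive_fire_points(reported))      # []
print(consecutive_fire_points([0, 0, 0, 1]))  # [2]
```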

This seems very similar to Bug 698320 (Dampening on consecutive count does not work for equals comparison). However, that bug was already fixed in JBoss ON 3.1.1, and local testing using the test case identified in bug 698320 confirms that the issue is no longer present in 3.1.1, so I do not believe this is the same issue.

Instead, in this case it seems that the unmatched condition is not triggering the InactiveAlertConditionMessage. It is not clear why this message would not be generated or handled.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

Additional info:
During my unsuccessful attempt to reproduce this issue, some things that stand out as odd when comparing my test results to what was provided in the original case are:

 - Log seems to indicate that the received condition message is always ActiveAlertConditionMessage
 
   In local testing it appears that this should be InactiveAlertConditionMessage after the previous non-matching condition occurred. This seems to point to the fact that when the non-matching condition is evaluated, the InactiveAlertConditionMessage is never received/sent.
   
 - Missing log messages stating: Deleted 2 stale AlertDampeningEvents for AlertDefinition[id=10001]
 
   It is not clear to me what these messages refer to but in my test case, I always see this when the condition is evaluated to true.

Comment 2 Lukas Krejci 2013-11-07 13:19:38 UTC

The logs lead me to believe the following has happened:

1) The alert condition receives data every 20 minutes.
2) Once an hour, there's a data purge job that includes the baselines calculation
3) The baseline calculation job marks all the agents as needing their alert condition caches reloaded
4) An HA heartbeat (running every 30s) detects the need to reload the agent alert condition caches
5) After a cache reload, the "counting" needed for dampening is broken, because we lose the state of the cache elements (their "active" state). I.e. the dampening state still has a count > 0, because it didn't receive the inactivity message, but the corresponding cache element is no longer active and so will never send the inactivity message again, unless the condition is satisfied and then unsatisfied within the 1 hour cycle between the cache reloads.
6) As a consequence of 5), once the count reaches the given dampening constraint (3), we will fire the alert every time the condition is satisfied. This is because the condition is evaluated every 20 minutes and is satisfied every 3rd time (i.e. once an hour, AFTER the caches have been cleaned).

A time series to explain it more visually:

(dampening count = 2)
(active = true)
|
|
value = 0, condition satisfied -> dampening count = 3
**** alert fires ****
|
|
(data purge -> baseline calc -> cache reload -> active = false)
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value = 0, condition satisfied -> active = true, dampening count == 3
**** alert fires ****
|
|
(data purge -> baseline calc -> cache reload -> active = false)
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value != 0, condition not satisfied, active == false, dampening count == 3
|
|
value = 0, condition satisfied -> active = true, dampening count == 3
**** alert fires ****
|
|
...
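
The time series above can be replayed with a small simulation. This is a simplified model built only from the description in this comment, not the actual RHQ classes: `CacheElement`, `Dampening` and `simulate` are hypothetical names, the value `0.05` stands in for the non-zero readings, and the cache reload is modelled as replacing the element with a fresh one whose `active` flag is false.

```python
class CacheElement:
    """Hypothetical stand-in for one alert condition cache element."""

    def __init__(self):
        self.active = False  # a freshly (re)loaded element starts inactive

    def evaluate(self, matched):
        """Return the message passed to the dampening logic, or None."""
        if matched:
            self.active = True
            return "ACTIVE"
        if self.active:
            self.active = False
            return "INACTIVE"
        return None  # already inactive: nothing is sent


class Dampening:
    """Hypothetical consecutive-count dampening state."""

    def __init__(self, occurrences):
        self.occurrences = occurrences
        self.count = 0

    def on_message(self, message):
        """Update the count and report whether the alert fires."""
        if message == "ACTIVE":
            self.count = min(self.count + 1, self.occurrences)
            return self.count >= self.occurrences
        if message == "INACTIVE":
            self.count = 0
        return False


def simulate(hours, reload_cache):
    """Replay three readings per hour; optionally reload the cache hourly."""
    element = CacheElement()
    dampening = Dampening(occurrences=3)
    fired_at = []
    sample = 0
    for _ in range(hours):
        for value in (0.05, 0.05, 0):  # one reading every 20 minutes
            message = element.evaluate(value == 0)
            if dampening.on_message(message):
                fired_at.append(sample)
            sample += 1
        if reload_cache:
            element = CacheElement()  # reload: the active flag is lost
    return fired_at


print(simulate(hours=4, reload_cache=True))   # [8, 11]
print(simulate(hours=4, reload_cache=False))  # []
```

With the hourly reload the count creeps up by one per hour and, once it reaches 3, every later 0 fires the alert. Without the reload the inactivity message resets the count after each 0 and nothing fires.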

Now for the ways to fix this. I think all we need is for the state of the individual cache elements to survive the cache reload. In that case, the counting doesn't get broken and we should dampen as expected.

This could be quite easy (i.e. just copy the states from the old cache before we make the new reloaded cache active) but this would still break when the agent failed over to another server in HA - the new server would load the cache but would have nowhere to transfer the state of the cache elements from - the state was held at the old server.
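
A rough sketch of the "copy the states" idea and of its HA weakness, assuming the cache can be modelled as a plain mapping from a condition id to its active flag. The function `reload_cache` and the ids used here are made up for illustration and do not correspond to RHQ code.

```python
def reload_cache(old_cache, condition_ids):
    """Build the reloaded cache, carrying over known 'active' flags.

    old_cache maps condition id -> active flag; conditions that were not in
    the old cache start out inactive.
    """
    return {cid: old_cache.get(cid, False) for cid in condition_ids}


# Reload on the same server: the state of condition 101 survives.
same_server = reload_cache({101: True, 102: False}, [101, 103])
print(same_server)  # {101: True, 103: False}

# Failover to another server: there is no old cache to copy from,
# so condition 101 comes back inactive and the counting breaks again.
other_server = reload_cache({}, [101, 103])
print(other_server)  # {101: False, 103: False}
```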

So the proper solution here is to store the state of the cache elements in the database. The cache elements correspond to individual cache conditions of the alert definitions so one straightforward place to store the activity state would be along with the alert conditions.
 
Another way to fix this would be to drop the notion of the activity state of the cache elements (and thus make the cache truly stateless) but that would mean that the inactivity message would be sent every time a stateful condition did not match (which would be, by the nature and purpose of alert conditions, the vast majority of the time).

Comment 3 Lukas Krejci 2013-11-07 13:58:59 UTC

Oh, another solution:

Make the "active" property of the cache elements a tri-state: unknown, true, false. The default value would be unknown and the inactivity message would be sent if the condition was not matched and the activity was either true or unknown.

This would mean we wouldn't have to transfer the state on cache reloads and we also wouldn't have to store it in the database. The failover of the agents to another server would possibly work, too.

I need to investigate this further.
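
The tri-state proposal in this comment could look roughly like the following. This is only a sketch of the decision rule described above; the enum name `Activity`, the helper `next_state_and_message` and the string messages are assumptions, not the real RHQ types.

```python
from enum import Enum


class Activity(Enum):
    UNKNOWN = "unknown"    # default after a cache (re)load
    ACTIVE = "active"
    INACTIVE = "inactive"


def next_state_and_message(activity, matched):
    """Return (new activity, message to send or None) for one evaluation."""
    if matched:
        return Activity.ACTIVE, "ACTIVE"
    if activity in (Activity.ACTIVE, Activity.UNKNOWN):
        # Not matched and we were active, or cannot tell: report inactivity.
        return Activity.INACTIVE, "INACTIVE"
    return Activity.INACTIVE, None  # already known to be inactive


# A freshly reloaded element sees two non-matching values in a row.
state, first = next_state_and_message(Activity.UNKNOWN, matched=False)
state, second = next_state_and_message(state, matched=False)
print(first, second)  # INACTIVE None
```

Only the first non-matching evaluation after a reload sends a message, so the extra traffic is limited to one inactivity message per element per reload.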

Comment 4 Lukas Krejci 2013-11-08 23:09:56 UTC

Upstream bug 1028636 has been fixed in master.

Comment 5 Lukas Krejci 2013-11-14 16:19:48 UTC

commit 73e95ca7c090dd54fec5057c2750385875b9a0ec
Author: Lukas Krejci <lkrejci>
Date:   Sat Nov 9 00:06:31 2013 +0100

    [BZ 1028636] Dampening broken by agent alert condition cache reloads
    
    Reloading the cache lost the "activity" state of conditions. This caused
    dampening to never receive inactivity messages and thus misbehave.
    
    Changing the cache elements to default to "active" state solves this at the
    cost of sending 1 redundant inactivity message after a cache reload. This
    is harmless and has no effect on the dampening resolution.
    
    The actual implementation changes the activity boolean to a tristate enum
    of UNKNOWN, ACTIVE and INACTIVE, where UNKNOWN and ACTIVE have the same
    semantics but are logically distinct.
    (cherry picked from commit e3a41e4a8ce892a3e314db7991223f668f00fe2e)

Comment 6 Simeon Pinder 2013-11-19 15:48:15 UTC

Moving to ON_QA as available for testing with new brew build.

Comment 7 Larry O'Leary 2013-11-20 23:06:11 UTC

Steps to reproduce (keep in mind this is for 3.1.2 but will probably work fine for 3.2):

1.  Start JBoss ON 3.1.2 system.
2.  Install pattern-plugin from upstream RHQ 4.4.0 (4.9.0 for 3.2).
3.  Import pattern resource into inventory.
4.  Disable metric collection for all metrics except Pattern 1 Metric.
5.  Set collection interval for Pattern 1 Metric to 2 minutes.
6.  Set pattern to be 0, 1 (1 zero and 1 one).
7.  Create the following alert for the pattern resource:

    1.  Alert _Name_: `Alert - Two 0s in a Row`
    
        *   _Condition Type_: _Measurement Absolute Value Threshold_
        *   _Metric_: _Pattern 1 Metric_
        *   _Comparator_: _= (Equal to)_
        *   _Metric Value_: `0`
        *   _Dampening_: _Consecutive_
        *   _Occurrences_: `2`

8.  Restart agent to reset pattern.
9.  Monitor the graphs page for the last 4 minutes and wait for the pattern of 0 and 1 to occur.
10. After the latest metric value of 0 is received, force a data purge job to run:

        http://localhost:7080/admin/test/control.jsp?mode=dataPurgeJob
        
11. Wait for the next 1 and 0 values to be reported (approximately 4 minutes).


Actual result:
Alert - Two 0s in a Row is fired even though only one 0 occurs (0, 1, 0, 1, 0, 1...)


Expected result:
No alert is triggered as the two consecutive 0s will never occur.
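
The actual and expected results can be compared in a compact model of the scenario. This is a sketch under assumptions, not RHQ code: `alert_fires` is a hypothetical helper, the cache element's state is held in one variable (`True` for active, `False` for inactive, `None` for unknown), and the forced data purge is modelled as resetting that variable to `state_after_reload` right after the reading at index `reload_after`.

```python
def alert_fires(values, reload_after, occurrences, state_after_reload):
    """Return True if the alert fires at any point while replaying values.

    state_after_reload=False models an element that comes back inactive
    after a cache reload; None models an element whose state is unknown.
    """
    active = state_after_reload
    count = 0
    fired = False
    for i, value in enumerate(values):
        if value == 0:
            active = True
            count += 1
            if count >= occurrences:
                fired = True
        elif active is not False:
            # Active or unknown: send the inactivity message, reset the count.
            active = False
            count = 0
        if i == reload_after:
            active = state_after_reload  # the reload drops the element's state
    return fired


pattern = [0, 1, 0, 1, 0]  # purge is forced right after the 0 at index 2
print(alert_fires(pattern, 2, occurrences=2, state_after_reload=False))  # True
print(alert_fires(pattern, 2, occurrences=2, state_after_reload=None))   # False
```

The first call corresponds to the actual result (the 1 after the purge never resets the count), the second to the expected result.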

Comment 8 Simeon Pinder 2013-11-22 05:13:45 UTC

Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.