Bug 735262 - alert def with multiple conditions using the same metric fires alerts at wrong times
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: Alerts
Version: 4.0.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: John Mazzitelli
QA Contact: Mike Foley
Depends On:
Blocks: jon30-bugs 740131 740135 743764 772769
Reported: 2011-09-01 23:24 EDT by John Mazzitelli
Modified: 2012-02-07 14:18 EST
Doc Type: Bug Fix
Clones: 769942
Last Closed: 2012-02-07 14:18:04 EST


Description John Mazzitelli 2011-09-01 23:24:25 EDT
If you create an alert definition with two conditions using the AND conjunction, and those two conditions use the same metric, alerts can fire at times when it appears they should not.

For example, to replicate, create an alert definition that mimics these conditions (in my case, I used a CPU resource, metric "User Load"):

IF Metric "User Load" > 40%
AND
IF Metric "User Load" < 60%

I then just waited (I did things on my box to get the CPU to spike and then idle - I'm sure other metrics would reproduce this in a more controlled and repeatable way, but this worked for me within a few minutes).

I ended up getting the following alert triggered:

If Condition: User Load < 60.0%
Actual Value: 14.6%
AND
If Condition: User Load > 40.0%
Actual Value: 40.6%

First, notice that the actual values are two different numbers. At first glance that looks wrong: how could we evaluate this condition set with two different numbers? Second, notice that this set would have evaluated to false for the 14.6% value had we evaluated the full condition set at the time that value came in (that is to say, "40% < 14.6% < 60%" is false).

The conditions have different values because each condition is tested independently of the others. So, when the User Load metric came in at 14.6%, each condition was tested on its own:

a) the first condition evaluates to TRUE (14.6% < 60%), and was stored as a "triggered condition",
b) the second condition evaluates to FALSE (14.6% > 40%).

Therefore, the alert doesn't fire, but the first condition is flagged as true.

Now a new metric report comes in with a metric value of 40.6%. Let's evaluate the conditions again:

a) the first condition still evaluates to TRUE (40.6% < 60%)
b) the second condition is now TRUE (40.6% > 40%), and is now stored as a "triggered condition".

Once that second metric report came in, and that second condition was flipped to true, the alert definition's condition set is now TRUE because both conditions are TRUE. An alert fires - and this is why we see the numbers we see in the alert:

If Condition: User Load < 60.0%
Actual Value: 14.6%
AND
If Condition: User Load > 40.0%
Actual Value: 40.6%
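
The sequence above can be replayed with a small Python sketch. This is not RHQ code: the Condition class, process_report and the "keep the value that first made the condition true, clear it on a false evaluation" bookkeeping are assumptions made only to illustrate how testing each condition independently can pair a stale value with a fresh one.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Condition:
    # Hypothetical stand-in for one alert condition; not an RHQ class.
    label: str
    compare: Callable[[float, float], bool]
    threshold: float
    # Value that flagged this condition as "triggered", if any.
    triggered_value: Optional[float] = None


def process_report(conditions: List[Condition], value: float) -> bool:
    """Test every condition on its own, then apply the ALL conjunction."""
    for cond in conditions:
        if cond.compare(value, cond.threshold):
            # Assumption: a condition that is already flagged keeps the
            # value that first made it true.
            if cond.triggered_value is None:
                cond.triggered_value = value
        else:
            # Assumption: a false evaluation clears the flag.
            cond.triggered_value = None
    return all(c.triggered_value is not None for c in conditions)


conditions = [
    Condition("User Load > 40%", operator.gt, 0.40),
    Condition("User Load < 60%", operator.lt, 0.60),
]

# Two consecutive metric reports: 14.6%, then 40.6%.
fired = [process_report(conditions, v) for v in (0.146, 0.406)]
recorded = {c.label: c.triggered_value for c in conditions}

print(fired)
print(recorded)
```

Under these assumptions the second report fires the alert, and the "< 60%" condition still carries the 14.6% value from the earlier report.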

The design page says this about the "ALL" conjunction (i.e. "AND"):

http://rhq-project.org/display/RHQ/Alerts#Alerts-ConditionExpression

"It doesn't matter if one condition is known to be true several times in a row. The last known value for each condition must be true simultaneously before this alert definition will fire an alert."

So, as long as both conditions are true simultaneously (and they are as you see in the above analysis), an alert will fire.

The problem here is we are using the SAME metric with DIFFERENT conditions.

Not sure what the solution is.
Comment 1 John Mazzitelli 2011-09-01 23:51:44 EDT
Here is data from my database, to show what happened:

RHQ_ALERT_DEFINITION
====================
id     name
10001  test alert

RHQ_ALERT_CONDITION
===================
id     type       name       comparator threshold  alert_definition_id
10001  THRESHOLD  User Load  >          0.4        10001
10002  THRESHOLD  User Load  <          0.6        10001

RHQ_ALERT_CONDITION_LOG
=======================
id     ctime                    alert_id  condition_id  value
10003  1314931885754 (22:51:25) (null)    10002         0.168
10021  1314931735753 (22:48:55) 10012     10001         0.407
10012  1314931705754 (22:48:25) 10011     10001         0.413
10013  1314931705754 (22:48:25) 10012     10002         0.413
10002  1314931675753 (22:47:55) 10001     10001         0.406
10011  1314931675753 (22:47:55) 10011     10002         0.406
10001  1314931645754 (22:47:25) 10001     10002         0.146

RHQ_ALERT
=========
id     alert_definition_id  ctime
10012  10001                1314931745613 (22:49:05)
10011  10001                1314931715616 (22:48:35)
10001  10001                1314931685653 (22:48:05)
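
One way to read the tables above is to group the condition log rows by alert_id and check, for each alert, whether its rows come from more than one metric report and whether every logged value would have satisfied both conditions on its own. The sketch below does that in Python with the rows copied from the tables; the function names and the "mixed"/"spurious" labels are mine, not RHQ terminology.

```python
from collections import defaultdict

# condition_id -> (comparator, threshold), copied from RHQ_ALERT_CONDITION.
CONDITIONS = {10001: (">", 0.4), 10002: ("<", 0.6)}

# (log_id, ctime_ms, alert_id, condition_id, value),
# copied from RHQ_ALERT_CONDITION_LOG.
CONDITION_LOG = [
    (10003, 1314931885754, None, 10002, 0.168),
    (10021, 1314931735753, 10012, 10001, 0.407),
    (10012, 1314931705754, 10011, 10001, 0.413),
    (10013, 1314931705754, 10012, 10002, 0.413),
    (10002, 1314931675753, 10001, 10001, 0.406),
    (10011, 1314931675753, 10011, 10002, 0.406),
    (10001, 1314931645754, 10001, 10002, 0.146),
]


def satisfies_all(value):
    """True if this single value passes every condition of the definition."""
    return all(
        value > threshold if op == ">" else value < threshold
        for op, threshold in CONDITIONS.values()
    )


def summarize(log):
    """Map alert_id -> (mixed, spurious).

    mixed:    the alert's log rows come from more than one report time.
    spurious: at least one logged value fails the full condition set.
    """
    by_alert = defaultdict(list)
    for _log_id, ctime, alert_id, _cond_id, value in log:
        if alert_id is not None:
            by_alert[alert_id].append((ctime, value))
    summary = {}
    for alert_id, rows in sorted(by_alert.items()):
        mixed = len({ctime for ctime, _ in rows}) > 1
        spurious = not all(satisfies_all(value) for _, value in rows)
        summary[alert_id] = (mixed, spurious)
    return summary


summary = summarize(CONDITION_LOG)
for alert_id, (mixed, spurious) in summary.items():
    print(alert_id, "mixed reports:", mixed, "spurious:", spurious)
```

With this data all three alerts combine values from two different reports, but only alert 10001 includes a value (0.146) that fails the full condition set by itself.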
Comment 2 John Mazzitelli 2011-09-02 00:05:50 EDT
Here's where you can see this happen, just by looking at the data alone:

1) At 22:47:25, we got a User Load value of 14.6% (0.146). This makes condition #10002 TRUE (condition User Load < 60% (0.6)). Condition log #10001 is inserted.

2) Our next measurement report came in 30 seconds later (my metric collection interval was 30 seconds). So at 22:47:55, the new metric value that was collected was 40.6% (0.406). This makes condition #10001 TRUE (condition User Load > 40% (0.4)). Condition log #10002 is inserted.

3) Both conditions are now TRUE which satisfies the ALL conjunction. Alert #10001 is fired at 22:48:05 which is when it was determined that all conditions are true and the condition set itself evaluates to TRUE.

RHQ_ALERT_CONDITION
===================
id     type       name       comparator threshold  alert_definition_id
10001  THRESHOLD  User Load  >          0.4        10001
10002  THRESHOLD  User Load  <          0.6        10001

RHQ_ALERT_CONDITION_LOG
=======================
id     ctime                    alert_id  condition_id  value
10002  1314931675753 (22:47:55) 10001     10001         0.406
10001  1314931645754 (22:47:25) 10001     10002         0.146

RHQ_ALERT
=========
id     alert_definition_id  ctime
10001  10001                1314931685653 (22:48:05)
Comment 3 Jay Shaughnessy 2011-09-02 09:09:42 EDT
A thought for consideration...

This issue could be avoided if we offered a range condition for metric
thresholds. I'm guessing the reason we don't have this is because it 
hadn't really been asked for and that it could be expressed as multiple conditions. But now we can see the issues with multiple conditions for 
the same metric.  It's one alternative to potentially mucking with the
alert condition log correlation and it would be a useful addition going 
forward, in that it simplifies alert definition.

While we're at it, why not add <= and >= as operators? Not sure why we
don't have those either.
Comment 4 John Mazzitelli 2011-09-09 15:03:17 EDT
drift branch, git commit 84ec4bd2587e7bb249e15bc9690e22ec30eddb28

we now support a "range" conditional. Rather than be forced to create two separate conditions using the same metric, you can now define a single condition with a low/high range. You can then ask to be alerted if a metric value falls outside of that range OR inside that range.
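
As a rough illustration of why a single range condition avoids the problem, here is a Python sketch (not the committed RHQ code) in which one metric value is checked against both bounds in one step. The function name and parameters are hypothetical; the inclusive/exclusive variants follow the four options listed in the verification comment later in this bug, and the exact boundary semantics for "outside inclusive" are my assumption.

```python
def range_condition(value, low, high, inside=True, inclusive=False):
    """Evaluate one range condition against a single metric value.

    Both bounds are tested against the same value, so no stale value
    from an earlier report can take part in the result.
    """
    if inside:
        if inclusive:
            return low <= value <= high
        return low < value < high
    # Assumption: "outside inclusive" counts the bounds as outside.
    if inclusive:
        return value <= low or value >= high
    return value < low or value > high


# The two User Load values from the original report, range 40%..60%.
print(range_condition(0.146, 0.4, 0.6))                # inside, exclusive
print(range_condition(0.406, 0.4, 0.6))                # inside, exclusive
print(range_condition(0.146, 0.4, 0.6, inside=False))  # outside, exclusive
```

With this shape the 14.6% report can never contribute to an "inside 40%..60%" alert, because the same value has to pass both bounds.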

This still needs to be merged into master.

This has NOT been backported to older versions. To do that, we need to also fix the old JSF/struts pages so a user can actually create this new alert range conditional.
Comment 5 John Mazzitelli 2011-09-12 10:06:40 EDT
this has been merged into master.

I have a patch ready to go into the release-3.0.1 branch which also updates the struts UI. Will add another comment here when that work is done.
Comment 6 John Mazzitelli 2011-09-12 10:14:41 EDT
pushed to release-3.0.1 branch (commit sha 20d95b4)
Comment 7 John Mazzitelli 2011-09-12 10:30:01 EDT
I added an FAQ, which may or may not be helpful. I didn't know the best way to put the problem into words so people could understand the symptom.

http://rhq-project.org/display/JOPR2/FAQ#FAQ-WhydoIseealertstriggeredondifferentmetricvaluesondifferentalertdefinitionconditionswhentheyareusingthesamemetric%3F

"Why do I see alerts triggered on different metric values on different alert definition conditions when they are using the same metric?

   This can occur due to the nature of how alert conditions are processed when measurement data comes in from the agent. This happens when you have a single alert definition with multiple conditions that use the same metric and that alert definition uses the "ALL" conjunction (that is, the conditions must all be true for the alert to fire). For example, do not have an alert definition that says, "alert if ALL conditions are true: if metric X > 5 and if metric X < 10". Note, however, that a new feature has been added to RHQ 4 to support range checking (which is usually why people create multiple conditions using the same metric with the ALL conjunction in an alert definition - for more information, see https://bugzilla.redhat.com/show_bug.cgi?id=735262)."
Comment 8 Sunil Kondkar 2011-09-23 08:23:51 EDT
build#426 (Version: 4.1.0-SNAPSHOT Build Number: 7739090)

Verified below range conditions for condition type 'Measurement value range' on the platform for the Free Memory metric:

Inside exclusive
Outside exclusive
Inside inclusive
Outside inclusive

Defined alert conditions with low and high values. Verified that the alerts get fired and alert emails are received for all above range conditions.
Comment 9 bkramer 2011-10-14 08:47:00 EDT
I am reopening this case as the customer wants to define a percentage of the baseline (for instance), but our patch provides only the conditions Between (exclusive), Outside (exclusive), Between (inclusive) and Outside (inclusive) with absolute values (low and high). We should allow the same options that are currently possible for Greater than, Equal to and Less than.

Also, when the patch is applied, the login page says "Welcome to RHQ". This should be changed.
Comment 10 John Mazzitelli 2011-10-14 10:48:46 EDT
do you mean they want an alert definition whose condition is:

> 10% max baseline AND < 30% max baseline

??
Comment 11 bkramer 2011-10-14 11:26:22 EDT
yes, that's how I understood it:

"But when I tried to create a new set of alarm templates for tx datasources, I came up with a problem with the new if conditions. I wanted to a percentage of the baseline but the newly added between tests only allow fix values wich is not useful for that because the different datasources have different max values."
Comment 12 John Mazzitelli 2011-10-14 16:09:33 EDT
rather than reuse this issue, I'm re-closing this one; a new bug has been added to track the need for baseline range conditions. See Bug #746337.
Comment 13 Mike Foley 2012-02-07 14:18:04 EST
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE
