If you create an alert definition with two conditions using the AND conjunction, and those two conditions use the same metric, alerts can fire at times when it appears they should not.

For example, to replicate, create an alert definition that mimics these conditions (in my case, I used a CPU resource and the metric "User Load"):

IF Metric "User Load" > 40%
AND IF Metric "User Load" < 60%

I then just waited (I did things on my box to get the CPU to spike and then idle - I'm sure there are other metrics that are more easily reproducible in a controlled and repeatable way, but this worked for me within a few minutes). I ended up getting the following alert triggered:

If Condition: User Load < 60.0% - Actual Value: 14.6%
AND If Condition: User Load > 40.0% - Actual Value: 40.6%

First, notice that the actual values used are two different numbers - at first glance you would think that is wrong: how could we evaluate this condition set with two different numbers? Second, notice that this set would have evaluated to false for the 14.6% value had we evaluated the full condition set at the time that value came in (that is to say, "40% < 14.6% < 60%" is false).

The conditions have different values because each condition is tested independently of the others. So, when the User Load metric came in at 14.6%, each condition was tested independently: a) the first condition evaluates to TRUE (14.6% < 60%) and is stored as a "triggered condition"; b) the second condition evaluates to FALSE (14.6% is not > 40%). Therefore, the alert doesn't fire, but the first condition is flagged as true. Now a new metric report comes in and the metric value is 40.6% - let's evaluate the conditions again: a) the first condition still evaluates to TRUE (40.6% < 60%); b) the second condition is now TRUE (40.6% > 40%) and is now stored as a "triggered condition".
Once that second metric report came in, and that second condition was flipped to true, the alert definition's condition set is now TRUE because both conditions are TRUE. An alert fires - and this is why we see the numbers we see in the alert:

If Condition: User Load < 60.0% - Actual Value: 14.6%
AND If Condition: User Load > 40.0% - Actual Value: 40.6%

As per the design page, talking about the "ALL" conjunction (i.e. "AND") - http://rhq-project.org/display/RHQ/Alerts#Alerts-ConditionExpression:

"It doesn't matter if one condition is known to be true several times in a row. The last known value for each condition must be true simultaneously before this alert definition will fire an alert."

So, as long as both conditions are true simultaneously (and they are, as you see in the above analysis), an alert will fire. The problem here is that we are using the SAME metric with DIFFERENT conditions. Not sure what the solution is.
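The latching behavior described above can be sketched in a few lines. This is a hypothetical, simplified illustration of the "each condition is tested independently and its triggered value is remembered" semantics - the class and method names are made up for this sketch and are not the actual RHQ implementation:

```python
# Hypothetical sketch of RHQ's "ALL" conjunction behavior (not actual RHQ code):
# each condition is evaluated independently per metric report; a true result is
# latched as a "triggered condition" with the value that triggered it, and the
# alert fires once every condition has a latched true result.
import operator

OPS = {">": operator.gt, "<": operator.lt}

class AlertDefinition:
    def __init__(self, conditions):
        # conditions: list of (comparator, threshold) pairs on the same metric
        self.conditions = conditions
        self.triggered = [None] * len(conditions)  # latched triggering values

    def report(self, value):
        """Evaluate each condition independently against one metric value."""
        for i, (cmp_, threshold) in enumerate(self.conditions):
            if self.triggered[i] is None and OPS[cmp_](value, threshold):
                self.triggered[i] = value  # latch the value that triggered it
        if all(v is not None for v in self.triggered):
            fired = list(self.triggered)
            self.triggered = [None] * len(self.conditions)  # reset after firing
            return fired
        return None

# Replaying the reported scenario: "> 40% AND < 60%" on the same metric.
alert = AlertDefinition([(">", 0.40), ("<", 0.60)])
print(alert.report(0.146))  # only "< 0.60" latches, no alert: None
print(alert.report(0.406))  # "> 0.40" latches too, fires: [0.406, 0.146]
```

Note how the fired alert carries two different values (0.406 and 0.146), exactly as seen in the reported alert, because each latched condition remembers the value that triggered it.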
Here is data from my database, to show what happened:

RHQ_ALERT_DEFINITION
====================
id     name
10001  test alert

RHQ_ALERT_CONDITION
===================
id     type       name       comparator  threshold  alert_definition_id
10001  THRESHOLD  User Load  >           0.4        10001
10002  THRESHOLD  User Load  <           0.6        10001

RHQ_ALERT_CONDITION_LOG
=======================
id     ctime                     alert_id  condition_id  value
10003  1314931885754 (22:51:25)  (null)    10002         0.168
10021  1314931735753 (22:48:55)  10012     10001         0.407
10012  1314931705754 (22:48:25)  10011     10001         0.413
10013  1314931705754 (22:48:25)  10012     10002         0.413
10002  1314931675753 (22:47:55)  10001     10001         0.406
10011  1314931675753 (22:47:55)  10011     10002         0.406
10001  1314931645754 (22:47:25)  10001     10002         0.146

RHQ_ALERT
=========
id     alert_definition_id  ctime
10012  10001                1314931745613 (22:49:05)
10011  10001                1314931715616 (22:48:35)
10001  10001                1314931685653 (22:48:05)
Here's where you can see this happen, just by looking at the data alone:

1) At 22:47:25, we got a User Load value of 14.6% (0.146). This makes condition #10002 TRUE (condition User Load < 60% (0.6)). Condition log #10001 is inserted.

2) Our next measurement report came in 30 seconds later (my metric collection interval was 30 seconds). So at 22:47:55, the new metric value that was collected was 40.6% (0.406). This makes condition #10001 TRUE (condition User Load > 40% (0.4)). Condition log #10002 is inserted.

3) Both conditions are now TRUE, which satisfies the ALL conjunction. Alert #10001 is fired at 22:48:05, which is when it was determined that all conditions are true and the condition set itself evaluates to TRUE.

RHQ_ALERT_CONDITION
===================
id     type       name       comparator  threshold  alert_definition_id
10001  THRESHOLD  User Load  >           0.4        10001
10002  THRESHOLD  User Load  <           0.6        10001

RHQ_ALERT_CONDITION_LOG
=======================
id     ctime                     alert_id  condition_id  value
10002  1314931675753 (22:47:55)  10001     10001         0.406
10001  1314931645754 (22:47:25)  10001     10002         0.146

RHQ_ALERT
=========
id     alert_definition_id  ctime
10001  10001                1314931685653 (22:48:05)
A thought for consideration... This issue could be avoided if we offered a range condition for metric thresholds. I'm guessing the reason we don't have this is that it hadn't really been asked for and that it could be expressed as multiple conditions. But now we can see the issues with multiple conditions for the same metric. It's one alternative to potentially mucking with the alert condition log correlation, and it would be a useful addition going forward in that it simplifies alert definitions. While we're at it, why not add <= and >= as operators? Not sure why we don't have those either.
In the drift branch, git commit 84ec4bd2587e7bb249e15bc9690e22ec30eddb28, we now support a "range" conditional. Rather than being forced to create two separate conditions using the same metric, you can now define a single condition with a low/high range. You can then ask to be alerted if a metric value falls outside of that range OR inside that range. This still needs to be merged into master. This has NOT been backported to older versions; to do that, we also need to fix the old JSF/Struts pages so a user can actually create this new alert range conditional.
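Conceptually, the range conditional collapses the problematic two-condition pair into a single check against one metric value. A hypothetical sketch of the four variants (inside/outside, inclusive/exclusive) - illustrative only, not the actual RHQ code:

```python
# Hypothetical sketch of a range conditional (not the actual RHQ implementation).
# A single condition with low/high bounds is evaluated against one metric value
# at a time, so there is no per-condition latching across metric reports.

def range_matches(value, low, high, inside=True, inclusive=False):
    """Return True if the metric value satisfies the range condition."""
    if inclusive:
        hit = low <= value <= high
    else:
        hit = low < value < high
    return hit if inside else not hit

# A single "inside (exclusive) 0.40..0.60" condition replaces the problematic
# pair "> 0.40 AND < 0.60", and each value is judged on its own:
print(range_matches(0.146, 0.40, 0.60, inside=True))   # False - no alert
print(range_matches(0.406, 0.40, 0.60, inside=True))   # True  - alert
print(range_matches(0.146, 0.40, 0.60, inside=False))  # True  - outside range
```

Because the whole range is tested atomically against each incoming value, the stale-value problem described above cannot occur.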
this has been merged into master. I have a patch ready to go into the release-3.0.1 branch which also updates the struts UI. Will add another comment here when that work is done.
pushed to release-3.0.1 branch (commit sha 20d95b4)
I added an FAQ, which may or may not be helpful. I didn't know the best way to put the problem into words so people could understand the symptom. http://rhq-project.org/display/JOPR2/FAQ#FAQ-WhydoIseealertstriggeredondifferentmetricvaluesondifferentalertdefinitionconditionswhentheyareusingthesamemetric%3F

"Why do I see alerts triggered on different metric values on different alert definition conditions when they are using the same metric?

This can occur due to the nature of how alert conditions are processed when measurement data comes in from the agent. This happens when you have a single alert definition with multiple conditions that use the same metric and that alert definition uses the "ALL" conjunction (that is, the conditions must all be true for the alert to fire). For example, do not have an alert definition that says, "alert if ALL conditions are true: if metric X > 5 and if metric X < 10". Note, however, that a new feature has been added to RHQ 4 to support range checking (which is usually why people create multiple conditions using the same metric with the ALL conjunction in an alert definition - for more information, see https://bugzilla.redhat.com/show_bug.cgi?id=735262)."
build #426 (Version: 4.1.0-SNAPSHOT, Build Number: 7739090)

Verified the below range conditions for condition type 'Measurement value range' on the platform for the Free Memory metric:
- Inside exclusive
- Outside exclusive
- Inside inclusive
- Outside inclusive

Defined alert conditions with low and high values. Verified that the alerts get fired and alert emails are received for all of the above range conditions.
I am reopening this case because the customer wants to define a percentage of the baseline (for instance), but our patch only provides the conditions Between (exclusive), Outside (exclusive), Between (inclusive) and Outside (inclusive) with absolute values (low and high). We should allow the same as is currently possible for Greater than, Equal to and Less than. Also, when the patch is applied, the login page says "Welcome to RHQ". This should be changed.
do you mean they want an alert definition whose condition is: > 10% max baseline AND < 30% max baseline ??
yes, that's how I understood it: "But when I tried to create a new set of alarm templates for tx datasources, I came up with a problem with the new if conditions. I wanted to use a percentage of the baseline, but the newly added between tests only allow fixed values, which is not useful for that because the different datasources have different max values."
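What the customer is asking for could conceptually look like this - a purely hypothetical sketch of a baseline-relative range check (this feature does not exist in the patch; names and structure are invented for illustration):

```python
# Purely hypothetical sketch: a range condition expressed as percentages of a
# metric's baseline rather than absolute low/high values. Each resource would
# resolve the range against its own baseline, so one alert template could work
# across datasources with different max values.

def baseline_range_matches(value, baseline, low_pct, high_pct):
    """True if value lies strictly between low_pct% and high_pct% of baseline."""
    low = baseline * low_pct / 100.0
    high = baseline * high_pct / 100.0
    return low < value < high

# Two datasources with different baselines, same "between 10% and 30%" template:
print(baseline_range_matches(25.0, 100.0, 10, 30))  # range 10..30  -> True
print(baseline_range_matches(25.0, 500.0, 10, 30))  # range 50..150 -> False
```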
rather than reuse this issue, I'm re-closing this and a new bug has been added to track the need for baseline range conditions. See Bug #746337
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE