Bug 741414

Summary: alerts with compound AND conditions can incorrectly fire when one of the conditions goes from true to false within 30 seconds of the other condition going from false to true
Product: [Other] RHQ Project
Reporter: Ian Springer <ian.springer>
Component: Alerts
Assignee: RHQ Project Maintainer <rhq-maint>
Status: NEW
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: low
Version: 4.1
CC: flo_bugzilla, gerhard.dreschler, hbrock, hrupp, mazz
Hardware: All
OS: All
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=735262
          https://bugzilla.redhat.com/show_bug.cgi?id=737565
Doc Type: Bug Fix

Description Ian Springer 2011-09-26 14:22:51 EDT
If the following alert is defined:

(metricA > 10) AND (metricB < 20)

And a metric report comes in containing:

metricA = 11, metricB = 21

followed by a report containing:

metricA = 9, metricB = 19

The alert can incorrectly fire after the 2nd report is processed if the matched conditions happen to be processed by the condition consumer MDB in the order: (metricB < 20) then (metricA > 10).

This can happen because AbstractConditionCache.processCacheElements() processes the metric datums from a given metric report as follows:

- for each datum:
--- for each cached condition for the metric def corresponding to the datum:
----- a) evaluate the datum value against the condition
----- b) publish either a positive or negative condition to the JMS condition queue
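
The loop above could be sketched roughly as follows (the class, record, and method names here are simplified stand-ins for illustration, not the actual RHQ types):

```java
import java.util.*;
import java.util.function.Predicate;

public class CurrentProcessing {
    // simplified stand-in for a cached alert condition
    record Condition(String id, Predicate<Double> check) {}

    static final List<String> published = new ArrayList<>();

    // Current logic: for each datum, evaluate its cached conditions and
    // publish each result to the queue immediately, so positive and
    // negative results are interleaved in whatever order the datums arrive.
    static void processCacheElements(Map<String, Double> report,
                                     Map<String, List<Condition>> cache) {
        for (var datum : report.entrySet()) {
            for (Condition c : cache.getOrDefault(datum.getKey(), List.of())) {
                boolean positive = c.check().test(datum.getValue());
                published.add((positive ? "+" : "-") + c.id());
            }
        }
    }
}
```

With the second report from the description (metricA = 9, metricB = 19), if the metricB datum happens to be iterated first, the queue sees the positive (metricB < 20) result before the negative (metricA > 10) result.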

As each condition is published to the queue, the AlertConditionConsumerBean MDB will either: a) if the condition is positive, create or update an alert condition log and then check whether the full condition set is now true, or b) if the condition is negative, delete any existing alert condition log for that condition (i.e. an invalidated condition log).

In our example, if the (metricB = 19) datum is processed first, the alert will fire even though metricA is now 9 and no longer satisfies (metricA > 10), because the (metricA = 9) datum has not been processed yet, so the stale condition log created for (metricA = 11) has not yet been invalidated.
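
This interleaving can be demonstrated with a small, self-contained simulation of the consumer's condition-log bookkeeping (all names here are hypothetical; this is a sketch of the described behavior, not the actual RHQ code):

```java
import java.util.*;

public class ConditionRaceDemo {
    // condition id -> an active (positive) condition log exists
    static final Set<String> conditionLogs = new HashSet<>();
    static boolean alertFired = false;

    // Mimics the consumer MDB: positive -> log + check the full AND set;
    // negative -> delete any invalidated condition log.
    static void publish(String conditionId, boolean positive) {
        if (positive) {
            conditionLogs.add(conditionId);
            if (conditionLogs.containsAll(List.of("A", "B"))) {
                alertFired = true;   // full condition set appears satisfied
            }
        } else {
            conditionLogs.remove(conditionId);
        }
    }

    public static void main(String[] args) {
        // Report 1: metricA = 11 (A positive), metricB = 21 (B negative)
        publish("A", true);
        publish("B", false);
        // Report 2: metricA = 9, metricB = 19 -- but B's datum is consumed first
        publish("B", true);   // fires here: A's stale log from report 1 still exists
        publish("A", false);  // the invalidation arrives too late
        System.out.println("alert fired: " + alertFired); // prints: alert fired: true
    }
}
```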

This bug can only occur if the same metric report contains both a datum that causes the first condition to go from true to false and a datum that causes the second condition to go from false to true. Since each Agent sends metric reports every 30 seconds, the bug can only occur when one condition goes to true and the other goes to false within the same 30-second window on the Agent.

I think the fix would be to rewrite AbstractConditionCache.processCacheElements() to do the following:

1) for each datum:
--- for each cached condition for the metric def corresponding to the datum:
----- a) evaluate the datum value against the condition
----- b) store the condition eval result in one of two temporary lists: one containing all the conditions that were positive, the other containing all the conditions that were negative
2) for each condition in the list of negative conditions, publish a negative condition to the JMS condition queue
3) for each condition in the list of positive conditions, publish a positive condition to the JMS condition queue

Publishing all of the negative conditions before publishing any of the positive conditions will ensure that any invalidated condition logs are deleted prior to the positive conditions being published and potentially causing the alert to fire.
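
A sketch of that two-phase approach, again using simplified stand-in names rather than the actual RHQ types:

```java
import java.util.*;
import java.util.function.Predicate;

public class ProposedProcessing {
    // simplified stand-in for a cached alert condition
    record Condition(String id, Predicate<Double> check) {}

    static final List<String> published = new ArrayList<>();

    // Proposed logic: evaluate everything first, buffering the results,
    // then publish all negatives before any positives so invalidated
    // condition logs are deleted before a positive can complete an AND set.
    static void processCacheElements(Map<String, Double> report,
                                     Map<String, List<Condition>> cache) {
        List<String> negatives = new ArrayList<>();
        List<String> positives = new ArrayList<>();
        for (var datum : report.entrySet()) {
            for (Condition c : cache.getOrDefault(datum.getKey(), List.of())) {
                (c.check().test(datum.getValue()) ? positives : negatives).add(c.id());
            }
        }
        for (String id : negatives) published.add("-" + id);  // invalidate first
        for (String id : positives) published.add("+" + id);
    }
}
```

With the second report from the description, the negative (metricA > 10) result is now always published before the positive (metricB < 20) result, regardless of datum iteration order, so the consumer deletes the stale metricA condition log before re-checking the AND set.
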
Comment 1 Ian Springer 2011-09-26 14:31:43 EDT
As of [master 8ada3a7], the pattern-generator plugin can be used to reproduce this bug as follows:

- define an alert with the following condition: 
  (Pattern 1 Metric = 0) AND (Pattern 2 Metric = 0)

The plugin always reports either:

  Pattern 1 Metric = 0, Pattern 2 Metric = 1

or:

  Pattern 1 Metric = 1, Pattern 2 Metric = 0

so the alert should never fire. However, due to this bug, it will fire.
Comment 2 John Mazzitelli 2011-09-26 14:39:05 EDT
see bug #735262 for fixing a specialized form of this issue (that is, with the same metric used in multiple conditions)

Bug #737565 forces the user to pick different metrics per condition, but this issue shows that even different metrics can exhibit odd behavior under rare conditions.