Bug 1025491
Summary: | Alerts get triggered even when dampening rules have not been satisfied due to missing InactiveAlertConditionMessage | | |
---|---|---|---|
Product: | [JBoss] JBoss Operations Network | Reporter: | Larry O'Leary <loleary> |
Component: | Monitoring - Alerts | Assignee: | Lukas Krejci <lkrejci> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | JON 3.1.1 | CC: | lkrejci |
Target Milestone: | ER07 | | |
Target Release: | JON 3.2.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1028636 1029656 | Environment: | |
Last Closed: | 2014-01-02 20:43:16 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1012435, 1028636, 1029656 | | |
Attachments: | | | |

Description
Larry O'Leary
2013-10-31 19:40:06 UTC
The logs lead me to believe the following has happened:

1) The alert condition receives data every 20 minutes.
2) Once an hour, there's a data purge job that includes the baseline calculation.
3) The baseline calculation job marks all the agents as needing their alert condition caches reloaded.
4) An HA heartbeat (running every 30s) detects the need to reload the agent alert condition caches.
5) After a cache reload, the "counting" needed for dampening is broken, because we lose the state of the cache elements (their "active" state). I.e. the dampening state still has the count > 0, because it didn't receive the inactivity message, but the corresponding cache element is not active anymore and so will never actually send the inactivity message again, unless the condition is satisfied and then unsatisfied within the 1 hour cycle between the cache reloads.
6) As a consequence of 5), once the count reaches the given dampening constraint (3), we will fire the alert every time the condition is satisfied. This is because the condition is evaluated every 20 minutes and is satisfied every 3rd time (i.e. once in an hour, AFTER the caches have been cleaned).

A time series to explain it more visually:

    (dampening count = 2)
    (active = true)
    |
    | value = 0, condition satisfied -> dampening count = 3 **** alert fires ****
    |
    | (data purge -> baseline calc -> cache reload -> active = false)
    |
    | value != 0, condition not satisfied, active == false, dampening count == 3
    |
    | value != 0, condition not satisfied, active == false, dampening count == 3
    |
    | value = 0, condition satisfied -> active = true, dampening count == 3 **** alert fires ****
    |
    | (data purge -> baseline calc -> cache reload -> active = false)
    |
    | value != 0, condition not satisfied, active == false, dampening count == 3
    |
    | value != 0, condition not satisfied, active == false, dampening count == 3
    |
    | value = 0, condition satisfied -> active = true, dampening count == 3 **** alert fires ****
    |
    | ...

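The sequence above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `DampeningReloadDemo`, its fields and `alertsFired` are hypothetical names, not RHQ classes, and the dampening logic is reduced to a single counter that is incremented on every satisfied evaluation, reset by an inactivity message, and never reset after firing (as in the time series). The condition is satisfied on every 3rd evaluation and the cache reload happens right after it.

```java
/** Hypothetical, simplified model of the failure; these are not RHQ classes. */
class DampeningReloadDemo {
    static final int OCCURRENCES = 3;   // dampening constraint from the description

    boolean active = false;             // cache element state, lost on every cache reload
    int dampeningCount = 0;             // dampening state, survives the cache reload
    int fired = 0;

    /** Models the reload triggered by the purge/baseline job: activity state is lost. */
    void reloadCache() {
        active = false;
    }

    void evaluate(int value) {
        if (value == 0) {               // condition satisfied
            active = true;
            dampeningCount++;
            if (dampeningCount >= OCCURRENCES) {
                fired++;
            }
        } else if (active) {            // inactivity message is sent, count is reset
            active = false;
            dampeningCount = 0;
        }
        // else: element is already "inactive", no message is sent, count stays stale
    }

    /** Runs 5 simulated hours of 3 evaluations each; the 3rd value of each hour is 0. */
    static int alertsFired(boolean reloadEveryHour) {
        DampeningReloadDemo demo = new DampeningReloadDemo();
        for (int i = 0; i < 15; i++) {
            boolean lastOfHour = (i % 3 == 2);
            demo.evaluate(lastOfHour ? 0 : 1);
            if (reloadEveryHour && lastOfHour) {
                demo.reloadCache();
            }
        }
        return demo.fired;
    }

    public static void main(String[] args) {
        System.out.println("alerts without reloads: " + alertsFired(false));
        System.out.println("alerts with reloads:    " + alertsFired(true));
    }
}
```

In this model the condition is never satisfied twice in a row, so no alert should fire. Without reloads the counter is reset after each satisfied evaluation; with reloads the reset never arrives and the alert fires from the third hour on.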
Now for the ways to fix this.

I think all we need is for the state of the individual cache elements to survive the cache reload. In that case, the counting doesn't get broken and we should dampen as expected. This could be quite easy (i.e. just copy the states from the old cache before we make the new reloaded cache active), but it would still break when the agent failed over to another server in HA: the new server would load the cache but would have nowhere to transfer the state of the cache elements from, because the state was held on the old server.

So the proper solution here is to store the state of the cache elements in the database. The cache elements correspond to individual conditions of the alert definitions, so one straightforward place to store the activity state would be along with the alert conditions.

Another way to fix this would be to drop the notion of the activity state of the cache elements (and thus make the cache truly stateless), but that would mean that the inactivity message would be sent every time a stateful condition did not match (which would be, by the nature and purpose of alert conditions, the vast majority of times).

Oh, another solution: Make the "active" property of the cache elements a tri-state: unknown, true, false. The default value would be unknown and the inactivity message would be sent if the condition was not matched and the activity was either true or unknown. This would mean we wouldn't have to transfer the state on cache reloads and we also wouldn't have to store it in the database. The failover of the agents to another server would possibly work, too. I need to investigate this further.

Upstream bug 1028636 has been fixed in master.

    commit 73e95ca7c090dd54fec5057c2750385875b9a0ec
    Author: Lukas Krejci <lkrejci>
    Date: Sat Nov 9 00:06:31 2013 +0100

    [BZ 1028636] Dampening broken by agent alert condition cache reloads

    Reloading the cache lost the "activity" state of conditions. This caused dampening to never receive inactivity messages and thus misbehave.

    Changing the cache elements to default to "active" state solves this at the cost of sending 1 redundant inactivity message after a cache reload. This is harmless and has no effect on the dampening resolution.

    The actual implementation changes the activity boolean to a tristate enum of UNKNOWN, ACTIVE and INACTIVE, where UNKNOWN and ACTIVE have the same semantics but are logically distinct.

    (cherry picked from commit e3a41e4a8ce892a3e314db7991223f668f00fe2e)

Moving to ON_QA as available for testing with new brew build.

Steps to reproduce (keep in mind this is for 3.1.2 but will probably work fine for 3.2):

1. Start JBoss ON 3.1.2 system.
2. Install pattern-plugin from upstream RHQ 4.4.0 (4.9.0 for 3.2).
3. Import pattern resource into inventory.
4. Disable metric collection for all metrics except Pattern 1 Metric.
5. Set collection interval for Pattern 1 Metric to 2 minutes.
6. Set pattern to be 0, 1 (1 zero and 1 one).
7. Create the following alert for the pattern resource:
   * _Name_: `Alert - Two 0s in a Row`
   * _Condition Type_: _Measurement Absolute Value Threshold_
   * _Metric_: _Pattern 1 Metric_
   * _Comparator_: _= (Equal to)_
   * _Metric Value_: `0`
   * _Dampening_: _Consecutive_
   * _Occurrences_: `2`
8. Restart agent to reset pattern.
9. Monitor the graphs page for the last 4 minutes and wait for the pattern of 0 and 1 to occur.
10. After the latest metric value of 0 is received, force a data purge job to run: http://localhost:7080/admin/test/control.jsp?mode=dataPurgeJob
11. Wait for the next 1 and 0 values to be reported (approximately 4 minutes).

Actual result: Alert - Two 0s in a Row is fired even though only one 0 occurs (0, 1, 0, 1, 0, 1...)

Expected result: No alert is triggered as the two consecutive 0s will never occur.

Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
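The tri-state approach from the commit message can be sketched as follows, run against the 0, 1 pattern and the "2 consecutive occurrences" dampening from the reproduction steps. This is an illustration under assumptions, not the committed code: only the enum constants UNKNOWN, ACTIVE and INACTIVE come from the commit message, while `TriStateActivityDemo`, its fields and `alertsFired` are hypothetical, and the dampening logic is reduced to a single consecutive-match counter. The cache is reloaded after every evaluation to model the worst case.

```java
/** Hypothetical sketch of the tri-state activity fix; these are not RHQ classes. */
class TriStateActivityDemo {
    /** Constant names taken from the commit message. */
    enum Activity { UNKNOWN, ACTIVE, INACTIVE }

    static final int OCCURRENCES = 2;        // "Consecutive" dampening from the repro steps

    Activity activity = Activity.UNKNOWN;    // default state of a freshly loaded element
    int dampeningCount = 0;
    int fired = 0;

    /** A reload no longer implies "inactive"; the state is simply not known. */
    void reloadCache() {
        activity = Activity.UNKNOWN;
    }

    void evaluate(int value) {
        if (value == 0) {                    // condition satisfied
            activity = Activity.ACTIVE;
            dampeningCount++;
            if (dampeningCount >= OCCURRENCES) {
                fired++;
            }
        } else if (activity != Activity.INACTIVE) {
            // ACTIVE or UNKNOWN: send the inactivity message, which resets the count.
            // After a reload this may be one redundant message, which is harmless.
            activity = Activity.INACTIVE;
            dampeningCount = 0;
        }
    }

    /** Evaluates each value in order, reloading the cache after every evaluation. */
    static int alertsFired(int[] values) {
        TriStateActivityDemo demo = new TriStateActivityDemo();
        for (int value : values) {
            demo.evaluate(value);
            demo.reloadCache();
        }
        return demo.fired;
    }

    public static void main(String[] args) {
        System.out.println("pattern 0,1 repeated: " + alertsFired(new int[] {0, 1, 0, 1, 0, 1}));
        System.out.println("pattern 0,0,1:        " + alertsFired(new int[] {0, 0, 1}));
    }
}
```

With the alternating pattern, each unmatched value resets the count even right after a reload, so no alert fires. Two real consecutive zeros still fire the alert once, so legitimate dampening is unaffected in this model.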