Bug 535435 (RHQ-2130)

Summary: Make it possible to query availability state in the alert conditions
Product: [Other] RHQ Project
Reporter: Lukas Krejci <lkrejci>
Component: Alerts
Assignee: Jay Shaughnessy <jshaughn>
Severity: medium
Docs Contact:
Priority: medium
Version: unspecified
CC: cwelton, jshaughn, rbs
Target Milestone: ---
Keywords: FutureFeature
Target Release: RHQ 4.4.0   
Hardware: All   
OS: All   
URL: http://jira.rhq-project.org/browse/RHQ-2130
Fixed In Version:
Doc Type: Enhancement
Last Closed: 2013-09-01 06:09:43 EDT
Bug Depends On:    
Bug Blocks: 741450    

Description Lukas Krejci 2009-06-05 15:43:00 EDT
Currently it is not possible to create an alert condition that checks the availability state of a resource - only a change of that state (i.e. goes DOWN, goes UP).

This prevents us from creating dampening rules that would smooth out, for example, short-term network outages.

The reason why it is not technically possible now is that we only report availability changes from the agents, not the states of all resources every time an availability report is generated.

The simplest solution to the problem would be to change the semantics of the AlertConditionCacheManagerLocal.checkConditions(Availability...) method (which boils down to the GlobalConditionsCache.checkConditions(Availability...) method).

Today this method loops through the reported availability changes (its arguments), checks the global cache for an availability condition matching each change, and processes the cached condition.

If we changed this method to simply match all the availability conditions every time it's called (merging the provided availability changes with the last known state of the values), we'd achieve the re-evaluation of all the availability conditions on receiving an availability report. Obviously this solution is much more computationally intensive than just re-evaluating the changed availabilities. We could optimize the situation by creating a specialized cache that would only contain conditions that deal with availability state (i.e. is down, is up) and use that instead of the full set of availability-related conditions as indicated above.
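
The idea of merging reported changes into the last known state and re-evaluating only the state conditions could be sketched roughly as follows. This is a minimal illustration, not RHQ code: the class AvailabilityStateSketch, its StateCondition type, and the integer ids are all hypothetical names chosen for the example, and the real GlobalConditionsCache works with different types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are actual RHQ classes.
class AvailabilityStateSketch {
    enum Avail { UP, DOWN }

    // A state condition ("is UP" / "is DOWN") bound to a single resource.
    static final class StateCondition {
        final int conditionId;
        final int resourceId;
        final Avail expected;

        StateCondition(int conditionId, int resourceId, Avail expected) {
            this.conditionId = conditionId;
            this.resourceId = resourceId;
            this.expected = expected;
        }
    }

    // Last known availability per resource id (assumed to be kept in memory).
    private final Map<Integer, Avail> lastKnown = new HashMap<>();

    // Specialized cache holding only the state conditions, not the change conditions.
    private final List<StateCondition> stateConditions = new ArrayList<>();

    void addCondition(StateCondition condition) {
        stateConditions.add(condition);
    }

    // Merges the reported changes into the last known state, then re-evaluates
    // every state condition. Returns the ids of the conditions that match.
    List<Integer> checkConditions(Map<Integer, Avail> reportedChanges) {
        lastKnown.putAll(reportedChanges);
        List<Integer> matched = new ArrayList<>();
        for (StateCondition condition : stateConditions) {
            if (lastKnown.get(condition.resourceId) == condition.expected) {
                matched.add(condition.conditionId);
            }
        }
        return matched;
    }
}
```

With this shape, a report that contains no change for a resource still re-matches an "is DOWN" condition for it, which is what a dampening rule would need in order to count consecutive DOWN evaluations.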

Comment 1 Rodrigo A B Freire 2009-06-15 10:54:14 EDT
Relevant case: https://enterprise.redhat.com/issue-tracker/302566
Comment 2 Joseph Marques 2009-09-02 09:43:10 EDT
There are two parts to this:

1) The availability report from the agent (99% of the time) only reports deltas up to the server - this was done because of the RLE (run-length encoded) nature of availability data
2) You can only alert on deltas (the cache only checks for deltas) - this was done specifically because of #1

However, there are a few other things to keep in mind.  We have the live availability precomputed for every resource in the rhq_resource_avail table, so crafting an in-memory cache of the current availabilities /could/ be done (a single query), but...availability data doesn't always come from the agent (the suspect-agent / backfiller job can mark resources as down too) so we'd need to implement a cache reloading mechanism for when availability data becomes stale.

An alternate solution could be a system-level configuration that turns RLE on or off.  If RLE is off, then the agent will always report availability for all of its managed resources, which would enable availability-based alerting to have 4 possible options: goes down, comes up, is down, is up (the last two being possible when RLE is off).
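
To make the four options concrete, here is a small sketch of how each could be evaluated against a previous and a current availability value. The names AvailabilityOptionsSketch, Option, and matches are hypothetical and do not come from the RHQ code base; the sketch only assumes that, with RLE off, every report carries a current value for every resource.

```java
// Hypothetical sketch; these names are illustrative, not RHQ API.
class AvailabilityOptionsSketch {
    enum Avail { UP, DOWN }

    // The four alerting options named above.
    enum Option { GOES_DOWN, COMES_UP, IS_DOWN, IS_UP }

    // previous may be null when no earlier value is known; in that case
    // the two change options never match.
    static boolean matches(Option option, Avail previous, Avail current) {
        switch (option) {
            case GOES_DOWN:
                return previous == Avail.UP && current == Avail.DOWN;
            case COMES_UP:
                return previous == Avail.DOWN && current == Avail.UP;
            case IS_DOWN:
                return current == Avail.DOWN;
            case IS_UP:
                return current == Avail.UP;
            default:
                return false;
        }
    }
}
```

The two change options need both values, while the two state options depend only on the current value, which is why they become usable once every report includes every resource.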
Comment 3 Red Hat Bugzilla 2009-11-10 15:58:26 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2130
Comment 4 Jay Shaughnessy 2012-02-28 15:27:17 EST
This has basically been addressed with Availability Duration alerting in the jshaughn/avail branch.
Comment 5 Jay Shaughnessy 2012-03-30 16:40:39 EDT
This is in master.
Comment 6 Heiko W. Rupp 2013-09-01 06:09:43 EDT
Bulk closing of items that are on_qa and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.