Red Hat Bugzilla – Bug 535435
Make it possible to query availability state in the alert conditions
Last modified: 2013-09-01 06:09:43 EDT
Currently it is not possible to create an alert condition checking availability state of a resource - only its change (i.e. goes DOWN, goes UP).
This prevents us from creating dampening rules that would smooth out, for example, short-term network outages.
The reason why it is not technically possible now is that we only report availability changes from the agents, not the states of all resources every time an availability report is generated.
The simplest solution to the problem would be to change the semantics of the AlertConditionCacheManagerLocal.checkConditions(Availability...) method (which boils down to the GlobalConditionsCache.checkConditions(Availability...) method).
Today this method loops through the reported availability changes (its arguments), checks the global cache for an availability condition matching each change, and processes the cached condition if one is found.
If we changed this method to simply match all the availability conditions every time it's called (merging the provided availability changes with the last known state of the values) we'd achieve the re-evaluation of all the availability conditions on receiving an availability report. Obviously this solution is much more computationally intensive than just re-evaluating the changed availabilities. We could optimize the situation by creating a specialized cache that would only contain conditions that deal with availability state (i.e. is down, is up) and use that instead of the full set of availability related conditions as indicated above.
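The idea above could be sketched roughly as follows. This is not the RHQ code: the class, the `Operator` values and the `checkConditions(Map)` signature are hypothetical stand-ins for the real cache, used only to show how reported changes might be merged with the last known state so that "is down" / "is up" conditions are re-evaluated on every report.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the RHQ code base.
public class AvailabilityStateSketch {
    public enum Avail { UP, DOWN }
    public enum Operator { GOES_DOWN, GOES_UP, IS_DOWN, IS_UP }

    static final class Condition {
        final int conditionId;
        final int resourceId;
        final Operator operator;

        Condition(int conditionId, int resourceId, Operator operator) {
            this.conditionId = conditionId;
            this.resourceId = resourceId;
            this.operator = operator;
        }
    }

    // Last known availability per resource id, updated after every report.
    private final Map<Integer, Avail> lastKnown = new HashMap<>();
    private final List<Condition> conditions = new ArrayList<>();

    public void addCondition(int conditionId, int resourceId, Operator operator) {
        conditions.add(new Condition(conditionId, resourceId, operator));
    }

    // Evaluates every condition on every report, not only those whose
    // resource appears in the report. Returns the ids of matched conditions.
    public List<Integer> checkConditions(Map<Integer, Avail> reportedChanges) {
        List<Integer> matched = new ArrayList<>();
        for (Condition c : conditions) {
            Avail previous = lastKnown.get(c.resourceId);
            // Merge: take the reported value if present, else the last known one.
            Avail current = reportedChanges.getOrDefault(c.resourceId, previous);
            if (current == null) {
                continue; // nothing known about this resource yet
            }
            // Assumption: the first reported value counts as a change.
            boolean changed = reportedChanges.containsKey(c.resourceId) && current != previous;
            boolean hit;
            switch (c.operator) {
                case GOES_DOWN: hit = changed && current == Avail.DOWN; break;
                case GOES_UP:   hit = changed && current == Avail.UP;   break;
                case IS_DOWN:   hit = current == Avail.DOWN;            break;
                default:        hit = current == Avail.UP;              break; // IS_UP
            }
            if (hit) {
                matched.add(c.conditionId);
            }
        }
        lastKnown.putAll(reportedChanges);
        return matched;
    }
}
```

The loop over all conditions is the costly part mentioned above; restricting `conditions` to the state-based ones (IS_DOWN, IS_UP) would correspond to the specialized cache.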
Relevant case: https://enterprise.redhat.com/issue-tracker/302566
There are two parts to this:
1) The availability report from the agent (99% of the time) only reports deltas up to the server - this was done because of the RLE (run-length encoded) nature of availability data
2) You can only alert on deltas (the cache only checks for deltas) - this was done specifically because of #1
However, there are a few other things to keep in mind. We have the live availability precomputed for every resource in the rhq_resource_avail table, so crafting an in-memory cache of the current availabilities /could/ be done (a single query), but...availability data doesn't always come from the agent (the suspect-agent / backfiller job can mark resources as down too) so we'd need to implement a cache reloading mechanism for when availability data becomes stale.
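A minimal sketch of such a reloadable cache, under stated assumptions: the `loader` stands in for the single query against rhq_resource_avail, and `markStale()` is a hypothetical hook that something like the backfiller job would call after changing availability behind the cache's back. None of these names exist in RHQ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; class and method names are illustrative only.
public class CurrentAvailabilityCache {
    // Stand-in for the single query loading resource id -> "is up".
    private final Supplier<Map<Integer, Boolean>> loader;
    private Map<Integer, Boolean> upByResourceId = new HashMap<>();
    private boolean stale = true; // forces a load on first use

    public CurrentAvailabilityCache(Supplier<Map<Integer, Boolean>> loader) {
        this.loader = loader;
    }

    // Called when availability was changed outside the agent report path.
    public synchronized void markStale() {
        stale = true;
    }

    // Applies the deltas from an agent report on top of the cached state.
    public synchronized void applyAgentReport(Map<Integer, Boolean> changes) {
        reloadIfStale();
        upByResourceId.putAll(changes);
    }

    // Returns null when nothing is known about the resource.
    public synchronized Boolean isUp(int resourceId) {
        reloadIfStale();
        return upByResourceId.get(resourceId);
    }

    private void reloadIfStale() {
        if (stale) {
            upByResourceId = new HashMap<>(loader.get());
            stale = false;
        }
    }
}
```

Reloading lazily on the next access keeps the stale flag cheap to set, at the cost of one full reload after each out-of-band change.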
An alternate solution could be a system-level configuration that turns RLE on or off. If RLE is off, then the agent will always report availability for all of its managed resources, which would enable availability-based alerting to have 4 possible options: goes down, comes up, is down, is up (the last two being possible only when RLE is off).
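To illustrate the difference the setting would make on the agent side, here is a rough sketch. The `buildReport` method and the `rleEnabled` flag are hypothetical, not the agent's actual API: with RLE on, only resources whose availability changed are reported; with RLE off, every managed resource is reported each time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the RHQ agent's report-building code.
public class AvailabilityReportSketch {
    public enum Avail { UP, DOWN }

    // previous: availability at the last report; current: availability now.
    public static Map<Integer, Avail> buildReport(Map<Integer, Avail> previous,
                                                  Map<Integer, Avail> current,
                                                  boolean rleEnabled) {
        Map<Integer, Avail> report = new LinkedHashMap<>();
        for (Map.Entry<Integer, Avail> e : current.entrySet()) {
            boolean changed = previous.get(e.getKey()) != e.getValue();
            // RLE on: deltas only. RLE off: full state on every report.
            if (!rleEnabled || changed) {
                report.put(e.getKey(), e.getValue());
            }
        }
        return report;
    }
}
```

With full reports the server sees the state of every resource on each report, so "is down" / "is up" conditions can be checked without a separate cache of last known values.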
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2130
This has basically been addressed with Availability Duration alerting in the
This is in master.
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.