Bug 535435 (RHQ-2130) - Make it possible to query availability state in the alert conditions
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: RHQ-2130
Product: RHQ Project
Classification: Other
Component: Alerts
Version: unspecified
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: RHQ 4.4.0
Assignee: Jay Shaughnessy
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks: 741450
 
Reported: 2009-06-05 19:43 UTC by Lukas Krejci
Modified: 2013-09-01 10:09 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-01 10:09:43 UTC
Embargoed:



Description Lukas Krejci 2009-06-05 19:43:00 UTC
Currently it is not possible to create an alert condition that checks the availability state of a resource - only a change in it (i.e. goes DOWN, goes UP).

This prevents us from creating dampening rules that would smooth out, for example, short-term network outages.

The reason why it is not technically possible now is that we only report availability changes from the agents, not the states of all resources every time an availability report is generated.

The simplest solution to the problem would be to change the semantics of the AlertConditionCacheManagerLocal.checkConditions(Availability...) method (which boils down to the GlobalConditionsCache.checkConditions(Availability...) method).

Today this method loops through the reported availability changes (its arguments), checks the global cache for an availability condition matching each change, and processes the cached condition if one is found.

If we changed this method to simply match all the availability conditions every time it's called (merging the provided availability changes with the last known state of the values), we'd achieve the re-evaluation of all the availability conditions on receiving an availability report. Obviously this solution is much more computationally intensive than just re-evaluating the changed availabilities. We could optimize the situation by creating a specialized cache that would contain only the conditions that deal with availability state (i.e. is down, is up) and use that instead of the full set of availability-related conditions.
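
A minimal sketch of the merge-and-re-evaluate idea described above, in Java. All class, method, and field names here are hypothetical and chosen for illustration; this is not the RHQ GlobalConditionsCache API, and it assumes a small dedicated cache holding only state conditions ("is down", "is up").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; not the actual RHQ cache classes.
final class AvailabilityStateSketch {
    enum Avail { UP, DOWN }

    /** A state condition such as "is DOWN" on a single resource. */
    static final class StateCondition {
        final int conditionId;
        final int resourceId;
        final Avail expected;

        StateCondition(int conditionId, int resourceId, Avail expected) {
            this.conditionId = conditionId;
            this.resourceId = resourceId;
            this.expected = expected;
        }
    }

    // Last known availability per resource id.
    private final Map<Integer, Avail> lastKnown = new HashMap<>();
    // Specialized cache: only state conditions, not change conditions.
    private final List<StateCondition> stateConditions = new ArrayList<>();

    void addCondition(StateCondition condition) {
        stateConditions.add(condition);
    }

    /**
     * Merges the reported changes into the last known state, then
     * re-evaluates every state condition, not only those whose resource
     * changed. Returns the ids of the conditions that currently hold.
     */
    List<Integer> checkConditions(Map<Integer, Avail> reportedChanges) {
        lastKnown.putAll(reportedChanges);
        List<Integer> matched = new ArrayList<>();
        for (StateCondition c : stateConditions) {
            if (lastKnown.get(c.resourceId) == c.expected) {
                matched.add(c.conditionId);
            }
        }
        return matched;
    }
}
```

With this shape, an availability report that contains no change for a resource still re-matches its "is DOWN" condition, which is what a dampening rule would need. The cost is one pass over all state conditions per report.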



Comment 1 Rodrigo A B Freire 2009-06-15 14:54:14 UTC
Relevant case: https://enterprise.redhat.com/issue-tracker/302566

Comment 2 Joseph Marques 2009-09-02 13:43:10 UTC
There are two parts to this:

1) The availability report from the agent (99% of the time) only reports deltas up to the server - this was done because of the RLE (run-length encoded) nature of availability data
2) You can only alert on deltas (the cache only checks for deltas) - this was done specifically because of #1

However, there are a few other things to keep in mind.  We have the live availability precomputed for every resource in the rhq_resource_avail table, so crafting an in-memory cache of the current availabilities /could/ be done (a single query), but...availability data doesn't always come from the agent (the suspect-agent / backfiller job can mark resources as down too) so we'd need to implement a cache reloading mechanism for when availability data becomes stale.
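
A rough sketch of what such a reloadable cache could look like, in Java. The class and method names are hypothetical, and the loader is only a stand-in for the single query against rhq_resource_avail; how staleness would actually be signalled (for example by the backfiller job) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical names; the loader stands in for the single availability query.
final class ReloadableAvailCache {
    enum Avail { UP, DOWN }

    private final Supplier<Map<Integer, Avail>> loader;
    private Map<Integer, Avail> current = new HashMap<>();
    private boolean stale = true; // forces a load on first access

    ReloadableAvailCache(Supplier<Map<Integer, Avail>> loader) {
        this.loader = loader;
    }

    /** Called when something other than an agent report changes availability. */
    synchronized void markStale() {
        stale = true;
    }

    /** Returns the cached availability, reloading first if marked stale. */
    synchronized Avail get(int resourceId) {
        if (stale) {
            current = new HashMap<>(loader.get()); // defensive copy
            stale = false;
        }
        return current.get(resourceId);
    }
}
```

The cache reloads lazily on the next read after markStale(), so repeated invalidations between reads cost only one reload.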

An alternate solution could be a system-level configuration that turns RLE on or off.  If RLE is off, then the agent will always report availability for all of its managed resources, which would enable availability-based alerting to have 4 possible options: goes down, comes up, is down, is up (the last two being possible when RLE is off).
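
To make the four options concrete, here is a small Java sketch of how each could be evaluated from a previous and a current availability value. The enum and method names are hypothetical, not RHQ types. The two state kinds depend only on the current value, which is why they need a report for every resource rather than just deltas.

```java
// Hypothetical names; illustrates the four condition kinds only.
final class AvailabilityConditionKinds {
    enum Avail { UP, DOWN }

    enum Kind { GOES_DOWN, COMES_UP, IS_DOWN, IS_UP }

    /**
     * Change kinds compare previous and current values; state kinds look
     * at the current value alone, so they can match on an unchanged report.
     */
    static boolean matches(Kind kind, Avail previous, Avail current) {
        switch (kind) {
            case GOES_DOWN:
                return previous == Avail.UP && current == Avail.DOWN;
            case COMES_UP:
                return previous == Avail.DOWN && current == Avail.UP;
            case IS_DOWN:
                return current == Avail.DOWN;
            case IS_UP:
                return current == Avail.UP;
            default:
                throw new IllegalArgumentException("Unknown kind: " + kind);
        }
    }
}
```

For a resource that stays DOWN across two reports, GOES_DOWN does not match on the second report but IS_DOWN does.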

Comment 3 Red Hat Bugzilla 2009-11-10 20:58:26 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2130


Comment 4 Jay Shaughnessy 2012-02-28 20:27:17 UTC
This has basically been addressed with Availability Duration alerting in the
jshaughn/avail branch.

See:

http://rhq-project.org/display/RHQ/Design-Availability+Checking#Design-AvailabilityChecking-AddAvailabilityDurationAlerting

Comment 5 Jay Shaughnessy 2012-03-30 20:40:39 UTC
This is in master.

Comment 6 Heiko W. Rupp 2013-09-01 10:09:43 UTC
Bulk closing of items that are on_qa and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.

