1019472 – Support for aggregate alerting

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Alerts
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-10-15 19:40 UTC by Elias Ross
Modified:	2022-03-31 04:28 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Embargoed:

Attachments
Project containing the aggregate alert processor (160.00 KB, application/octet-stream) 2013-11-05 17:18 UTC, Elias Ross
alert notification fixes (11.77 KB, patch) 2013-11-05 17:21 UTC, Elias Ross
GUI fix for alerts with null resources (11.37 KB, patch) 2013-11-05 17:21 UTC, Elias Ross

Description Elias Ross 2013-10-15 19:40:16 UTC

Using a server plugin, I've added a way to alert on aggregate metrics.

The code is being reviewed for an add-on to RHQ.

This Bugzilla is basically a summary of that work. The plugin help text explains the plugin design pretty well:

Alerts on resource group metrics. This is for supporting an alerting feature RHQ does not currently have.
Usage: Tag resource groups using the following convention:

'alert:' func '(' metric op value ')' [ ',' time 'm' ]

* func
one of: sum avg min max avail, or for yesterday/week comparisons: sumd sumw avgd avgw. A 'd' suffix means compare with the same interval yesterday; a 'w' suffix means compare with the same interval 7 days ago. avail checks the number of resources currently marked as available.
* metric
the name of the metric for the resource type, e.g. NumberCommandsInQueue. For 'avail', the metric is either percent (a decimal from 0 to 1.0) or count (the number of hosts). Refer to the plugin.xml file for the name to use. (Partial matches are okay.)
* op
one of < or >
* value
a numeric value, parsed as a double, representing the absolute value or, for comparisons, a percentage represented as a decimal. For example, 0.5 means 50%.
* time
a numeric value, parsed as an integer, giving the number of minutes to look back.
Example: alert:avg(NumberCommandsInQueue>100),5m means for this metric, if the average value is over 100 for the past 5 minutes, alert.

Example: alert:avgd(NumberCommandsInQueue>1.1),30m means for this metric, if the average value over the past 30 minutes is more than 10% above the average for the same 30-minute window yesterday, alert.
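
To make the tag convention above concrete, here is a minimal parsing sketch in Python. It is not the plugin's code (the plugin is a Java server plugin); the names AggregateAlertTag and parse_tag are hypothetical, and the handling of whitespace and of a missing time part are assumptions, since the help text does not specify them.

```python
import math
import re
from dataclasses import dataclass
from typing import Optional

# Grammar from the help text:
#   'alert:' func '(' metric op value ')' [ ',' time 'm' ]
_TAG_RE = re.compile(
    r"^alert:(?P<func>sumd|sumw|avgd|avgw|sum|avg|min|max|avail)"
    r"\((?P<metric>[^<>()]+)(?P<op>[<>])(?P<value>[^<>()]+)\)"
    r"(?:,(?P<time>\d+)m)?$"
)


@dataclass(frozen=True)
class AggregateAlertTag:  # hypothetical name, for illustration only
    func: str
    metric: str
    op: str
    value: float
    minutes: Optional[int]  # None when the optional time part is absent


def parse_tag(tag: str) -> Optional[AggregateAlertTag]:
    """Return the parsed tag, or None if the syntax is wrong."""
    m = _TAG_RE.match(tag.strip())
    if m is None:
        return None
    try:
        value = float(m.group("value"))
    except ValueError:
        return None
    if not math.isfinite(value):
        # Assumption: NaN/inf thresholds are treated as bad syntax.
        return None
    minutes = int(m.group("time")) if m.group("time") else None
    return AggregateAlertTag(
        m.group("func"), m.group("metric").strip(), m.group("op"), value, minutes
    )


if __name__ == "__main__":
    print(parse_tag("alert:avg(NumberCommandsInQueue>100),5m"))
    print(parse_tag("alert:avgd(NumberCommandsInQueue>1.1),30m"))
    print(parse_tag("alert:median(x>1)"))  # unknown func, so None
```

A None result corresponds to the "wrong syntax" case, which the plugin handles by removing the tag.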

Implementation notes:

* Tags are only applicable on compatible resource groups and are removed if found on resources.
* Tags that have the wrong syntax are removed.
* Evaluation always happens every 5 minutes for all metrics, sequentially (TODO concurrently). Sequential processing is probably okay as most aggregate queries finish very quickly (in a fraction of a second).
* Changing a tag may create orphaned metric definitions (TODO cleanup automatically). These are unlikely to cause trouble if they accumulate, however. They can always be deleted from the UI.
* Sums are computed by taking the metric average and multiplying by the current size of the group. (It is not a true sum.) Adding and removing resources should not affect the estimate.
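
The sum estimate described in the last note can be sketched as follows. This is an illustration only: estimated_group_sum is a hypothetical name, and it assumes the average is taken over whichever group members actually reported a value in the window.

```python
from statistics import fmean
from typing import Sequence


def estimated_group_sum(reported_values: Sequence[float], group_size: int) -> float:
    """Estimate a group sum as (average of reported values) * (current group size).

    This is not a true sum: members that reported nothing are
    implicitly counted at the average of the others.
    """
    if not reported_values:
        # Assumption: no data yields NaN rather than 0.
        return float("nan")
    return fmean(reported_values) * group_size


if __name__ == "__main__":
    # Group of 4 resources, only 3 of which reported in the window.
    print(estimated_group_sum([10.0, 20.0, 30.0], 4))  # average 20.0 * 4 = 80.0
```

In this example the reported values add up to 60, while the estimate is 80 because the silent fourth member is counted at the group average.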

There are many assumptions. One is that metrics are actually being gathered for every active resource, and also that they are being captured more frequently than the window. If there are NaNs or no values found, these are logged as warnings. (Shouldn't they appear as separate alerts?)
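
A rough sketch of how a single check might treat NaNs and the yesterday/week comparisons follows. It is not the plugin's code: is_breached is a hypothetical name, returning None for "no verdict" stands in for the logged warning, and multiplying the tag value by the baseline is one reading of the avgd example (1.1 meaning 110% of yesterday's average).

```python
import math
from typing import Optional


def is_breached(current: float, op: str, value: float,
                baseline: Optional[float] = None) -> Optional[bool]:
    """Evaluate one aggregate check.

    current  -- aggregate over the look-back window
    value    -- number from the tag: absolute threshold, or a ratio
                when a baseline is given (sumd/sumw/avgd/avgw)
    baseline -- aggregate for the same window yesterday or 7 days ago
    Returns None when no verdict is possible (NaN input).
    """
    if math.isnan(current) or (baseline is not None and math.isnan(baseline)):
        return None  # assumption: the caller logs a warning and skips the alert
    threshold = value if baseline is None else value * baseline
    if op == ">":
        return current > threshold
    if op == "<":
        return current < threshold
    raise ValueError(f"unsupported operator: {op!r}")


if __name__ == "__main__":
    print(is_breached(120.0, ">", 100.0))                  # absolute threshold
    print(is_breached(115.0, ">", 1.1, baseline=100.0))    # 15% above yesterday
    print(is_breached(float("nan"), ">", 100.0))           # no data: None
```

Keeping "no verdict" distinct from "not breached" leaves room for raising missing data as its own alert, as the question above suggests.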

---

Although the plugin is functional, the following is still needed for it to become a full-fledged feature:
* UI Support for listing/acknowledging alerts that have no corresponding resource, but come from a resource group
* UI Support for creating and modifying these sorts of alert definitions
* Better scheduling (frequency is fixed) and concurrency support (checks are serialized). However, with Cassandra, it seems even with very large groups (100 metrics), metrics checks are very fast, so potentially a thousand

Comment 1 Elias Ross 2013-10-15 21:24:01 UTC

On aggregate sums, see Bug 675775.

Comment 2 Elias Ross 2013-11-05 17:18:11 UTC

Created attachment 819903
Project containing the aggregate alert processor

Build and deploy to RHQ 4.9.

You should also apply the provided patches to the server.

Comment 3 Elias Ross 2013-11-05 17:21:10 UTC

Created attachment 819904
alert notification fixes

Comment 4 Elias Ross 2013-11-05 17:21:49 UTC

Created attachment 819905
GUI fix for alerts with null resources
