Using a server plugin, I've added a way to alert based on aggregate metrics. The code is being reviewed as an add-on to RHQ. This Bugzilla is basically a summary of that work. The plugin help text explains the plugin design pretty well:

--
Alerts on resource group metrics. This supports an alerting feature RHQ does not currently have.

Usage: Tag resource groups using the following convention:

  'alert:' func '(' metric op value ')' [ ',' time 'm' ]

* func: one of sum, avg, min, max, avail, or, for yesterday/week comparisons, sumd, sumw, avgd, avgw. A 'd' suffix means compare with the same interval yesterday; a 'w' suffix means compare with the same interval 7 days ago. avail checks the number of resources currently marked as available.
* metric: the name of the metric for the resource type, e.g. NumberCommandsInQueue. For 'avail', the metric is either percent (as a decimal, 0-1.0) or count (the number of hosts). Refer to the plugin.xml file for the name to use. (Partial matches are okay.)
* op: one of < or >
* value: a numeric value, parsed as a double, representing the absolute value or, for comparisons, a percentage represented as a decimal. For example, 0.5 means 50%.
* time: a numeric value, parsed as an integer, meaning the amount of time to look back, in minutes.

Example: alert:avg(NumberCommandsInQueue>100),5m means for this metric, if the average value is over 100 for the past 5 minutes, alert.

Example: alert:avgd(NumberCommandsInQueue>1.1),30m means for this metric, if the average value yesterday was 10% over for a 30-minute window, alert.

Implementation notes:

* Tags are only applicable on compatible resource groups and are removed if found on resources.
* Tags that have the wrong syntax are removed.
* Evaluation always happens every 5 minutes for all metrics, sequentially (TODO concurrently). Sequential processing is probably okay, as most aggregate queries finish very quickly (in a fraction of a second).
* Changing a tag may create orphan metric definitions (TODO clean up automatically). These are unlikely to cause trouble if they accumulate, however, and they can always be deleted from the UI.
* Sums are computed by taking the metric average and multiplying by the current size of the group. (It is not a true sum.) Adding and removing resources should not affect the estimate.

There are many assumptions. One is that metrics are actually being gathered for every active resource, and also that they are being captured more frequently than the window. If there are NaNs or no values found, these are logged as warnings. (Shouldn't they appear as separate alerts?)
---

Although the plugin is functional, this is what is still needed to make it a full-fledged feature:

* UI support for listing/acknowledging alerts that have no corresponding resource, but come from a resource group
* UI support for creating and modifying these sorts of alert definitions
* Better scheduling (frequency is fixed) and concurrency support (checks are serialized). However, with Cassandra, it seems even with very large groups (100 metrics), metrics checks are very fast, so potentially a thousand
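
For anyone reviewing the tag convention from the help text, here is a rough sketch of how such a tag could be parsed. It is written in Python for brevity and is not the plugin's actual code; the names parse_alert_tag, AlertTag and TAG_RE are made up for this illustration, and treating any non-finite value as a syntax error is an assumption.

```python
import math
import re
from dataclasses import dataclass
from typing import Optional

# Function names listed in the help text.
FUNCS = ("sum", "avg", "min", "max", "avail", "sumd", "sumw", "avgd", "avgw")

# 'alert:' func '(' metric op value ')' [ ',' time 'm' ]
TAG_RE = re.compile(
    r"^alert:(?P<func>[a-z]+)"
    r"\((?P<metric>[^<>()]+)(?P<op>[<>])(?P<value>[^()]+)\)"
    r"(?:,(?P<time>\d+)m)?$"
)


@dataclass(frozen=True)
class AlertTag:
    func: str
    metric: str
    op: str
    value: float
    minutes: Optional[int]  # None when the optional time part is absent


def parse_alert_tag(tag: str) -> Optional[AlertTag]:
    """Return the parsed tag, or None when the syntax is wrong."""
    m = TAG_RE.match(tag.strip())
    if m is None or m.group("func") not in FUNCS:
        return None
    try:
        value = float(m.group("value"))  # "parsed as a double"
    except ValueError:
        return None
    if not math.isfinite(value):
        # Assumption: "nan" and "inf" are rejected as bad syntax.
        return None
    time_part = m.group("time")
    minutes = int(time_part) if time_part is not None else None
    return AlertTag(m.group("func"), m.group("metric").strip(),
                    m.group("op"), value, minutes)
```

A tag that returns None here corresponds to the "wrong syntax" case in the implementation notes, where the tag would be removed.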
On aggregate sums; see Bug 675775
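
To make the sum estimate from the help text concrete (metric average multiplied by the current group size), here is a small Python sketch. It is only an illustration under stated assumptions, not the plugin's code: the function name estimate_group_sum is hypothetical, and skipping NaN samples and returning None when nothing usable remains is my assumption about how the "logged as warnings" case might be handled.

```python
import math
from typing import Iterable, Optional


def estimate_group_sum(samples: Iterable[float], group_size: int) -> Optional[float]:
    """Estimate a group sum as (average of samples) * (current group size).

    This is not a true sum: members that reported no data are assumed to
    look like the average of the members that did report.
    """
    # Assumption: NaN samples are skipped rather than poisoning the average.
    usable = [v for v in samples if not math.isnan(v)]
    if not usable:
        # No values found; the caller would log a warning in this case.
        return None
    average = sum(usable) / len(usable)
    return average * group_size
```

For example, with samples 10.0 and 20.0 from a group of 4 resources, the estimate is 15.0 * 4 = 60.0, even though only 30.0 was actually observed. That is why the estimate depends on metrics being gathered for every active resource.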
Created attachment 819903
Project containing the aggregate alert processor

Build and deploy to RHQ 4.9. You should also patch the server with the patches provided.
Created attachment 819904
alert notification fixes
Created attachment 819905
GUI fix for alerts with null resources