Bug 749326

Summary:	Add info about triggering drift to drift alert
Product:	[Other] RHQ Project	Reporter:	Charles Crouch <ccrouch>
Component:	Alerts	Assignee:	Nobody <nobody>
Status:	NEW ---	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	medium
Version:	4.2	CC:	hrupp, jshaughn, mazz
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Bug Blocks:	745494

Description Charles Crouch 2011-10-26 18:04:01 UTC

When an alert is fired on the basis of drift being detected, it is useful to know that a particular drift definition associated with a given resource has experienced drift, but it would be even more useful if the alert email also contained:

a) the list of files impacted by drift
b) the change in each file.

Basically the alert email for drift should be like a commit message: what got added/removed/changed. Obviously if there is 50 MB worth of change then it doesn't make sense to try to jam all of that into the email. We would need to truncate the information after a certain point.
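
To make the truncation idea concrete, here is a minimal sketch of a commit-message-style body that stops adding files once a size budget is reached. DriftEmailSummary, FileChange and the maxChars parameter are hypothetical names invented for illustration; nothing like this exists in RHQ today, and the budget only bounds the per-file entries, not the trailing notice.

```java
import java.util.List;

/** Hypothetical helper for illustration; not an RHQ class. */
final class DriftEmailSummary {

    /** One changed file. The type is "added", "removed", or "changed". */
    static final class FileChange {
        final String type;
        final String path;
        final String diff;

        FileChange(String type, String path, String diff) {
            this.type = type;
            this.path = path;
            this.diff = diff;
        }
    }

    /**
     * Builds the email body file by file and stops before the first entry
     * that would push the body past maxChars. A short notice reports how
     * many files were left out (the notice itself is not counted).
     */
    static String format(List<FileChange> changes, int maxChars) {
        StringBuilder body = new StringBuilder();
        int included = 0;
        for (FileChange change : changes) {
            String entry = change.type + ": " + change.path + "\n" + change.diff + "\n";
            if (body.length() + entry.length() > maxChars) {
                break;
            }
            body.append(entry);
            included++;
        }
        int omitted = changes.size() - included;
        if (omitted > 0) {
            body.append("... ").append(omitted).append(" more file(s) omitted\n");
        }
        return body.toString();
    }
}
```

Truncating on file boundaries keeps every diff that does appear complete, at the cost of dropping a single oversized file entirely.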

Comment 1 John Mazzitelli 2011-10-27 21:15:36 UTC

This is going to require the alert subsystem to be enhanced to support this feature.

First, an explanation of how an alert email is fired:

1) Something happens that requires the server to determine whether an alert condition needs to be evaluated. In this case, a drift message from the agent came in. DriftManagerBean, you will notice, has calls to "notifyAlertConditionCacheManager". This is the entry into the alert subsystem - this call passes in a drift summary object, which today includes the filenames that changed and some additional information on the drift (but note it does NOT include the diff).

2) The notifyAlertConditionCacheManager call will find any drift cache elements and check whether a drift condition matches. This is where we check whether a drift condition exists and, if it does, whether the new drift matches the name of the drift definition or matches a drift file regex. If so, the condition is marked as "true".

3) During the condition processing, specifically in "AbstractConditionCache.processCacheElements", if it is determined that an alert condition is true, it will put a message on the JMS queue. This is done here:

    cachedConditionProducer.sendActivateAlertConditionMessage(
        cacheElement.getAlertConditionTriggerId(), timestamp,
        cacheElement.convertValueToString(providedValue), extraParams);

NOTE: this does NOT mean an alert should trigger - it just means a condition is true. We don't know yet if an alert should fire or not.

4) The message is received by the JMS consumer bean: AlertConditionConsumerBean. Through some machinations (e.g. it has to use the AlertDampeningManagerBean to determine if an alert should be dampened), eventually an alert may fire - which means a call to this is made: AlertManagerBean.fireAlert()

5) Eventually, AlertManagerBean.sendAlertNotifications(Alert) is called to send notifications - which may or may not involve sending emails.

6) If the alert email server plugin is to send the notification email, it simply delegates to AlertManagerBean.sendAlertNotificationEmails, which takes the "condition log" string and puts it in the email. This is how the "metric value" that triggered a metric alert shows in the email - it's just a string form of the condition log value.
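
The match performed in step 2 can be sketched roughly as follows. DriftConditionMatcher is a hypothetical class, not the actual cache element code, and how the two checks combine is an assumption: here each regex is optional, and every regex that is supplied must match.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a drift condition check; not the RHQ cache element. */
final class DriftConditionMatcher {
    private final Pattern definitionNamePattern; // null means "any definition"
    private final Pattern filePathPattern;       // null means "any file"

    DriftConditionMatcher(String definitionNameRegex, String filePathRegex) {
        this.definitionNamePattern =
            definitionNameRegex == null ? null : Pattern.compile(definitionNameRegex);
        this.filePathPattern =
            filePathRegex == null ? null : Pattern.compile(filePathRegex);
    }

    /** Returns true when the reported drift satisfies every configured regex. */
    boolean matches(String driftDefinitionName, List<String> changedPaths) {
        if (definitionNamePattern != null
                && !definitionNamePattern.matcher(driftDefinitionName).matches()) {
            return false;
        }
        if (filePathPattern == null) {
            return true;
        }
        // One matching path is enough to mark the condition as true.
        for (String path : changedPaths) {
            if (filePathPattern.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Note that the only result is a boolean; the list of changed paths is consulted and then dropped, which is the behavior the rest of this comment is concerned with.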

The point to all of this? Notice that the "Drift Summary" object is only used very early - during upfront processing of the condition. Once the condition is checked, the summary information is discarded. By the time the alert email is actually sent, we've been through many hoops and, in fact, may even end up on a DIFFERENT RHQ Server (the JMS message may be picked up and processed elsewhere). So we do not have the information about what files changed or the diff (in fact, we never had the diff).
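
A stripped-down, in-memory model of that flow shows where the information is lost. Everything here is a hypothetical stand-in: AlertFlowSketch is not RHQ code, an ArrayDeque replaces the JMS queue, a boolean replaces the dampening logic, and using the first changed path as the condition value is an assumption (today the drift value is empty).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of the six steps; not RHQ code. */
final class AlertFlowSketch {

    /** What the condition check sees (steps 1-2). It never crosses the queue. */
    static final class DriftSummary {
        final String definitionName;
        final List<String> changedPaths;

        DriftSummary(String definitionName, List<String> changedPaths) {
            this.definitionName = definitionName;
            this.changedPaths = changedPaths;
        }
    }

    /** What crosses the queue (step 3): an id, a timestamp and one short string. */
    static final class ConditionMessage {
        final int triggerId;
        final long timestamp;
        final String value;

        ConditionMessage(int triggerId, long timestamp, String value) {
            this.triggerId = triggerId;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    final Queue<ConditionMessage> queue = new ArrayDeque<>(); // stand-in for JMS
    final List<String> sentEmails = new ArrayList<>();

    /** Steps 1-3: evaluate the condition, enqueue only a short string. */
    void onDriftReported(int triggerId, DriftSummary summary, long timestamp) {
        if (!summary.changedPaths.isEmpty()) {
            String value = summary.changedPaths.get(0); // assumption, see lead-in
            queue.add(new ConditionMessage(triggerId, timestamp, value));
        }
        // The summary is not stored anywhere; it is gone after this method returns.
    }

    /** Steps 4-6: possibly on another server, with only the message to work from. */
    void consume(boolean dampened) {
        ConditionMessage message;
        while ((message = queue.poll()) != null) {
            if (dampened) {
                continue; // condition was true, but no alert fires
            }
            sentEmails.add("Alert for condition " + message.triggerId + ": " + message.value);
        }
    }
}
```

In this model the consumer has no way to reach the full list of changed files, let alone a diff, unless the message carries more data or a key that lets the consumer look it up.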

You'll notice in step 3 when we put information on the JMS Queue, it only contains a very small amount of information, like the string value of the "provided value" (this is, for example, the actual metric value that triggered a metric alert definition). In the case of drift, today it is nothing. But we could presumably set it to be something. But it's just a string. And its purpose is to show in the UI (it appears on the row of the table for the condition of the alert). It does also show in the email, step 6.

So, it looks like this is close, but I don't think this can fully support what we want as-is. This provided value, which is shown in the UI (and email), is intended to be a small piece of information, no more than 1 or 2 lines. It is the "condition log" VALUE string. There is no notion of having a large piece of data that comes with the condition log and provides more information about the alert, other than a small piece of data indicating why the condition was true (in the drift case, this would be something like the filename that changed that matched the file regex of the alert definition, or the name of the drift definition that matched the regex of the alert definition). Here we want to add the new notion of "additional information" about an alert condition log, such as the diff of all files within a changeset or the list of all files in a changeset.
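
One possible shape for that "additional information" is a second, optional field next to the short value. The class below is a hypothetical sketch, not the existing condition log entity, and the 250-character limit is an assumed figure chosen only to make the "1 or 2 lines" constraint explicit.

```java
/** Hypothetical sketch of a condition log with optional extra detail; not RHQ code. */
final class ConditionLogEntry {
    static final int MAX_VALUE_LENGTH = 250; // assumed limit for the short value

    private final String value;   // short string shown in the UI table row
    private final String details; // optional large payload, may be null

    ConditionLogEntry(String value, String details) {
        if (value.length() > MAX_VALUE_LENGTH) {
            throw new IllegalArgumentException("condition log value too long");
        }
        this.value = value;
        this.details = details;
    }

    /** Text for the condition row in the UI: unchanged from today's behavior. */
    String forTableRow() {
        return value;
    }

    /** Text for the notification email: the value, plus details when present. */
    String forEmail() {
        return details == null ? value : value + "\n\n" + details;
    }
}
```

Keeping the two fields separate means metric alerts, which have no details, would render exactly as they do now.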

This seems to require additional enhancement to the alert subsystem to allow for such data to flow into the email notification.

Comment 2 Jay Shaughnessy 2011-10-28 14:54:18 UTC

It seems that perhaps the "condition log" value could be for display, or
could maybe be some sort of special token, something that could enable a
lazy call to "replace" the token with actual data for display.  Or, as
you mention above, perhaps a second field for that purpose, so you could
always display things just like today, and pull extra data only when
necessary.

The idea of pushing more data through the existing mechanism for each
condition match seems too heavy, as only a small number of conditions
may ever be incorporated into an alert firing. A lazy mechanism that
can generate the detail info if and when it's necessary would be
nicer, I think, and would not affect today's flow.  Although, if
lazily, or on-demand, evaluated then there is always a chance the
data may no longer be available.  But this is unlikely for alerts as
the window between condition generation and alert firing/notification
is generally small.  UI presentation is another matter, which could happen
much later.
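
The token idea might look something like the sketch below. The "drift-changeset:" prefix, the LazyDetailResolver class and the loader function are all hypothetical; the loader stands in for whatever lookup would fetch change set details, and the fallback text covers the case where the data is no longer available.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical lazy resolver for condition log tokens; not RHQ code. */
final class LazyDetailResolver {
    static final String PREFIX = "drift-changeset:"; // assumed token format

    private final Function<Integer, Optional<String>> changeSetLoader;

    LazyDetailResolver(Function<Integer, Optional<String>> changeSetLoader) {
        this.changeSetLoader = changeSetLoader;
    }

    /**
     * Returns ordinary values untouched. For a token, loads the detail on
     * demand and falls back to a notice if the data has since been purged.
     */
    String resolve(String conditionLogValue) {
        if (!conditionLogValue.startsWith(PREFIX)) {
            return conditionLogValue;
        }
        int changeSetId;
        try {
            changeSetId = Integer.parseInt(conditionLogValue.substring(PREFIX.length()));
        } catch (NumberFormatException e) {
            return conditionLogValue; // malformed token: display it as-is
        }
        return changeSetLoader.apply(changeSetId)
            .orElse("(details for change set " + changeSetId + " are no longer available)");
    }
}
```

With this approach the message on the queue stays as small as it is today, and the cost of building the detail is paid only for conditions that actually end up in a notification or are viewed in the UI.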

Comment 3 Charles Crouch 2011-11-02 14:27:24 UTC

Dropping priority for 3.0 given the scale of the changes required