Bug 830463

Summary: Re-enabling (actually, updating) an alert definition clears condition logs for existing alerts
Product: [Other] RHQ Project
Reporter: Heiko W. Rupp <hrupp>
Component: Alerts
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: urgent
Version: 4.4
CC: hrupp, jshaughn
Target Milestone: ---
Target Release: RHQ 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 833090
Environment:
Last Closed: 2013-09-01 10:16:30 UTC
Type: Bug
Regression: ---
Bug Depends On:    
Bug Blocks: 833090    

Description Heiko W. Rupp 2012-06-09 22:04:34 UTC
Have an alert definition set to disable itself when fired, and let the alert fire.

Then when the alert has fired, check the condition log in the UI.

Then re-enable the alert definition and check the condition logs for the existing alert again - they are gone.

The problem in the source is that somewhere the UI calls

org.rhq.enterprise.server.alert.AlertDefinitionManagerLocal#updateAlertDefinition

with the last parameter, called "update internals" in the interface, set to true.
In reality that parameter purges the existing condition logs, which should only happen (if at all) when the conditions change.

Interestingly enough, in the implementation class the parameter is called "purge internals".
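
(Illustrative sketch only, to show the naming mismatch being described; the simplified types and the exact parameter list below are assumptions, not copied from the RHQ source:)

    // Illustrative stubs; the real types live in org.rhq.core.domain.*
    class Subject { }
    class AlertDefinition { }

    interface AlertDefinitionManagerLocal {
        // in the interface the last flag reads like a harmless "update internals"
        AlertDefinition updateAlertDefinition(Subject subject, AlertDefinition def,
                                              boolean updateInternals);
    }

    class AlertDefinitionManagerBean implements AlertDefinitionManagerLocal {
        // ...but the implementation names the same flag "purgeInternals", and when it
        // is true the condition state backing already-fired alerts gets wiped
        public AlertDefinition updateAlertDefinition(Subject subject, AlertDefinition def,
                                                     boolean purgeInternals) {
            return def;
        }
    }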

Comment 1 Heiko W. Rupp 2012-06-09 22:09:57 UTC
To reproduce, it is important to edit the alert definition to mark it as active again.

The call in the UI starts here: 
SingleAlertDefinitionView:145
->
org.rhq.enterprise.gui.coregui.client.alert.definitions.ResourceAlertDefinitionsView#commitAlertDefinition
which then passes a hardcoded 'true' for the "purgeInternals" parameter.
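
(A rough sketch of what that commit path boils down to, reusing the stub types from the sketch in the description; the surrounding class shape is hypothetical, the relevant detail is the hardcoded 'true' last argument:)

    class ResourceAlertDefinitionsViewSketch {
        private final AlertDefinitionManagerLocal alertDefinitionManager;
        private final Subject subject;

        ResourceAlertDefinitionsViewSketch(AlertDefinitionManagerLocal mgr, Subject subject) {
            this.alertDefinitionManager = mgr;
            this.subject = subject;
        }

        void commitAlertDefinition(AlertDefinition updatedDef) {
            // every save from the editor forces the purge behavior, whether or not
            // the conditions actually changed
            alertDefinitionManager.updateAlertDefinition(subject, updatedDef, true);
        }
    }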

Comment 2 Heiko W. Rupp 2012-06-12 09:27:55 UTC
I set this to urgent, as the "clear condition logs" problem also shows up when the user is, for example, adding a new notification.

Comment 3 Jay Shaughnessy 2012-06-18 18:52:31 UTC
The issue here is not necessarily enabling, because that does not by itself cause the issue.  If you enable/disable using the buttons on the alert def list view, this will not happen.

The issue is editing the alert def in any way using edit mode in the GUI (i.e. selecting the alert def detail view, clicking the 'Edit' button and subsequently the 'Save' button).

This mechanism is not selectively setting the 'purge internals' option but rather setting it for all updates.

I believe, as Heiko says above, that only a change to the condition set should force this option.

Looking into doing this...

Comment 4 Heiko W. Rupp 2012-06-19 07:30:07 UTC
(In reply to comment #3)
> The issue here is not enabling necessarily because that does not by itself
> cause the issue.  If you enable/disable using the buttons on the alert def
> list view this will not happen.

Yes. The enable-via-edit was just the first case where I found it.

> I believe, as Heiko says above, that only a change to the condition set
> should force this option.

And even here we should be more clever, in the sense that we do not remove the old condition logs but "freeze" them so that it is obvious the data is from an outdated condition set while still keeping it available for reference. Currently we show nothing, which makes users wonder why the alert fired at all.
As the condition logs are also exposed at least on the REST interface, this would mean SLSB changes.

Comment 5 Jay Shaughnessy 2012-06-19 13:44:30 UTC
Additionally, changes to dampening also rightly invalidate the existing condition logs.

I'm not sure yet about the above suggestion of maintaining old condition logs in the face of a change to the condition set or dampening.  We may have fundamental issues providing this feature; it requires investigation.  I'm still concentrating on part 1 at the moment, which is to keep the condition logs when they are still valid.

Comment 6 Jay Shaughnessy 2012-06-19 15:34:48 UTC
OK, the problem is deeper than I thought.  It turns out that any time an alert def is updated, we blindly update all of the conditions, dampening, notifications, etc., regardless of whether they actually changed.  Removing the conditions cascade-deletes the condition logs (as it should).

To avoid this, I think we need to put the responsibility on the caller.  If it can provide information as to whether the conditions (or dampening) are updated then we can avoid this problem.  If not we have to assume there was a change.

This goes back to the 'purgeInternals' flag.  Since this flag should be set true when conditions or dampening rules are modified, it basically serves the purpose we need.  It may be better to rename it to 'updateConditions' or something similar, to be clearer at the calling level.
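
(For context, a simplified sketch of the kind of cascading mapping that produces this behavior; these are not the actual RHQ entities, only an illustration of a parent-managed collection with Hibernate orphan deletion:)

    import java.util.HashSet;
    import java.util.Set;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.ManyToOne;
    import javax.persistence.OneToMany;
    import org.hibernate.annotations.Cascade;

    @Entity
    class AlertDefinitionSketch {
        @Id
        int id;

        // replacing this collection on every update orphans the old AlertCondition
        // rows, and Hibernate then deletes them
        @OneToMany(mappedBy = "alertDefinition")
        @Cascade({ org.hibernate.annotations.CascadeType.ALL,
                   org.hibernate.annotations.CascadeType.DELETE_ORPHAN })
        Set<AlertConditionSketch> conditions = new HashSet<AlertConditionSketch>();
    }

    @Entity
    class AlertConditionSketch {
        @Id
        int id;

        @ManyToOne
        AlertDefinitionSketch alertDefinition;

        // deleting a condition takes the logs that recorded why an alert fired with it
        @OneToMany(mappedBy = "condition")
        @Cascade({ org.hibernate.annotations.CascadeType.ALL,
                   org.hibernate.annotations.CascadeType.DELETE_ORPHAN })
        Set<AlertConditionLogSketch> logs = new HashSet<AlertConditionLogSketch>();
    }

    @Entity
    class AlertConditionLogSketch {
        @Id
        int id;

        @ManyToOne
        AlertConditionSketch condition;
    }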

Comment 7 Jay Shaughnessy 2012-06-19 15:59:43 UTC
With respect to the idea of maintaining old condition logs: the fundamental change to support this with today's infrastructure would be to keep around AlertConditions that are no longer tied to an AlertDefinition, and instead keep them until the Alerts they are tied to are gone.  That means a tougher cleanup, as we'd have to look for orphans no longer referenced by AlertConditionLogs.
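
(A rough sketch of what such an orphan cleanup could look like; the query text, method name, and entity field names are hypothetical, shown only to illustrate the idea:)

    import javax.persistence.EntityManager;

    class AlertConditionPurgeSketch {
        // delete AlertConditions that are no longer attached to an AlertDefinition
        // and are not referenced by any AlertConditionLog of an existing Alert
        int purgeOrphanedAlertConditions(EntityManager em) {
            return em.createQuery(
                "DELETE FROM AlertCondition ac "
              + " WHERE ac.alertDefinition IS NULL "
              + "   AND NOT EXISTS (SELECT acl.id FROM AlertConditionLog acl "
              + "                    WHERE acl.condition = ac)")
              .executeUpdate();
        }
    }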

Comment 8 Jay Shaughnessy 2012-06-20 19:56:25 UTC
master commit 3bbec0bbb45429d1f0769753466faa3e26e5623f

This is sort of complex. The underlying issue is that
AlertDefinitionManagerBean.updateAlertDefinition(updatedDef) replaces
existing AlertConditions with updatedDef.conditions.  The detached
conditions were deleted (via Hibernate and DELETE_ORPHAN) regardless of the
fact that they were very possibly linked to AlertConditionLog records for existing
Alerts.  These log records are what you need to display the actual condition
evaluations that explain why the alert was fired.

The fix was to allow the detached AlertCondition records to stay around and
to then deal with the fallout of doing that:
- clean up orphaned AlertConditions (no def and no logs) in the data purge job
- update queries backing AlertManager.deleteAlertsByContext()
- update queries backing resource uninventory bulk delete

Additionally, there was a lot of confusion around the 'purgeInternals' parameter
for updateAlertDefinition().  It was initially thought that setting this true
was responsible for the loss of AlertConditions backing the Alerts.  Actually,
what this flag does, when set to true, is to erase any partial condition
matching that has taken place.  It must be set true if conditions, dampening
rules, or the condition expression (All vs Any matching) are being updated.
Otherwise it can be false.

I renamed this flag to 'resetMatching' to hopefully be clearer, and
added some javadoc for the relevant methods.

I had already refactored CoreGUI's editor to correctly set this flag before
realizing it was not the root cause.  So that logic is in place and
will optimize the update to some degree.
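
(Roughly, the caller-side decision amounts to something like the following; the getter names on AlertDefinition are assumed and the helper is hypothetical, the real editor tracks which parts of the form were actually modified:)

    class ResetMatchingSketch {
        // resetMatching only needs to be true when conditions, dampening, or the
        // All/Any condition expression change; plain notification or name edits
        // can leave partial-match state alone
        static boolean needsResetMatching(AlertDefinition oldDef, AlertDefinition newDef) {
            boolean conditionsChanged = !oldDef.getConditions().equals(newDef.getConditions());
            boolean dampeningChanged = !oldDef.getAlertDampening().equals(newDef.getAlertDampening());
            boolean expressionChanged = !oldDef.getConditionExpression().equals(newDef.getConditionExpression());
            return conditionsChanged || dampeningChanged || expressionChanged;
        }
    }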

I think the backend logic can be further optimized using this setting more
intelligently.  I plan a subsequent check-in for that before setting to ON_QA.

Comment 9 Jay Shaughnessy 2012-06-22 20:56:27 UTC
master commit 2c5974fbda6a05ab0d7d233c190740d45df067d5

Some follow-on work: add an optimization such that if conditions are not being
updated, we don't detach the AlertConditions from the def, since they are
unchanged (see the sketch after the list below).
- Add related unit testing
- Re-enable the tests in AlertConditionTest, which pass but for some
  reason were not enabled.
- Fix a few parameter names that I missed updating last time.
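
(Hypothetical sketch of the server-side effect of that optimization; method, setter, and parameter names are illustrative, not the actual AlertDefinitionManagerBean code:)

    class AlertDefinitionUpdateSketch {
        // only swap in a new condition set when the caller indicates the conditions
        // actually changed; otherwise the existing AlertConditions stay attached and
        // nothing gets detached or orphaned
        void applyUpdate(AlertDefinition existing, AlertDefinition updated,
                         boolean conditionsUpdated) {
            if (conditionsUpdated) {
                // the old AlertConditions become detached from the definition but are
                // kept around for the Alerts that already reference them
                existing.setConditions(updated.getConditions());
            }
            existing.setName(updated.getName());
            existing.setEnabled(updated.getEnabled());
            // ... copy the remaining simple fields as before
        }
    }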


Moving to ON_QA.
Testing here should revolve around various types of alert updating and ensuring that Alerts don't lose their condition logs, meaning the Alert should still be able to display the reasons it triggered.  Template and group level edits should be tested as well, although resource-level edits should be tested most intensively.

Comment 10 Heiko W. Rupp 2013-09-01 10:16:30 UTC
Bulk closing of items that are ON_QA and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.