Red Hat Bugzilla – Bug 830463
Re-enabling (actually, updating) an alert definition clears condition logs for existing alerts
Last modified: 2013-09-01 06:16:30 EDT
Have an alert fire and disable on fire.
Then when the alert has fired, check the condition log in the UI.
Then re-enable the alert definition and check the condition log for the existing alert again - they are gone.
Problem in the source is that somewhere the UI calls
with the last parameter, which is called in the interface "update internals" ,
In reality that is purging the existing condition logs, which should only happen (if at all) when the conditions change.
Interestingly enough, in the implementation class the parameter is called "purge internals"
To reproduce it is important to edit the alert definition to mark it as active again.
The call in the UI starts here:
which then adds the hardcoded 'true' to the "purgeInternals" call.
I set this to urgent, as the "clear condition logs" also shows up when the user is e.g. adding a new notification.
The issue here is not enabling necessarily because that does not by itself cause the issue. If you enable/disable using the buttons on the alert def list view this will not happen.
The issue is editing the alert def in any way using edit mode in the gui (i.e. selecting the alert def detail view, clicking the 'Edit' button and subsequently the 'Save' button.
This mechanism is not selectively setting the 'purge internals' option but rather setting it for all updates.
I believe, as Heiko says above, that only a change to the condition set should force this option.
Looking into doing this...
(In reply to comment #3)
> The issue here is not enabling necessarily because that does not by itself
> cause the issue. If you enable/disable using the buttons on the alert def
> list view this will not happen.
Yes. The enable-via-edit was just the first case where I found it.
> I believe, as Heiko says above, that only a change to the condition set
> should force this option.
And even here we should be more clever in the sense that we do not remove the old condition logs, but "freeze" them in a way that it is obvious that the data is from an outdated condition set, but still available for reference. Currently we show nothing, which makes users wonder why the alert did fire at all.
As the condition logs are also exposed at least on the REST interface, this would mean SLSB changes.
Additionally, changes to dampening also rightly invalidate the existing condition logs
I'm not sure yet about the suggestion above about being able to maintain old condition logs in the face of a change to the condition set or dampening. We may have fundamental issues providing this feature. It requires investigation. I'm still concentrating on part 1 at the moment, which is to keep the condition logs when they are still valid.
OK, the problem is deeper than I thought. It turns out that any time an alert def is updated that we blindly update all of conditions, dampening, notifications, etc. Regardless of whether they actually changed. Removing the conditions cascade deletes the condition logs (as it should).
To avoid this, I think we need to put the responsibility on the caller. If it can provide information as to whether the conditions (or dampening) are updated then we can avoid this problem. If not we have to assume there was a change.
This goes back to the 'purgeInternals' flag. Since this flag should be set true when conditions or dampening rules are modified, it basically serves the purpose we need. It may be better to rename this to 'updateConditions' or something, to be more clear at the calling level.
With respect to the idea of maintaining old condition logs. The fundamental change to support this with today's infrastructure would be to keep around AlertConditions that are no longer tied to an AlertDefinition. Instead, keep them around until the Alerts they are tied to are gone. That means a tougher cleanup, as we'd have to look for orphans, no longer referenced by AlertConditionLogs.
master commit 3bbec0bbb45429d1f0769753466faa3e26e5623f
This is sort of complex. The underlying issue is that
existing AlertConditions with updatedDef.conditions. The detached
conditions were deleted (via Hibernate and DELETE_ORPHAN) regardless of the
fact that they very possibly linked to AlertConditionLog records for existing
Alerts. These log records are what you need to display the actual condition
evaluations that explain why the alert was fired.
The fix was to allow the detached AlertCondition records to stay around and
to then deal with the fallout of doing that:
- clean up orphaned AlertConditions (no def and no logs) in the data purge job
- update queries backing AlertManager.deleteAlertsByContext()
- update queries backing resource uninventory bulk delete
Additionally, there was a lot of confusion around the 'purgeInternals' parameter
for updateAlertDefinition(). It was initially thought that setting this true
was responsible for the loss of AlertConditions backing the Alerts. Actually,
what this flag does, when set to true, is to erase any partial condition
matching that has taken place. It must be set true if conditions, dampening rules, or the condition expression (All vs Any matching) are being updated. Otherwise it can be false.
I renamed this flag to be 'resetMatching' to hopefully be more clear, and
added some jdoc for the relevant methods.
I had already refactored CoreGUI's editor to correctly set this flag before
realizing it was not the root cause. So that logic is in place and
will optimize the update to some degree.
I think the backend logic can be further optimized using this setting more
intelligently. I plan a subsequent check-in for that before setting to ON_QA.
master commit 2c5974fbda6a05ab0d7d233c190740d45df067d5
Some follow-on work. Add optimization such that if conditions are not being
updated we don't detach the AlertCondition from the def, since they are
- Add related unit testing
- re-enable the tests in AlertConditionTest, which pass but for some
reason were not enabled.
- Fix a few parameter names that I missed updating last time.
Moving to ON_QA.
Testing here should revolve around various types of alert updating and ensuring that Alerts don't lose their conditions logs, meaning the Alert should still be able to display the reasons it triggered. Template and group level edits should be tested as well, although resource-level edits should be most intensive.
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.