Bug 830463
Summary: | Re-enabling (actually, updating) an alert definition clears condition logs for existing alerts | |||
---|---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Heiko W. Rupp <hrupp> | |
Component: | Alerts | Assignee: | Jay Shaughnessy <jshaughn> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | 4.4 | CC: | hrupp, jshaughn | |
Target Milestone: | --- | |||
Target Release: | RHQ 4.5.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 833090 (view as bug list) | Environment: | ||
Last Closed: | 2013-09-01 10:16:30 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 833090 |
Description
Heiko W. Rupp
2012-06-09 22:04:34 UTC
To reproduce it is important to edit the alert definition to mark it as active again. The call in the UI starts here: SingleAlertDefinitionView:145 -> org.rhq.enterprise.gui.coregui.client.alert.definitions.ResourceAlertDefinitionsView#commitAlertDefinition which then adds the hardcoded 'true' to the "purgeInternals" call. I set this to urgent, as the "clear condition logs" also shows up when the user is e.g. adding a new notification. The issue here is not enabling necessarily because that does not by itself cause the issue. If you enable/disable using the buttons on the alert def list view this will not happen. The issue is editing the alert def in any way using edit mode in the gui (i.e. selecting the alert def detail view, clicking the 'Edit' button and subsequently the 'Save' button. This mechanism is not selectively setting the 'purge internals' option but rather setting it for all updates. I believe, as Heiko says above, that only a change to the condition set should force this option. Looking into doing this... (In reply to comment #3) > The issue here is not enabling necessarily because that does not by itself > cause the issue. If you enable/disable using the buttons on the alert def > list view this will not happen. Yes. The enable-via-edit was just the first case where I found it. > I believe, as Heiko says above, that only a change to the condition set > should force this option. And even here we should be more clever in the sense that we do not remove the old condition logs, but "freeze" them in a way that it is obvious that the data is from an outdated condition set, but still available for reference. Currently we show nothing, which makes users wonder why the alert did fire at all. As the condition logs are also exposed at least on the REST interface, this would mean SLSB changes. Additionally, changes to dampening also rightly invalidate the existing condition logs I'm not sure yet about the suggestion above about being able to maintain old condition logs in the face of a change to the condition set or dampening. We may have fundamental issues providing this feature. It requires investigation. I'm still concentrating on part 1 at the moment, which is to keep the condition logs when they are still valid. OK, the problem is deeper than I thought. It turns out that any time an alert def is updated that we blindly update all of conditions, dampening, notifications, etc. Regardless of whether they actually changed. Removing the conditions cascade deletes the condition logs (as it should). To avoid this, I think we need to put the responsibility on the caller. If it can provide information as to whether the conditions (or dampening) are updated then we can avoid this problem. If not we have to assume there was a change. This goes back to the 'purgeInternals' flag. Since this flag should be set true when conditions or dampening rules are modified, it basically serves the purpose we need. It may be better to rename this to 'updateConditions' or something, to be more clear at the calling level. With respect to the idea of maintaining old condition logs. The fundamental change to support this with today's infrastructure would be to keep around AlertConditions that are no longer tied to an AlertDefinition. Instead, keep them around until the Alerts they are tied to are gone. That means a tougher cleanup, as we'd have to look for orphans, no longer referenced by AlertConditionLogs. master commit 3bbec0bbb45429d1f0769753466faa3e26e5623f This is sort of complex. The underlying issue is that AlertDefinitionManagerBean.updateAlertDefinition(updatedDef) replaces existing AlertConditions with updatedDef.conditions. The detached conditions were deleted (via Hibernate and DELETE_ORPHAN) regardless of the fact that they very possibly linked to AlertConditionLog records for existing Alerts. These log records are what you need to display the actual condition evaluations that explain why the alert was fired. The fix was to allow the detached AlertCondition records to stay around and to then deal with the fallout of doing that: - clean up orphaned AlertConditions (no def and no logs) in the data purge job - update queries backing AlertManager.deleteAlertsByContext() - update queries backing resource uninventory bulk delete Additionally, there was a lot of confusion around the 'purgeInternals' parameter for updateAlertDefinition(). It was initially thought that setting this true was responsible for the loss of AlertConditions backing the Alerts. Actually, what this flag does, when set to true, is to erase any partial condition matching that has taken place. It must be set true if conditions, dampening rules, or the condition expression (All vs Any matching) are being updated. Otherwise it can be false. I renamed this flag to be 'resetMatching' to hopefully be more clear, and added some jdoc for the relevant methods. I had already refactored CoreGUI's editor to correctly set this flag before realizing it was not the root cause. So that logic is in place and will optimize the update to some degree. I think the backend logic can be further optimized using this setting more intelligently. I plan a subsequent check-in for that before setting to ON_QA. master commit 2c5974fbda6a05ab0d7d233c190740d45df067d5 Some follow-on work. Add optimization such that if conditions are not being updated we don't detach the AlertCondition from the def, since they are unchanged. - Add related unit testing - re-enable the tests in AlertConditionTest, which pass but for some reason were not enabled. - Fix a few parameter names that I missed updating last time. Moving to ON_QA. Testing here should revolve around various types of alert updating and ensuring that Alerts don't lose their conditions logs, meaning the Alert should still be able to display the reasons it triggered. Template and group level edits should be tested as well, although resource-level edits should be most intensive. Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since. |