Bug 536197 (RHQ-573)
Summary: | refactor how and when alert definitions are created from their corresponding alert templates | |
---|---|---|---
Product: | [Other] RHQ Project | Reporter: | Joseph Marques <jmarques>
Component: | Alerts | Assignee: | Jay Shaughnessy <jshaughn>
Status: | CLOSED NEXTRELEASE | QA Contact: | Corey Welton <cwelton>
Severity: | medium | Priority: | medium
Version: | 1.0 | Keywords: | Improvement
Hardware: | All | OS: | All
Fixed In Version: | 1.1 | Doc Type: | Enhancement
URL: | http://jira.rhq-project.org/browse/RHQ-573 | |
Description
Joseph Marques 2008-06-12 01:43:00 UTC
One subtlety here is that InventoryManager, and for that matter the agent side in general, rightfully has no knowledge of alerting. By adding a call in InventoryManager.synchronizeInventory to update alert templates, we would be exposing alerting to the plugin container/agent for a perceived performance gain. It seems a little messy. The easiest and cleanest approach would be to add the alert template processing during import into inventory. This keeps it all server-side, but I believe this is what the original design was trying to avoid. It doesn't seem like too big a hit to me, but if we are trying to minimize import time then any hit is significant. If we must actually call the agent just to have it call back to the server to process alert templates, then a potential alternative is to add a new synchronization type, something like COMMIT, which is basically a request from server to agent to call something like DiscoverServerServices.commitResource() as part of its sync logic. This would in essence be a general callback so that the server could complete its commit-time logic, which at the moment would be the application of alert templates. Any suggestions on approach?

OK, well, instead of exposing two server-side interfaces to the agent (one for the schedules, one for updating alerts):

    getMeasurementServerService().getLatestSchedulesForResourceId(int resourceId, boolean recursive/forDescendants);
    getAlertsServerService().applyTemplatesForResourceId(int resourceId, boolean recursive/forDescendants);

why don't we just have a single server-side interface, something like:

    getDiscoveryServerService().finishResourceCommit(int resourceId, boolean recursive/forDescendants);

This facade would be the single point for developers to add more out-of-band commit-time logic. It would perform the same logic that getLatestSchedulesForResourceId does today, but would break up the schedule creation and the alert template processing into separate method calls.
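A rough sketch of what the proposed consolidation could look like. The interface and method names are taken from the comment above; the stub implementation and helper names are hypothetical stand-ins, not the actual RHQ remoting code.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitFacadeSketch {

    // Today: the agent pulls schedules directly, and a separate
    // alerting call would be needed to apply templates.
    interface MeasurementServerService {
        List<String> getLatestSchedulesForResourceId(int resourceId, boolean forDescendants);
    }

    // Proposal: one commit-time entry point with a void return type;
    // the server implementation pushes schedules (and any other
    // commit-time artifacts) back to the agent instead of returning them.
    interface DiscoveryServerService {
        void finishResourceCommit(int resourceId, boolean forDescendants);
    }

    // Hypothetical stand-in server that records what the agent would receive.
    static class StubServer implements DiscoveryServerService {
        final List<String> pushedToAgent = new ArrayList<>();

        @Override
        public void finishResourceCommit(int resourceId, boolean forDescendants) {
            // 1) create measurement schedules (what getLatestSchedulesForResourceId does today)
            pushedToAgent.add("schedules:" + resourceId);
            // 2) apply alert templates server-side; the agent never sees alerting
            pushedToAgent.add("alertTemplatesApplied:" + resourceId);
        }
    }

    public static List<String> demo(int resourceId) {
        StubServer server = new StubServer();
        server.finishResourceCommit(resourceId, true);
        return server.pushedToAgent;
    }

    public static void main(String[] args) {
        System.out.println(demo(42));
    }
}
```

The design point is that the agent only triggers the commit; everything alerting-related stays behind the facade on the server.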
However, whereas getLatestSchedulesForResourceId returns the schedule collection to the agent directly, this finishResourceCommit would have a void return type. The implementation of the method would be responsible for calling back into the agent, passing it the created schedules and other created objects as necessary.

Another, completely different, solution might be to remove the agent from the picture altogether. I know that the intention is to create objects for the resource hierarchy lazily so that the initial inventory commit action is speedy, but this doesn't necessarily have to be an agent-initiated flow; some performance improvements for the inventory merge can be done completely server-side. At a high level, if we submitted a quartz job that triggered a few moments later, out-of-band, then we could finish the alert creation without even talking or listening to the agent.

At least the measurement schedule interface is useful to the agent; it needs to know measurement schedules. The alert interface is not relevant to the agent. I like the idea, though, of making it explicit that a commit-time sync is taking place. This is not apparent in the code at the moment, making one-time sync logic a bit non-obvious. I also like the approach of sending schedules back as a message as opposed to as a return value. The quartz job approach is tempting, but to not rock the boat too much I'll make changes to the current approach. Thanks.

If we're targeting RHQ-322 for 1.1, then this should be 1.1 as well.

I've reverted my proposed server-side changes back to the initial approach, which basically just splits alert template processing out from schedule creation. I have made the alert template sync request explicit and have maintained the measurement schedule sync type. This gives us the most granularity for sync requests. As such, I added an analogous syncAlertTemplatesRecursively method to InventoryManager.
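The quartz-job alternative mentioned above (ultimately not taken) would have this general shape: at inventory-commit time, the server queues a short-delay, out-of-band task that applies alert templates with no agent round trip. Quartz specifics are omitted here; this sketch uses `java.util.concurrent` to illustrate the same pattern, and all names are illustrative.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeferredTemplateJob {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Called at inventory-commit time. Returns immediately, so the
    // import path stays fast; the template work happens moments later.
    public Future<String> scheduleTemplateApplication(int resourceId, long delayMillis) {
        return scheduler.schedule(
                () -> "templates applied for resource " + resourceId, // placeholder for the real server-side work
                delayMillis, TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        scheduler.shutdown();
    }

    public static void main(String[] args) throws Exception {
        DeferredTemplateJob job = new DeferredTemplateJob();
        System.out.println(job.scheduleTemplateApplication(7, 50).get());
        job.shutdown();
    }
}
```

The trade-off discussed in the thread: this keeps alerting entirely server-side, but it gives up the explicit commit-time sync signal that the agent-driven approach provides.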
The alert template sync should be requested only once for a resource tree, whereas measurement schedule sync will continue to be requested any time there is a status change for a resource. The agent can perform the requested syncs as it sees fit for performance. It is important that the following is understood: a resource has to be merged into inventory prior to requesting schedule sync, and schedule sync needs to be completed before applying templates.

With respect to RHQ-664, this returns us to where we were on Friday: the schedules and alert templates are not getting applied to service-level resources. This is due to the fact that, at the moment, the resource merge is done in layers, but the schedule and template sync is done top-down as soon as that first layer merge completes (after the initial blocking of performServiceScan in syncInventoryStatusRecursively). In other words, we're invoking sync for the children of the added roots prior to them being merged into inventory. It needs to happen after child services are discovered and merged. I'll update RHQ-664 momentarily.

I'm not yet setting this to resolved, as further work may be pending. Depending on how we implement the necessary syncing for the child services, we may need to protect against multiple alert template syncs for the same resource.

r1220: Changes to inventory sync with respect to actions taken for server-requested sync and newlyCommitted resources, as opposed to other sync use cases. Server-side sync requests are limited to status changes, and more agent-side logic is used to determine the actions to take, given the more robust syncInfo agent-side. Maintains the async approach of deferring schedule creation and alert template application for imported resources, to ensure the best server-side performance for large imports. This may benefit the problem in RHQ-322, since the logic behind schedule sync/update is affected by the overall sync logic.
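The ordering constraint stated above (merge into inventory, then schedule sync, then template application) and the suggested protection against multiple alert template syncs for the same resource can be sketched as a per-resource state check. This is a minimal illustration, not the actual InventoryManager code; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SyncOrderGuard {
    enum Stage { MERGED, SCHEDULES_SYNCED, TEMPLATES_APPLIED }

    private final Map<Integer, Stage> stages = new HashMap<>();

    // Step 1: the resource must be merged into inventory first.
    public void markMerged(int resourceId) {
        stages.put(resourceId, Stage.MERGED);
    }

    // Step 2: schedule sync is only valid after the merge.
    public void syncSchedules(int resourceId) {
        if (stages.get(resourceId) != Stage.MERGED) {
            throw new IllegalStateException("resource " + resourceId + " not merged yet");
        }
        stages.put(resourceId, Stage.SCHEDULES_SYNCED);
    }

    // Step 3: template application requires completed schedule sync, and a
    // repeat request is a no-op (returns false), guarding against the
    // duplicate alert-definition problem described above.
    public boolean applyAlertTemplates(int resourceId) {
        Stage s = stages.get(resourceId);
        if (s == Stage.TEMPLATES_APPLIED) {
            return false; // already applied once for this resource tree
        }
        if (s != Stage.SCHEDULES_SYNCED) {
            throw new IllegalStateException("schedules not synced for resource " + resourceId);
        }
        stages.put(resourceId, Stage.TEMPLATES_APPLIED);
        return true;
    }
}
```

Usage mirrors the required ordering: markMerged, then syncSchedules, then applyAlertTemplates; a second applyAlertTemplates call for the same resource returns false rather than creating duplicate definitions.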
Although tested with the suggested QA steps in the next comment, I'm not sending this to QA until we've had it running dev-side for a couple of days.

Suggested minimal verification steps for QA: There are a bunch of code paths to test. These tests exercise more resource sync than just alert template generation, for example measurement schedule sync/update. The sync is somewhat comprehensive, and it is important to make sure everything is playing well together. If these pass, be creative and test potentially new scenarios. These instructions use Postgres server resources, both auto-discovered (local) and manually added (remote). The remote postgres server properties I've used:

    jon03.qa.atl2.redhat.com
    5432
    postgres
    org.postgresql.Driver
    jon
    jon
    /home/test_jon/postgres/data

1) Install if necessary; start with a clean database
2) Start the JON server
3) Start the JON agent with --clean
4) Create alert templates for the following types, making sure to check the option to apply to resources in inventory:
   - Your platform
   - Postgres Server
   - Postgres Database
   - Postgres Database Table
   - RHQ Agent JVM (because it has no metrics)
   ** Note: the alerts don't actually have to fire; any alert will do
5) Import your platform and wait for all resources to show up in inventory
6) Ensure that the alert templates generated one and only one correct alert definition for the relevant resources in inventory
7) Ensure metric schedules exist for a sampling of imported resources and that metrics are collected according to schedule by the agent
   ** resources with and without alert definitions
   ** use the default schedules; an update will establish schedules via a different code path
8) Uninventory the platform
9) Repeat steps 5 through 8
10) Manually add the remote postgres server to inventory
11) Ensure that the alert templates generated one and only one correct alert definition for the new relevant resources in inventory
12) Ensure metric schedules exist and metrics are being collected by the agent for the new server and its services
13) Uninventory the remote postgres server
14) Repeat steps 10 through 12
15) Update a sampling of metric schedules and ensure the agent responds correctly and adjusts the collection periods
16) Restart the agent with --purgedata and verify that everything re-establishes correctly, with the same inventory, no duplicate alert templates, and stable metric collection
    ** This just tests recovery in case of data directory loss

Other ideas:
- various agent-down situations
- server-level AD import (uninventory just a server and then re-import)
- service-level automatic import (add more auto-discovered services for servers already in inventory)

Marking this as QA Verified -- inclusive of the steps/variations listed in the previous comment, I tried several different methods to end up with duplicate alerts and was unable to do so.

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-573
This bug relates to RHQ-297
This bug relates to RHQ-322