Bug 536197 (RHQ-573)

Summary: refactor how and when alert definitions are created from their corresponding alert templates
Product: [Other] RHQ Project Reporter: Joseph Marques <jmarques>
Component: AlertsAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED NEXTRELEASE QA Contact: Corey Welton <cwelton>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.0Keywords: Improvement
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
URL: http://jira.rhq-project.org/browse/RHQ-573
Whiteboard:
Fixed In Version: 1.1 Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joseph Marques 2008-06-12 01:43:00 UTC
As a performance improvement, the inventory merge process was trimmed down to only persist a core amount of data the first time the resource is reported up to the server in an inventory report.  As a consequence, other parts of the domain may be created in a separate, agent-initiated transaction - notably measurement schedules.

core/plugin-container -- InventoryManager.synchronizeInventory(int resourceId, EnumSet<SynchronizationType> synchronizationTypes)

calls out to:

getMeasurementServerService().getLatestSchedulesForResourceId(int resourceId, boolean recursive/forDescendants);

From the agent-side calling context, it just looks like this InventoryManager is making a call into the server to create the measurement schedules in an out-of-band way from the regular inventory merge workflow.  However, this method has an interesting side-effect, namely that it looks at the given resources' alert templates and creates alert definitions as appropriately.

Aside from this creating a small bug - resources whose types don't have measurement definition never have alert templates applied to them during inventory merge - this is just a sloppy side-effect.  The ideal solution would be to extract the application of alert templates to new resources into its own server-side, recursive/forDescendants method.  Then it could be called as the very last statement in InventoryManager.synchronizeInventory, which will then create the alert definition regardless of whether the passed resource(s) have measurement schedules created or not.

Comment 1 Jay Shaughnessy 2008-07-07 20:32:56 UTC
One subtlety here is that InventoryManager, and for that matter, the agent side, I think, rightfully has no knowledge of alerting.  By adding a call in InventoryManager.synchronizeInventory to update alert templates we would be exposing alerting to the plugin container/agent for a perceived performance gain.  It seems a little messy.

The easiest and cleanest approach would be to add the alert template processing during import into inventory.  This keeps it all server-side.  But, I believe this is what the original design was trying to avoid. It doesn't seem like too big a hit to me but if we are trying to minimize import time then any hit is significant.

If we must actually call the agent to just call back to the server to process alert templates then a potential alternative is to add a new synchronization type, something like COMMIT which is basically a request from server to agent to call something like DiscoverServerServices.commitResource() as part of it's synch logic.  This would in essence be a general callback such that the server could complete it's commit time logic. Which, would be at the moment, the application of alert templates.

any suggestions on approach?

Comment 2 Joseph Marques 2008-07-07 21:01:29 UTC
ok, well, instead of exposing two server-side interfaces to the agent (one for the schedules, one for updating alerts):

getMeasurementServerService().getLatestSchedulesForResourceId(int resourceId, boolean recursive/forDescendants); 
getAlertsServerService().applyTemplatesForResourceId(int resourceId, boolean recursive/forDescendants);

Why don't we just have a single server-side interface, something like:

getDiscoveryServerService().finishResourceCommit(int resourceId, boolean recursive/forDescendants);

This facade will be the single point for developers to add more out-of-band commit-time logic.  It would perform the same logic that getLatestSchedulesForResourceId does today, but would break up the schedule creation from the alert template stuff into separate method calls.  However, whereas getLatestSchedulesForResourceId returns the schedule collection to the agent directly, this finishResourceCommit would have a void return type.  The implementation of the method would be responsible for calling back into the agent, passing it the created schedules and other created objects, as necessary.

Comment 3 Joseph Marques 2008-07-07 21:04:34 UTC
another completely different solution might be to just remove the agent from the picture altogether.  i know that the intention is to create objects for the resource hierarchy lazily so that the initial inventory commit action is speedy, but this doesn't necessarily have to be an agent-initiated flow.  some performance improvements for the inventory merge can be done completely server-side.  at a high-level, if we submitted a quartz job that triggered a few moments later out-of-band, then we could finish the alert creation without even talking/listening to the agent.  

Comment 4 Jay Shaughnessy 2008-07-07 21:33:05 UTC
At least the measurement schedule interface is useful to the agent, it needs to know measurement schedules.  The alert interface is not relevant to the agent.  I like the idea, though, of making explicit that a commit-time synch is taking place.  This is not apparent at the moment, in the code, making one-time synch logic a bit non-obvious.  I like the approach of sending scheds back as a message as opposed to as a return value.  The quartz job approach is tempting but I think to not rock the boat too much I'll make changes to the current approach.  Thanks.


Comment 5 Joseph Marques 2008-07-08 00:37:13 UTC
If we're targeting RHQ-322 for 1.1, then this should be 1.1 as well.

Comment 6 Jay Shaughnessy 2008-07-15 20:40:47 UTC
I've reverted my propsed server-side changes back to the initial approach which basically just splits out alert template processing from schedule creation. I have made the alert template synch request explicit and have maintained the measurement schedule sync type.  This gives us the most granularity for sync requests.  As such I added an analogous syncAlertTemplatesRecursivley method to InventoryManager.  The alert template sync should get requested only once for a resource tree whereas measurement schedule sync will continue to be requested any time there is a status change for a resource.  The agent can perform the requested syncs as it sees fit for performance.

It is important that the following is understood: A resource has to be merged into inventory prior to requesting schedule sync. schedule sync needs to be completed before applying templates.

With respect to rhq-664 this returns us  to where we were on Friday, the schedules and alert templates are not getting applied to service-level resources. This is due to the fact that at the moment the resource merge is done in layers but the schedule and template synch is done top down as soon as that first layer merge completes (after the initial blocking of performServiceScan in syncInventoryStatusRecursively).  In other words, we're invoking sync for the children of the added roots prior to them being merged into inventory. It needs to happen after child services are discovered and merged. I'll update rhq-664 momentarily.

I'm not yet setting this to resolved as further work may be pending. Depending on how we implement the necessary syncing for the child services we may need to protect against multiple alert template syncs for the same resource.



Comment 7 Jay Shaughnessy 2008-08-11 19:17:35 UTC
r1220 : Changes to inventory sync with respect to actions taken for server requested sync and newlyCommitted resources as opposed to other sync use cases.  server-side sync requests limited to status change and more agent-side logic used to determine actions to take, given more robust syncInfo agent-side. Maintains async approach to deferring schedule creation and alert template application for imported resources, to ensure best performance server-side for large imports.

This may benefit the problem in rhq-322 since the logic behind schedule sync/update is affected by the overall sync logic.

Although tested with the suggested QA steps in the next comment I'm not sending this to QA until we've had it running dev-side for a couple of days.

Comment 8 Jay Shaughnessy 2008-08-11 19:20:17 UTC
Suggested minimal verification steps for QA:

There are a bunch of code-paths to test.  These tests exercise more resource sync than just alert template generation, for example, measurement schedule sync/update.  The sync is somewhat comprehensive and it is important to make sure everything is playing well together.  If these pass be creative to test potentially new scenarios.

These instructions use Postgres server resources, both auto-discovered (local) and manaually added (remote). The
remote postgres server properties I've used:

jon03.qa.atl2.redhat.com
5432
postgres
org.postgresql.Driver
jon
jon
/home/test_jon/postgres/data


1) Install if necessary, start with a clean database

2) Start JON server

3) Start JON agent --clean

4) Create Alert Templates for the following Types and make sure to check the option to apply to resources in inventory.
	- Your platform
	- Postgres Server
	- Postgres Database
	- Postgres Database Table
	- RHQ Agent JVM (because it has no metrics)
	
	** Note, the alerts don't actually have to fire, any alert will do

5) Import your platform, wait for all resources to show up in inventory
	
6) Ensure that the alert templates generated one and only one correct alert definition for relevant resources in inventory

7) Ensure metrics schedules exist for a sampling of imported resources and metrics are collected according to schedule by the agent.

	** resources with and without alert defitions
	** use the default schedules, an update will establish schedules via a different codepath

7) Uninventory the platform

8) Repeat steps 5 through 7

10) Manually add the remote postgres server to inventory

11) Ensure that the alert templates generated one and only one correct alert definition for the new relevant resources in inventory

12) Ensure metrics schedules exist and metrics are being collected by the agent for the new server and its services.

13) Uninventory the remote postgres server

14) Repeat step 10 - 12

15) Update a sampling of metrics schedules and ensure the agent responds correctly and adjusts the collection periods.

16) Restart the agent with --purgedata and see if everything re-establishes correctly, with the same inventory, no duplicate alert templates, and stable metric collection

	** This just tests recovery in case of data directory loss

Other ideas:

- various agent-down situations

- server-level AD import (uninv just a server and then re-import)

- service level automatic import (add more auto-discovered services for servers already in inventory)

Comment 9 Corey Welton 2008-09-04 17:39:32 UTC
Marking this as QA Verified -- inclusive of steps/variations listed in previous comment, I tried several different methods to end up with duplicate alerts, and was unable to do so.

Comment 10 Red Hat Bugzilla 2009-11-10 21:12:10 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-573
This bug relates to RHQ-297
This bug relates to RHQ-322