Description of problem:
Have a (compatible or auto) group of 1000 resources or more.
Go to Monitor->Schedule tab and change the collection interval for 1 metric.
This will take many minutes or even time out, depending on the group size (and perhaps also on agent availability).
This happens on both PostgreSQL and Oracle.
Created attachment 426904
Profiler output for comp group with 1200 resources
Created attachment 426905
Profiler output for auto group with 1200 resources (timed out after 1074 res)
Profiler output was for changing one schedule on 1200 resources - all on one agent
Profiler output shows we need to
- combine the update queries to larger batches of work
- not ping the agent for each schedule, but batch them together per agent and then send one batch per agent.
Heiko, that's strange. I recall making updates to this subsystem a few years ago to do just that: to batch all of the updates so that we only had to call out to each agent once. I'm going to investigate this and see when those changes were added, and why they aren't kicking in here.
OK, so I *did* correctly recall that I put logic in the SLSB to batch the updates to the agent, and I'm glad to see it's still in the MeasurementScheduleManager today. However, even though the raw ability to batch is there, it's not being used as well as it could be.
Right now, the circle is artificially drawn around each resource. So if you update ALL of the schedules for a single resource, they will all make it to the agent in a single request. But if you update multiple resources at a time, each resource results in a separate call-out to the agent. Luckily, the way the API is written, this will only require minimal tweaks to correct and use the batching mechanism to its fullest extent.
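To illustrate the regrouping described above, here is a rough sketch in plain Java. It is not the actual MeasurementScheduleManager code: the names `AgentBatchingSketch`, `ResourceRef`, `groupByAgent` and the agent name "agent-1" are made up for this example, and it assumes each resource is managed by exactly one agent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AgentBatchingSketch {

    /** Hypothetical pairing of a resource id with the agent that manages it. */
    static final class ResourceRef {
        final int resourceId;
        final String agentName;

        ResourceRef(int resourceId, String agentName) {
            this.resourceId = resourceId;
            this.agentName = agentName;
        }
    }

    /**
     * Draws the circle around each agent instead of each resource:
     * every map entry stands for one request to one agent.
     */
    static Map<String, List<Integer>> groupByAgent(List<ResourceRef> resources) {
        Map<String, List<Integer>> batches = new LinkedHashMap<>();
        for (ResourceRef ref : resources) {
            batches.computeIfAbsent(ref.agentName, k -> new ArrayList<>())
                   .add(ref.resourceId);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Same shape as the profiled case: 1200 resources, all on one agent.
        List<ResourceRef> group = new ArrayList<>();
        for (int id = 1; id <= 1200; id++) {
            group.add(new ResourceRef(id, "agent-1"));
        }

        int perResourceCalls = group.size();             // one call-out per resource
        int perAgentCalls = groupByAgent(group).size();  // one call-out per agent

        System.out.println("per-resource call-outs: " + perResourceCalls);
        System.out.println("per-agent call-outs:    " + perAgentCalls);
    }
}
```

With X resources spread across Y agents, the per-resource approach makes X call-outs and the per-agent grouping makes Y, so the difference is X-Y (1199 in the one-agent case above).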
I did not measure with many agents, but in the one agent case, the repeated round trips to the DB seem to kill performance.
The two attached screen shots from a profiling session show the weak spots.
Yes, these are two birds that will be killed with one stone. If we can change what is getting batched, then (the magnitude of) the number of calls out to the batch API will be reduced. This will cause the number of roundtrips to the DB to decrease by the same magnitude as the number of roundtrips to the agent.
Author: Joseph Marques <email@example.com>
Date: Sun Jun 27 18:15:39 2010 -0400
BZ-608057: improve performance of measurement schedules updates
* refactor workflow to batch updates per agent, not per resource
** if X resources across Y agents, fix yields (X-Y) LESS roundtrips to DB
** if X resources across Y agents, fix yields (X-Y) LESS roundtrips between server and agent
This bug has the exact same reproduction procedures as were detailed by Ian Springer here:
This bug was fixed together with functional defects in this subsystem, detailed here:
Thus, verifying either of these bugs actually verifies both of them.
Heiko, aside from correctness, which can be verified by QA, I'd like to see what kind of improvement this yields in the performance environment.
Mass-closure of verified bugs against JON.