Bug 608057

Summary: Perf: Update of measurement schedule for comp/auto group very slow
Product: [Other] RHQ Project
Reporter: Heiko W. Rupp <hrupp>
Component: Core Server
Assignee: Joseph Marques <jmarques>
Status: CLOSED CURRENTRELEASE
QA Contact: Corey Welton <cwelton>
Severity: high
Priority: low
Version: 3.0.0
CC: jmarques
Hardware: All
OS: All
Fixed In Version: 2.4
Doc Type: Bug Fix
Last Closed: 2010-08-12 16:59:52 UTC
Attachments:
- Profiler output for comp group with 1200 resources
- Profiler output for auto group with 1200 resources (timed out after 1074 res)

Description Heiko W. Rupp 2010-06-25 14:31:53 UTC
Description of problem:

Have a compatible group or autogroup of 1000 or more resources.
Go to the Monitor->Schedule tab and change the collection interval for one metric.

This takes many minutes or can even time out, depending on the group size (and perhaps also on agent availability).

This happens on both PostgreSQL and Oracle.

Comment 1 Heiko W. Rupp 2010-06-25 14:53:03 UTC
Created attachment 426904 [details]
Profiler output for comp group with 1200 resources

Comment 2 Heiko W. Rupp 2010-06-25 15:06:45 UTC
Created attachment 426905 [details]
Profiler output for auto group with 1200 resources (timed out after 1074 res)

Comment 3 Heiko W. Rupp 2010-06-25 15:08:38 UTC
The profiler output was for changing one schedule on 1200 resources, all on one agent.


The profiler output shows we need to:
- combine the update queries into larger batches of work
- not ping the agent for each schedule, but batch the updates per agent and send one batch per agent

Comment 4 Joseph Marques 2010-06-25 18:10:34 UTC
Heiko, that's strange. I recall making updates to this subsystem a few years ago to do just that: batch all of the updates so that we only had to call out to each agent once. I'm going to investigate when those changes were made and why they aren't kicking in here.

Comment 5 Joseph Marques 2010-06-25 18:41:10 UTC
OK, so I *did* correctly recall that I put logic in the SLSB to batch the updates to the agent, and I'm glad to see it's still in the MeasurementScheduleManager today. However, even though the raw ability to batch is there, it's not being used as well as it could be.

Right now, the circle is artificially drawn around each resource. So even if you update ALL of the schedules for a single resource, they will all make it to the agent in a single request. But if you update multiple resources at a time, each of those resources triggers a separate call out to its agent. Luckily, the way the API is written, this will only require minimal tweaks to use the batching mechanism to its fullest extent.
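To illustrate the fix described above, here is a minimal sketch (not the actual RHQ code; the class, the `agentFor` mapping, and all names are hypothetical) of grouping resource IDs by their hosting agent so that one batched call per agent replaces one call per resource:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchByAgent {

    // Hypothetical stand-in for the resource->agent lookup;
    // in RHQ this mapping would come from the database.
    static int agentFor(int resourceId) {
        return resourceId % 3;
    }

    // Group resource IDs by agent so the server makes one call out
    // per agent instead of one per resource.
    static Map<Integer, List<Integer>> groupByAgent(List<Integer> resourceIds) {
        Map<Integer, List<Integer>> byAgent = new HashMap<>();
        for (int id : resourceIds) {
            byAgent.computeIfAbsent(agentFor(id), k -> new ArrayList<>()).add(id);
        }
        return byAgent;
    }

    public static void main(String[] args) {
        List<Integer> resources = new ArrayList<>();
        for (int i = 0; i < 1200; i++) {
            resources.add(i);
        }
        // 1200 resources, but only 3 distinct agents -> 3 call outs.
        System.out.println(groupByAgent(resources).size()); // prints 3
    }
}
```

With the per-resource circle, the loop above would instead trigger 1200 separate agent round trips.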

Comment 6 Heiko W. Rupp 2010-06-25 18:50:20 UTC
I did not measure with many agents, but in the one-agent case the repeated round trips to the DB seem to kill performance.
The two attached screenshots from a profiling session show the weak spots.

Comment 7 Joseph Marques 2010-06-25 18:57:01 UTC
Yes, these are two birds that will be killed with one stone. If we change what is getting batched, the number of calls to the batch API will be reduced, and the number of roundtrips to the DB will decrease by the same magnitude as the number of roundtrips to the agent.

Comment 8 Joseph Marques 2010-06-27 22:18:42 UTC
commit 2d3f784933a13f1049b78186adacf3a88ae60e58
Author: Joseph Marques <joseph>
Date:   Sun Jun 27 18:15:39 2010 -0400

    BZ-608057: improve performance of measurement schedules updates
    
    * refactor workflow to batch updates per agent, not per resource
    ** if X resources across Y agents, fix yields (X-Y) LESS roundtrips to DB
    ** if X resources across Y agents, fix yields (X-Y) LESS roundtrips between server and agent
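The (X-Y) arithmetic in the commit message can be checked against the numbers from the attached profiler runs. A sketch (the class and method names are mine, not RHQ's):

```java
public class RoundtripMath {

    // X resources spread across Y agents.
    static int savings(int x, int y) {
        int perResourceRoundtrips = x; // old workflow: one call out per resource
        int perAgentRoundtrips = y;    // new workflow: one call out per agent
        return perResourceRoundtrips - perAgentRoundtrips; // (X - Y) fewer roundtrips
    }

    public static void main(String[] args) {
        // The profiler runs used 1200 resources, all on one agent:
        System.out.println(savings(1200, 1)); // prints 1199
    }
}
```

The same saving applies twice: once to the DB roundtrips and once to the server-agent roundtrips.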

Comment 9 Joseph Marques 2010-06-27 22:24:11 UTC
This bug has the exact same reproduction procedures as were detailed by Ian Springer here:

https://bugzilla.redhat.com/show_bug.cgi?id=535283#c3

This bug was fixed in conjunction with functional defects in this subsystem, detailed here:

https://bugzilla.redhat.com/show_bug.cgi?id=608487

Thus, verifying either of these bugs actually verifies both of them.

----

Heiko, aside from correctness, which can be verified by QA, I'd like to see what kind of improvement this yields in the performance environment.

Comment 10 Corey Welton 2010-07-09 13:58:12 UTC
QA Closing/Verified.

Comment 11 Corey Welton 2010-08-12 16:59:52 UTC
Mass-closure of verified bugs against JON.