Description of problem:
As part of the start-VM investigation, the method taking the longest is getVdsToRunOn() = SchedulingManager.schedule() (see the breakdown tree). getVdsToRunOn() runs for 39 seconds in a large-scale setup. Preloading some of the maps, or using a different query to get smaller maps or lists to run the filters on, may improve performance (a sketch of the idea follows below).

Version-Release number of selected component (if applicable):
3.4.2

How reproducible:
100%

Steps to Reproduce:
1. Set up 500 hosts and 10K VMs (using vdsmfake).
2. Start a single VM; it takes ~80 seconds.

Actual results:
getVdsToRunOn() running time is ~30+ seconds.

Expected results:
Faster results, less than a few seconds.

Additional info:
RunVmCommand hotspot
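To illustrate the suggested direction (preload the host data once and run all filters on the in-memory list, instead of issuing per-filter DB queries), here is a minimal Java sketch. The names (Host, HostDao, PreloadedScheduler) are hypothetical stand-ins, not actual engine code:

import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-ins for the engine's host entity and DAO.
record Host(String id, String status) {}

interface HostDao {
    List<Host> getAllForClusterWithStatus(String clusterId, String status);
}

class PreloadedScheduler {
    private final HostDao hostDao;

    PreloadedScheduler(HostDao hostDao) {
        this.hostDao = hostDao;
    }

    // Load candidate hosts once, then let every filter work on the
    // in-memory list -- no per-filter DB round trips.
    Host getVdsToRunOn(String clusterId, List<Predicate<Host>> filters) {
        List<Host> candidates = hostDao.getAllForClusterWithStatus(clusterId, "Up");
        for (Predicate<Host> filter : filters) {
            candidates = candidates.stream().filter(filter).toList();
            if (candidates.isEmpty()) {
                return null; // no host satisfies all constraints
            }
        }
        return candidates.get(0); // selection/weighting omitted for brevity
    }
}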
Eldad, can you please specify which scheduling policy you used, and whether it was a default one or something you created?
The scheduling policy was 'Evenly Distributed' (the default).
There is a problem running any query against the DB in his environment. The scheduling process simply issues many queries, which is why it takes so long.
We monitor DB queries that take longer than 3 seconds. Currently there is just one function that takes long, and it's getdisckvmguid. Is it related to scheduling? Which queries does scheduling run?
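For reference, a minimal sketch of the kind of slow-query monitoring described above, done on the Java side (illustrative only; SlowQueryMonitor and the 3-second threshold are assumptions, not existing engine code):

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class SlowQueryMonitor {
    private static final long THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(3);

    // Wrap any DAO call; log it if it exceeds the threshold.
    public static <T> T timed(String queryName, Supplier<T> query) {
        long start = System.nanoTime();
        try {
            return query.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            if (elapsed > THRESHOLD_NANOS) {
                System.err.printf("Slow query %s: %d ms%n",
                        queryName, TimeUnit.NANOSECONDS.toMillis(elapsed));
            }
        }
    }
}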
Gilad, as we discussed offline, the stored procedure 'getVdsForVdsGroupWithStatus' is not the problem; its run time is 20 milliseconds.
Since several things have changed, I suggest we revisit this for 3.6.
Created attachment 1127964 [details] 224 hosts, start 50 VMs unpatched
Created attachment 1127965 [details] 324 hosts, start 50 VMs unpatched
Created attachment 1127966 [details] 324 hosts, start 50 VMs with patch 53644
Created attachment 1136501 [details] Start 50 VMs, 10 in parallel, 300 hosts, no fix
Created attachment 1136502 [details] Start 50 VMs, 10 in parallel, 300 hosts, fix 53644
Created attachment 1136503 [details] Start 50 VMs, 10 in parallel, 300 hosts, fixes 53644, 54720
Created attachment 1136521 [details] Start 50 VMs, ALL in parallel, 300 hosts, fixes 53644, 54720
Created attachment 1136562 [details] Start 50 VMs, 10 in parallel, 300 hosts, fixes 53644, 54720
Created attachment 1136578 [details] Start 50 VMs, SEQUENTIAL, 300 hosts, fixes 53644, 54720
Roman, let's summarize: your fixes improve the critical section of the schedule() method, and we need to merge them. If your finding is that the call to VdsBrokerCommands, namely communication with VDS, is slow because of ovirt-fakevdsm, then we will take all the fixes mentioned above and move the bug to MODIFIED. Any other issue will require a new bug.
Moving the query outside of the lock needs a bit more attention. Postponing to 3.6.7.
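The idea behind moving the query outside of the lock, as a hedged sketch (names are illustrative; the real change is in the gerrit patches referenced in the next comment): fetch the candidate hosts before entering the per-cluster lock, so the DB round trip no longer serializes concurrent schedule() calls. The caveat, and the reason it needs more attention, is that the host list may be slightly stale by the time the lock is acquired.

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class ClusterScheduler {
    private final ReentrantLock clusterLock = new ReentrantLock();

    // Illustrative only: query before taking the lock, keep the
    // critical section limited to in-memory filtering/selection.
    String schedule(String clusterId) {
        List<String> hosts = loadCandidateHosts(clusterId); // DB call outside the lock
        clusterLock.lock();
        try {
            return pickBestHost(hosts); // pure in-memory work under the lock
        } finally {
            clusterLock.unlock();
        }
    }

    private List<String> loadCandidateHosts(String clusterId) {
        return List.of("host1", "host2"); // placeholder for the real DAO call
    }

    private String pickBestHost(List<String> hosts) {
        return hosts.isEmpty() ? null : hosts.get(0);
    }
}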
Here are the final results:

1) NO FIX APPLIED: 300 hosts, 50 VMs to be scheduled, max. 10 in parallel

The image in [3] shows:
Mean execution time: 8.8 seconds
Median execution time: 9 seconds
99th percentile: 16 seconds
Request rate for the whole RunVmCommand: 0.7 requests/second

When we calculate the execution time without the mutex wait time, we get an average execution time of 1.24 seconds, which is about 0.8 requests per second. In [3] we see an overall throughput of 0.7 requests/second. The two request rates are so close that in this scenario it is hard to tell whether the scheduler is the bottleneck. We will see that it is not once we apply the first patch [1].

2) FIX [1] APPLIED: 300 hosts, 50 VMs to be scheduled, max. 10 in parallel

Image [4] shows:
Mean execution time: 2.7 seconds
Median execution time: 2.6 seconds
99th percentile: 6 seconds
Request rate for the whole RunVmCommand: 0.7 requests/second

With the mutex wait time excluded, the average execution time is 0.38 seconds, which is about 2.6 requests per second. The overall request rate for the RunVmCommand is still 0.7 requests/second. This proves that the scheduler can't be the bottleneck.

3) FIX [1] AND [2] APPLIED: 300 hosts, 50 VMs to be scheduled, max. 10 in parallel

Image [5] shows:
Mean execution time: 0.92 seconds
Median execution time: 0.86 seconds
99th percentile: 2.21 seconds
Request rate for the whole RunVmCommand: 0.7 requests/second

Here, with the mutex wait time excluded, the average execution time is 0.13 seconds, which is about 7.7 requests per second. The overall request rate for the RunVmCommand is still 0.7 requests/second.

To sum up: with the already merged patch [1], the scheduler is about 4 times faster and some DB call loops are removed. With [2] applied it is about 10 times faster, but that also requires architectural changes which are, from my perspective, too risky for 3.6. Even when the scheduler was 10 times faster, we always had a throughput of 0.7 requests/second, which proves that the scheduler is not the cause of the problem.

[1] http://gerrit.ovirt.org/54720
[2] http://gerrit.ovirt.org/53644
[3] https://bugzilla.redhat.com/attachment.cgi?id=1136501
[4] https://bugzilla.redhat.com/attachment.cgi?id=1136502
[5] https://bugzilla.redhat.com/attachment.cgi?id=1136503
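For reference, a sketch of how the "execution time without the mutex wait time" could be measured (illustrative instrumentation under assumed names, not what was actually used): take timestamps before lock acquisition, after acquisition, and after the critical section, and aggregate the two intervals separately. The per-slot request rate quoted above is then roughly 1 / meanExecSeconds.

import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

class SchedulerTimer {
    private final ReentrantLock lock = new ReentrantLock();
    private final LongAdder waitNanos = new LongAdder();
    private final LongAdder execNanos = new LongAdder();
    private final LongAdder samples = new LongAdder();

    void timedSchedule(Runnable criticalSection) {
        long enqueued = System.nanoTime();
        lock.lock();
        long acquired = System.nanoTime();
        try {
            criticalSection.run();
        } finally {
            long done = System.nanoTime();
            lock.unlock();
            waitNanos.add(acquired - enqueued); // mutex wait time
            execNanos.add(done - acquired);     // execution time only
            samples.increment();
        }
    }

    // Mean execution time excluding lock wait; throughput ~= 1 / meanExecSeconds.
    double meanExecSeconds() {
        return execNanos.sum() / 1e9 / samples.sum();
    }
}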
Verified on top of 3.6.8 with 500 hosts and 9.9K VMs.

Timing:

2016-07-18 09:45:49,694 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-43) [9654d5f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[fbe5b76d-7de8-4f24-b232-d0f764302ed9=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-07-18 09:45:58,228 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-45) [9654d5f] Lock freed to object 'EngineLock:{exclusiveLocks='[fbe5b76d-7de8-4f24-b232-d0f764302ed9=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

The lock is held for about 8.5 seconds instead of ~80, so it seems about 10 times faster. It also seems we can speed it up even more with: https://gerrit.ovirt.org/#/c/54720/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1507.html