Description of problem:

When using 1000's of disks based on the same template, invoking FileStorageDomain.getAllVolumes consumes too much cpu time, slowing down vm startup and recovery flows.

From an initial profile on a host with 24 cores running 20 vms, we can see that getAllVolumes dominates the profile. Note that each call to getAllVolumes took about 39 seconds.

[root@host06-rack04 ~]# ./profile-stats -c -scumulative vdsmd.prof
Sun Jan  4 17:39:11 2015    vdsmd.prof

         24867930 function calls (25536205 primitive calls) in 1450.530 CPU seconds

   Ordered by: cumulative time
   List reduced from 2723 to 20 due to restriction <20>

   ncalls       tottime  percall  cumtime  percall  filename:lineno(function)
   89868/89869  361.385    0.004  361.385    0.004  fileSD.py:444(<genexpr>)
   117/118        0.026    0.000  359.611    3.048  dispatcher.py:65(wrapper)
   117/118        0.010    0.000  359.325    3.045  task.py:99(Task.wrapper)
   117/118        0.038    0.000  359.315    3.045  task.py:1155(Task.prepare)
   117/118        0.005    0.000  355.929    3.016  task.py:869(Task._run)
   117/118        0.061    0.001  355.902    3.016  logUtils.py:40(wrapper)
   8/9            0.000    0.000  353.487   39.276  vm.py:2339(Vm.preparePaths)
   24/25          0.001    0.000  353.485   14.139  clientIF.py:267(clientIF.prepareVolumePath)
   8/9            0.057    0.007  353.311   39.257  hsm.py:3188(HSM.prepareImage)
   8/9            0.000    0.000  352.837   39.204  fileSD.py:414(NfsStorageDomain.getAllVolumes)

   Ordered by: cumulative time
   List reduced from 2723 to 20 due to restriction <20>

   Function                  was called by...
                                 ncalls       tottime  cumtime
   fileSD.py:444(<genexpr>)  <-  89869/89868  361.385  361.385  fileSD.py:414(NfsStorageDomain.getAllVolumes)

Version-Release number of selected component (if applicable):
Tested with vt13.5, but the relevant code was introduced in 2012.

How reproducible:
Always

Steps to Reproduce:
1. Create a file-based storage domain with 3000 disks
2. Run 20-40 vms

Actual results:
Flows invoking getAllVolumes are very slow, and vdsm cpu usage is higher than it should be.

Expected results:
getAllVolumes should scale to a big number of disks.

Testing:
- Measure the time to start a vm with and without this patch - how much time does it take until the vm is in "up" state in the engine?
- Measure the time to recover a vm when restarting vdsm - how much time does it take until the vm is back in "up" state?

This bug was spawned from bug 1177634.
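For context on why the cost explodes: the profile attributes nearly all of the time to a generator expression inside getAllVolumes that runs roughly once per volume for each image being prepared, so the work grows with (total volumes) x (prepareImage calls). The sketch below is not the VDSM code or the actual patch; it is a minimal illustration of that scaling pattern, assuming a hypothetical layout of <domain>/images/<imgUUID>/<volUUID>.meta, and shows how building an index in one pass replaces the repeated scan.

    import os
    from collections import defaultdict
    from glob import glob

    # Hypothetical on-disk layout (illustration only, not the real VDSM format):
    #   <domain_path>/images/<imgUUID>/<volUUID>.meta

    def list_volume_meta(domain_path):
        """Yield (imgUUID, volUUID) for every metadata file in the domain."""
        pattern = os.path.join(domain_path, "images", "*", "*.meta")
        for meta in glob(pattern):
            img_uuid = os.path.basename(os.path.dirname(meta))
            vol_uuid = os.path.splitext(os.path.basename(meta))[0]
            yield img_uuid, vol_uuid

    def images_using_volume_slow(domain_path, vol_uuid):
        # O(total volumes) generator scan, repeated for every volume of every
        # image being prepared -- the shape of the hot spot in the profile.
        return tuple(img for img, vol in list_volume_meta(domain_path)
                     if vol == vol_uuid)

    def build_volume_index(domain_path):
        # One O(total volumes) pass; later lookups are O(1) dict accesses.
        index = defaultdict(list)
        for img_uuid, vol_uuid in list_volume_meta(domain_path):
            index[vol_uuid].append(img_uuid)
        return index

With ~3000 disks and many volumes prepared per VM start, the repeated scan multiplies into the 39-second calls shown above, while the indexed variant pays the directory walk only once per getAllVolumes call.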
Bug verified on a file-based storage domain with ~7K images, version 3.6.3. fileSD shows 43 calls vs ~8K previously, and CPU time ~49.3s vs 361s.

Profile output:

Sun Feb 14 22:52:27 2016    /home/emarcian/Desktop/vdsm_3_6_withcpuaffinity_1800sec.prof

         24753697 function calls (26521217 primitive calls) in 159.168 seconds

   Ordered by: cumulative time
   List reduced from 2301 to 20 due to restriction <20>

   ncalls       tottime  percall  cumtime  percall  filename:lineno(function)
   1284/1337      0.041    0.000  100.695    0.075  threading.py:753(Thread.run)
   120/148        0.002    0.000   91.835    0.621  utils.py:733(wrapper)
   1552           0.037    0.000   58.778    0.038  dispatcher.py:65(wrapper)
   1552           0.015    0.000   58.226    0.038  task.py:100(Task.wrapper)
   1552           0.048    0.000   58.211    0.038  task.py:1155(Task.prepare)
   48459/48471    0.727    0.000   57.438    0.001  executor.py:182(_Worker._execute_task)
   1552           0.011    0.000   54.275    0.035  task.py:869(Task._run)
   1552           0.185    0.000   54.265    0.035  logUtils.py:44(wrapper)
   4              0.006    0.001   51.891   12.973  utils.py:930(retry)
   1              0.000    0.000   51.882   51.882  clientIF.py:452(clientIF._recoverThread)
   1              0.004    0.004   51.876   51.876  clientIF.py:459(clientIF._recoverExistingVms)
   45             0.001    0.000   50.515    1.123  vm.py:774(Vm.preparePaths)
   90             0.003    0.000   50.513    0.561  clientIF.py:307(clientIF.prepareVolumePath)
   45             0.290    0.006   50.170    1.115  hsm.py:3195(HSM.prepareImage)
   45             7.606    0.169   49.325    1.096  fileSD.py:432(NfsStorageDomain.getAllVolumes)
   118            0.012    0.000   39.895    0.338  xmlrpc.py:82(SimpleThreadedXMLRPCServer._process_requests)
   118            0.004    0.000   39.684    0.336  SocketServer.py:643(RequestHandler.__init__)
   118            0.002    0.000   39.642    0.336  BaseHTTPServer.py:336(RequestHandler.handle)
   236            0.008    0.000   39.640    0.168  BaseHTTPServer.py:301(RequestHandler.handle_one_request)
   118            0.035    0.000   39.528    0.335  SimpleXMLRPCServer.py:467(RequestHandler.do_POST)
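For anyone reproducing the verification, a listing like the one above can also be generated directly from a saved cProfile dump with the standard-library pstats module. This is a generic sketch assuming the .prof file was written by cProfile (the file name is taken from the output above); it is not a description of the profile-stats helper used in the original report.

    import pstats

    # Load the saved profile dump and print the top entries by cumulative
    # time, plus the callers of the getAllVolumes hot spot.
    stats = pstats.Stats("vdsm_3_6_withcpuaffinity_1800sec.prof")
    stats.sort_stats("cumulative")
    stats.print_stats(20)
    stats.print_callers("getAllVolumes")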
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html