Bug 1179950

Summary: [scale] Excessive cpu usage in FileStorageDomain.getAllVolumes
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nir Soffer <nsoffer>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA
QA Contact: Eldad Marciano <emarcian>
Severity: high
Docs Contact:
Priority: medium
Version: 3.5.0
CC: aefrat, amureini, bazulay, emarcian, lpeer, lsurette, rbinkhor, tnisan, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1223053 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:28:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Scale
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1177634, 1223053

Description Nir Soffer 2015-01-07 22:09:04 UTC
Description of problem:
When using thousands of disks based on the same template, invoking
FileStorageDomain.getAllVolumes consumes too much CPU time, slowing down
VM startup and recovery flows.

In an initial profile of a host with 24 cores running 20 VMs, we can see
that getAllVolumes dominates the profile (an illustrative sketch of the
pattern follows the profile output). Note that each call to getAllVolumes
took 39 seconds.

[root@host06-rack04 ~]# ./profile-stats -c -scumulative vdsmd.prof 
Sun Jan  4 17:39:11 2015    vdsmd.prof

         24867930 function calls (25536205 primitive calls) in 1450.530 CPU seconds

   Ordered by: cumulative time
   List reduced from 2723 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
89868/89869  361.385    0.004  361.385    0.004 fileSD.py:444(<genexpr>)
  117/118    0.026    0.000  359.611    3.048 dispatcher.py:65(wrapper)
  117/118    0.010    0.000  359.325    3.045 task.py:99(Task.wrapper)
  117/118    0.038    0.000  359.315    3.045 task.py:1155(Task.prepare)
  117/118    0.005    0.000  355.929    3.016 task.py:869(Task._run)
  117/118    0.061    0.001  355.902    3.016 logUtils.py:40(wrapper)
      8/9    0.000    0.000  353.487   39.276 vm.py:2339(Vm.preparePaths)
    24/25    0.001    0.000  353.485   14.139 clientIF.py:267(clientIF.prepareVolumePath)
      8/9    0.057    0.007  353.311   39.257 hsm.py:3188(HSM.prepareImage)
      8/9    0.000    0.000  352.837   39.204 fileSD.py:414(NfsStorageDomain.getAllVolumes)


   Ordered by: cumulative time
   List reduced from 2723 to 20 due to restriction <20>

Function                                       was called by...
                                                   ncalls  tottime  cumtime
fileSD.py:444(<genexpr>)                       <- 89869/89868  361.385  361.385  fileSD.py:414(NfsStorageDomain.getAllVolumes)
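
The hot frame above is a generator expression inside getAllVolumes. As a
purely illustrative sketch (not the actual vdsm fileSD.py code), the
following shows the kind of pattern that produces this profile shape: a
membership test written as a generator expression inside the per-volume
loop rescans the entries collected so far on every iteration, so with
thousands of template-based disks the work grows roughly quadratically and
all of it lands in a single hot <genexpr> frame. The (vol_uuid, img_uuid)
data shape is an assumption made for the sketch.

def build_volumes_slow(vol_img_pairs):
    """vol_img_pairs: hypothetical (volume_uuid, image_uuid) pairs parsed
    from the domain's volume metadata file names."""
    volumes = {}
    for vol_uuid, img_uuid in vol_img_pairs:
        # Membership test via a generator expression over the keys seen so
        # far: evaluated once per pair, it rescans the collection every
        # time, so total work is O(n^2) and cProfile attributes it to a
        # single <genexpr> line, much like fileSD.py:444 above.
        if any(v == vol_uuid for v in list(volumes)):
            volumes[vol_uuid]["imgs"].append(img_uuid)
        else:
            volumes[vol_uuid] = {"imgs": [img_uuid]}
    return volumes
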
Version-Release number of selected component (if applicable):
Tested with vt13.5, but the relevant code was introduced in 2012.

How reproducible:
Always

Steps to Reproduce:
1. Create a file-based storage domain with 3000 template-based disks (one
   way to script this is sketched below)
2. Run 20-40 VMs
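
One way to script the reproduction setup, sketched with ovirt-engine-sdk4
(an assumption; the engine URL, credentials, cluster name "Default" and
template name "scale-template" are placeholders): create thousands of
thin-provisioned VMs from one template so their disks share a base volume
on the file storage domain.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)
vms_service = connection.system_service().vms_service()
for i in range(3000):
    # Each thin-provisioned VM adds a disk based on the same template
    # volume on the file storage domain.
    vms_service.add(types.Vm(
        name="scale-vm-%04d" % i,
        cluster=types.Cluster(name="Default"),
        template=types.Template(name="scale-template"),
    ))
connection.close()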

Actual results:
Flows invoking getAllVolumes are very slow, and vdsm CPU usage is higher than it should be

Expected results:
getAllVolumes should scale to a large number of disks
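
A minimal sketch of the kind of change that scales, reusing the same
hypothetical data shape as the sketch above (an illustration, not the
actual fix that was merged): replace the repeated linear scan with a
constant-time dict lookup so the whole pass stays O(n) no matter how many
disks share a template.

def build_volumes_fast(vol_img_pairs):
    """Same hypothetical input as build_volumes_slow above."""
    volumes = {}
    for vol_uuid, img_uuid in vol_img_pairs:
        # dict setdefault is a hash lookup, so the pass stays O(n)
        # regardless of how many volumes share a template.
        volumes.setdefault(vol_uuid, {"imgs": []})["imgs"].append(img_uuid)
    return volumes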

Testing:

- Measure the time to start a VM with and without this patch - how long
  does it take until the VM is in the "up" state in the engine? (A timing
  sketch follows this list.)
- Measure the time to recover a VM when restarting vdsm - how long does it
  take until the VM is back in the "up" state?
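
A rough sketch for the first measurement, using ovirt-engine-sdk4 (an
assumption; the 3.5/3.6-era SDK differs, and the engine URL, credentials
and VM name "scale-test-vm" are placeholders):

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (placeholder URL and credentials).
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=scale-test-vm")[0]
vm_service = vms_service.vm_service(vm.id)

# Start the VM and poll until the engine reports it as "up".
started = time.time()
vm_service.start()
while vm_service.get().status != types.VmStatus.UP:
    time.sleep(2)
print("VM reached 'up' after %.1f seconds" % (time.time() - started))
connection.close()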

This bug was spawned from bug 1177634.

Comment 4 Eldad Marciano 2016-02-14 20:59:30 UTC
Bug verified on top of a file-based storage domain with ~7K images.
Version 3.6.3.

fileSD shows 43 calls vs ~8K previously, and ~49.3 CPU seconds vs 361 (a
pstats snippet for inspecting the profile follows the output below).
profile output:


Sun Feb 14 22:52:27 2016    /home/emarcian/Desktop/vdsm_3_6_withcpuaffinity_1800sec.prof

         24753697 function calls (26521217 primitive calls) in 159.168 seconds

   Ordered by: cumulative time
   List reduced from 2301 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1284/1337    0.041    0.000  100.695    0.075 threading.py:753(Thread.run)
  120/148    0.002    0.000   91.835    0.621 utils.py:733(wrapper)
     1552    0.037    0.000   58.778    0.038 dispatcher.py:65(wrapper)
     1552    0.015    0.000   58.226    0.038 task.py:100(Task.wrapper)
     1552    0.048    0.000   58.211    0.038 task.py:1155(Task.prepare)
48459/48471    0.727    0.000   57.438    0.001 executor.py:182(_Worker._execute_task)
     1552    0.011    0.000   54.275    0.035 task.py:869(Task._run)
     1552    0.185    0.000   54.265    0.035 logUtils.py:44(wrapper)
        4    0.006    0.001   51.891   12.973 utils.py:930(retry)
        1    0.000    0.000   51.882   51.882 clientIF.py:452(clientIF._recoverThread)
        1    0.004    0.004   51.876   51.876 clientIF.py:459(clientIF._recoverExistingVms)
       45    0.001    0.000   50.515    1.123 vm.py:774(Vm.preparePaths)
       90    0.003    0.000   50.513    0.561 clientIF.py:307(clientIF.prepareVolumePath)
       45    0.290    0.006   50.170    1.115 hsm.py:3195(HSM.prepareImage)
       45    7.606    0.169   49.325    1.096 fileSD.py:432(NfsStorageDomain.getAllVolumes)
      118    0.012    0.000   39.895    0.338 xmlrpc.py:82(SimpleThreadedXMLRPCServer._process_requests)
      118    0.004    0.000   39.684    0.336 SocketServer.py:643(RequestHandler.__init__)
      118    0.002    0.000   39.642    0.336 BaseHTTPServer.py:336(RequestHandler.handle)
      236    0.008    0.000   39.640    0.168 BaseHTTPServer.py:301(RequestHandler.handle_one_request)
      118    0.035    0.000   39.528    0.335 SimpleXMLRPCServer.py:467(RequestHandler.do_POST)
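
For reference, the same views can be produced from the .prof file with the
standard-library pstats module (the ./profile-stats wrapper used earlier is
vdsm-specific), assuming the file is in pstats format as the output above
suggests:

import pstats

# Load the captured profile and reproduce the two views used in this bug:
# top functions by cumulative time, and the callers of getAllVolumes.
stats = pstats.Stats("/home/emarcian/Desktop/vdsm_3_6_withcpuaffinity_1800sec.prof")
stats.sort_stats("cumulative").print_stats(20)
stats.print_callers("getAllVolumes")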

Comment 6 errata-xmlrpc 2016-03-09 19:28:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html