Description of problem:
The 'getAllVmStats' method is running very slowly (~22 sec), with wide impact: it hurts both vdsm and engine performance.

Version-Release number of selected component (if applicable): 3.5 VT 13.5

How reproducible: 100%

Steps to Reproduce:
1. run 66 vms
2. run vdsClient -s 0 getAllVmStats

Actual results: slow statistics scan.

Expected results: fast results.

Additional info:
Some findings related to this bug: the actual API call is delayed due to some resource starvation. In the following strace we can see a system wait call while using vdsClient:

access("/usr/bin/python", R_OK)         = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fedaff369d0) = 10817
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43d2f0, [], SA_RESTORER, 0x3335c326a0}, {SIG_DFL, [], SA_RESTORER, 0x3335c326a0}, 8) = 0
wait4(-1,   # here we wait up to 4 sec, and sometimes even more, like 10 sec

Using cProfile, delays can be seen in the following:

412823 function calls (411408 primitive calls) in 20.331 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      6    0.000    0.000   19.535    3.256 socket.py:409(readline)
      1    0.000    0.000   20.166   20.166 vdsClient.py:466(do_getAllVmStats)
      1    0.000    0.000   20.157   20.157 xmlrpclib.py:1479(__request)
      1    0.036    0.036   20.337   20.337 {execfile}

Further investigation required.
Created attachment 974121 [details] python profiler results
Created attachment 974122 [details] strace log
Eldad,
When opening a bug please give all the relevant environmental details:
1 - system you are running it on (1 engine, 1 host? hosted engine?)
2 - the title implies that hooks delay the response; what hooks are installed? what is the time when no hooks are used?
3 - vdsm logs?
4 - is this a new test (only 3.5?) or do you have the numbers also on 3.4?
1. Setup distribution:
- 1 engine
- 1 cluster \ sd
- 37 hosts
- 3K vms.
Host spec (engine + hosts): 24 cores, 64GB RAM.

2. About the hooks: I meant the hooks.py module; it is used by the API module in some of the basic actions like 'getAllVmStats'.

3. There are no errors in the logs; that is why I didn't place them here.
4. It's not a new test; we are just using real (physical) hardware.
(In reply to Eldad Marciano from comment #6)
> 1. Setup distribution:
> - 1 engine
> - 1 cluster \ sd
> - 37 hosts
> - 3K vms.
>
And the issue occurs on all hosts?

> Host spec (engine + hosts): 24 cores, 64GB RAM.
>
> 2. About the hooks: I meant the hooks.py module; it is used by the API
> module in some of the basic actions like 'getAllVmStats'.
>
But how do you know that the reason for the performance hit is this specific module? Are there any hooks applied?

> 3. There are no errors in the logs; that is why I didn't place them here.

Logs are important anyhow.
I used JProfiler on the engine side to understand why starting vms gets slow during our ramp-up process. The vdsbroker.getallvmstat.VDSCommand runs ~20 sec, and so do other vds API methods. On the vdsm side I used vdsClient to trigger getAllVmStats while running the python profiler, and the problem can clearly be seen. Bottom line: both the engine and vdsm profilers point to long execution time on the vdsm side. I suggest we set up a debug session together since the logs are huge.
O.K., so this is an issue that was identified by you as a problem (66 vms with the above configuration on a single host) - getAllVmStats takes 22 sec. IIUC you have already used the python profiler to identify that hooking.py plays a significant role in the delay (assuming no hooks were actually deployed). Then there is some other info I would like to see/know:
1 - profile of a single VM (mem, disks, nics, OS, with guest agent?)
2 - can you share the entire single-flow profiler output (assuming the engine did not hammer that vdsm with additional API calls at the time)?
3 - what were the results for this test on 3.4?
4 - what was the load (let's start with load average & mem consumption) on the hypervisor at that time?
1. Single vm profile: 1 CPU, 256 MB RAM, 10GB disk (thin provisioned), OS RHEL 6.5 with guest agent.

2. Already done: strace of 'vdsClient -s 0 getAllVmStats', and the python profiler results are attached to the bug. As you suggested, the engine was shut down to avoid other API calls, and I get the same results. I have added some more logging in the code to understand the flow better.

Results for 'time vdsClient -s 0 getAllVmStats' (with no engine running):
...
real    0m22.298s  #total time
user    0m0.820s
sys     0m0.157s

By the logs, the actual API call time is 6 sec:
Thread-198511::DEBUG::2014-12-31 09:31:36,298::API::1296::vds::(getAllVmStats) STATLOG: total by 6.83962011337 sec
Thread-198511::DEBUG::2014-12-31 09:31:36,307::BindingXMLRPC::487::vds::(getAllVmStats) STATLOGFROMXML getAllVmStats time is 6.852

So ~16 more seconds are spent waiting (see the strace output I mentioned above). Another thing is that the results are inconsistent (we always have delays of anywhere between a few seconds and ~22 sec).

3. We don't have a 3.4 setup, so I can't tell what the status is there. We also tested 3.4 via a simulator; this bug shows up on real hardware.

4. Load statistics: load average ~20%, memfree ~40GB.
Nir, can you please take a look and tell us where the vdsm time was spent? Keep in mind that when using vdsClient, the protocol used is XML-RPC.
getAllVmStats always took some time to return when running 20+ vms, or even fewer. It gathers many details about the vms that are retrieved partially from libvirt, so running 50+ vms will obviously affect the time. We already tried to improve the timing for that verb, but this is not something new.
This is very bad. As part of the scale effort we should handle more than 100 vms per host in reasonable time.

See the following timing results for 66 vms:

]# time vdsClient -s 0 getAllVmStats &> /dev/null

real    0m43.798s
user    0m1.044s
sys     0m0.139s

By the logs it looks like this:

tail -f /var/log/vdsm/vdsm.log |grep 'TRACEHOOKS\|STATLOG' |grep -v 'GuestMonitor'
Thread-233::DEBUG::2015-01-01 18:10:29,461::API::1281::vds::(getAllVmStats) STATLOG: before_get_all_vm_stats takes 0.583 sec
Thread-233::DEBUG::2015-01-01 18:10:39,266::API::1290::vds::(getAllVmStats) STATLOG: collect vms statistics 1 by 1 takes 7.496 sec
Thread-233::DEBUG::2015-01-01 18:10:40,331::API::1294::vds::(getAllVmStats) STATLOG: getAllVmStats pars statistics runs by 0.802846908569 sec
Thread-233::DEBUG::2015-01-01 18:10:40,614::API::1296::vds::(getAllVmStats) STATLOG: total by 11.7333049774 sec
Thread-233::DEBUG::2015-01-01 18:10:40,896::BindingXMLRPC::487::vds::(getAllVmStats) STATLOGFROMXML getAllVmStats time is 12.023

As you can see, the actual vm data collection + parsing takes only 12 sec (which is not so fast), while the overall vdsClient time is 43 sec. We have a huge gap.
Eldad, what we need to do is profile vdsm - not vdsClient - because we want to understand vdsm performance, not vdsClient performance.

Profiling with cProfile is useless here because vdsm runs many threads, and cProfile profiles only the thread where the profiler was started. You need to install yappi on the host you want to profile. The best way is using pip (you may need to install pip first):

pip install yappi

Then you should enable profiling in the vdsm configuration:

# /etc/vdsm/vdsm.conf
[vars]
profile_enable = true
profile_builtins = true
profile_clock = cpu

And stop vdsm.

To create a profile:
1. start vdsm
2. start the vms you need for the tests
3. run vdsm under normal conditions from engine for 5-10 minutes
4. stop vdsm

The profile is saved in /var/run/vdsm/vdsmd.prof - copy this file elsewhere because it will be overwritten the next time vdsm stops.

Disable profiling before starting vdsm again:

[vars]
profile_enable = false

Please try to get this profile and upload the file here. From the profile we can extract useful statistics. The easiest way is using contrib/profile-stats from the vdsm source.
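If contrib/profile-stats is not at hand, the saved profile can also be inspected with Python's standard pstats module. This is a minimal sketch, assuming the file vdsm writes is pstats-compatible (which is how the reports later in this bug were produced); the file name comes from the path above:

# Minimal sketch: read a copied vdsm profile with the standard pstats module.
# Assumes the file saved by vdsm (/var/run/vdsm/vdsmd.prof) is in a
# pstats-compatible format.
import pstats

stats = pstats.Stats("vdsmd.prof")
stats.strip_dirs()               # drop long path prefixes
stats.sort_stats("cumulative")   # or "time" for internal time
stats.print_stats(20)            # top 20 entries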
From an initial profile, we have one issue: an O(N^2) search in NfsStorageDomain.getAllVolumes.

[root@host06-rack04 ~]# ./profile-stats -c -scumulative vdsmd.prof
Sun Jan 4 17:39:11 2015    vdsmd.prof

24867930 function calls (25536205 primitive calls) in 1450.530 CPU seconds

Ordered by: cumulative time
List reduced from 2723 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
89868/89869  361.385    0.004  361.385    0.004 fileSD.py:444(<genexpr>)
  117/118    0.026    0.000  359.611    3.048 dispatcher.py:65(wrapper)
  117/118    0.010    0.000  359.325    3.045 task.py:99(Task.wrapper)
  117/118    0.038    0.000  359.315    3.045 task.py:1155(Task.prepare)
  117/118    0.005    0.000  355.929    3.016 task.py:869(Task._run)
  117/118    0.061    0.001  355.902    3.016 logUtils.py:40(wrapper)
      8/9    0.000    0.000  353.487   39.276 vm.py:2339(Vm.preparePaths)
    24/25    0.001    0.000  353.485   14.139 clientIF.py:267(clientIF.prepareVolumePath)
      8/9    0.057    0.007  353.311   39.257 hsm.py:3188(HSM.prepareImage)
      8/9    0.000    0.000  352.837   39.204 fileSD.py:414(NfsStorageDomain.getAllVolumes)

Ordered by: cumulative time
List reduced from 2723 to 20 due to restriction <20>

Function                   was called by...
                               ncalls       tottime  cumtime
fileSD.py:444(<genexpr>) <-    89869/89868  361.385  361.385  fileSD.py:414(NfsStorageDomain.getAllVolumes)

In getAllVolumes, if all images are created from the same template, the hot generator expression is equivalent to this:

for metaPath in volMetaPaths:
    if len(tuple(vPath for vPath in volMetaPaths if imgUUID in vPath)) > 1:
        # Add template additional image
        volumes[volUUID]['imgs'].append(imgUUID)
    else:
        # Insert at head the template self image
        volumes[volUUID]['imgs'].insert(0, imgUUID)

Eldad:
- How many disks do we have on the same storage domain?
- Are all disks created from the same template?
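To make the O(N^2) pattern concrete, here is a hedged sketch of one way to avoid it - this is only an illustration, not the actual patch tracked later for this issue. It assumes, for the sake of the example, that the image UUID is the parent directory name of each volume metadata path:

# Sketch only: count how many volume metadata paths belong to each image in a
# single pass, so the per-volume "is this template shared?" test becomes an
# O(1) dict lookup instead of rescanning volMetaPaths for every volume.
# Assumption (for illustration): the image UUID is the parent directory of
# each metadata path.
import os
from collections import Counter

def build_image_path_counts(volMetaPaths):
    return Counter(os.path.basename(os.path.dirname(p)) for p in volMetaPaths)

# Later, inside the per-volume loop, instead of the quadratic genexpr:
#     if image_counts[imgUUID] > 1:
#         volumes[volUUID]['imgs'].append(imgUUID)     # template's additional image
#     else:
#         volumes[volUUID]['imgs'].insert(0, imgUUID)  # template self image first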
Yep
Created attachment 976456 [details] 20 vms CPU TIME
Created attachment 976457 [details] 64 vms CPU time
Created attachment 976458 [details] 20 vms WALL TIME
Created attachment 976459 [details] 64 vms WALL TIME
Adding some profiling results. There are two types of profiling: CPU and WALL time. In the WALL time profile, some wait/sleep methods can be seen, which probably cause trouble.
[Tentatively targeting 3.5.1 - once the patch is up for master, let's re-discuss whether it's worth the risk/effort for 3.5.1/2/3].
Eldad, can you test this patch on one of your hosts? I would like to see profiling results with this patch - I expect that getAllVolumes will disappear from the profile, and we will have less noise when checking why getAllVmStats is so slow.
Looking at the 20 VM profile I can say:
- Nir's getAllVolumes issue
- XML filtering in the guest agent is very inefficient (per character; see the sketch below)
- MOM XML parsing eats a lot
- no particular problem with getAllVmStats itself; threading synchronization seems to take its toll even at 20 VMs
- JSON-RPC parsing takes some time, but that's understandable

Nir, can you please help me understand the numaUtils.py:22(<module>) entry?
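On the per-character guest agent filtering mentioned above: for illustration only, a hedged sketch of how a single compiled regex pass could replace a per-character generator expression. The character ranges are the XML 1.0 valid set; this is not the actual fix (that is tracked separately below), just an example of the idea:

# Sketch only: drop invalid XML 1.0 characters in one regex pass instead of
# joining a per-character generator expression.
import re

_INVALID_XML_CHARS = re.compile(u'[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]')

def filter_xml_chars(u):
    # replace each invalid character with a placeholder in a single pass
    return _INVALID_XML_CHARS.sub(u'?', u)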
Identified an issue with _getNumaStats, which does XML parsing via superVdsm every single time we do a stats get() (similar to a balloonInfo problem not that long ago). Still, I'm curious why it is showing up like that in the profiling data :)
(In reply to Michal Skrivanek from comment #26)
> identified an issue with _getNumaStats which does xml parsing via superVdsm
> every single time we do stats get() (similar to a balloonInfo problem not
> that long time ago). Still I'm curious why is it showing like that in the
> profiling data:)

The "numaUtils.py:22(<module>)" entry may be a bug in yappi or the pstats module. We had similar issues on Python 2.6, see https://code.google.com/p/yappi/issues/detail?id=45

The best way to handle this is to profile also on Python 2.7, where we may see correct output.
Checking the scale setup with 64 vms and patch http://gerrit.ovirt.org/36593, we can see that vdsm is completely overloaded. In this state, running 150-175 threads, anything will be barely usable. We see vdsm cpu usage typically at 1000-1500% on 24-core machines, which consumes about 50% of system cpu time. With non-idle vms, vdsm response time will be much worse. getAllVmStats becoming slow is just a side effect of this overloading.

To continue with this, please do:

1. Find the number of vms that keeps vdsm using about 75% cpu time; looking at a machine running 20 vms, this should be around 40 vms.

2. Do a cpu and a wall time profile under this load for the same duration (e.g. 30 minutes). You can script this:

service vdsmd start
sleep 1800
service vdsmd stop
mv /var/run/vdsm/vdsmd.prof vdsm-test-name.prof

3. Upload both profiles here, or maybe open a new bug for each issue found, as the getAllVolumes issue is resolved.

Repeat this test with:
- rhev 3.4.4 - to detect regressions since the last release
- rhev 3.5 using xmlrpc - to detect regressions related to jsonrpc

Note that all profiles must use the same duration and workload, so we can compare them, for example: 40 vms, 30 minutes, rhev-3.4, rhev-3.5-jsonrpc, rhev-3.5-xmlrpc.

Please keep the needinfo flag until we have these profiles.
Created attachment 976919 [details] 82 vms CPU
Created attachment 976920 [details] 82 vms WALL
(In reply to Eldad Marciano from comment #29)
> Created attachment 976919 [details]
> 82 vms CPU

Is this with the NUMA stats commented out and cleaned MOM policies? What was the overall system load and vdsm CPU usage?
(In reply to Eldad Marciano from comment #29)
> Created attachment 976919 [details]
> 82 vms CPU

Eldad, did you read comment 28? Please keep the needinfo flag until we have *all* the information requested.
(In reply to Michal Skrivanek from comment #26)
> identified an issue with _getNumaStats which does xml parsing via superVdsm
> every single time we do stats get() (similar to a balloonInfo problem not
> that long time ago). Still I'm curious why is it showing like that in the
> profiling data:)

Indeed the NUMA code is, let's say, very near the top of the chart of code in need of cleanup and optimization (and yes, it is mostly my fault that it was accepted in its current form). I also wonder why it only started to show up in the profiles so late.

As a short-term fix, which I believe is also feasible for 3.5.x, I crafted http://gerrit.ovirt.org/#/c/36626/
Just to clarify the noise in here: this bug is about the getAllVmStats critical path, which means the only relevant issue for this flow now looks like the inefficient MOM (_getNumaStats) part. Eldad, we need to see whether getAllVmStats improves when _getNumaStats is commented out. Ignore the other issues for now; they will require separate bugs.
The guest agent filtering is now being tracked as bug 1179696. It is not relevant to the getAllVmStats responsiveness; also, there are no known reports from the field and the code is unchanged since about 3.3.z.
By commenting out the NUMA stats invocation, the results are the same for 82 running vms (the following patch http://gerrit.ovirt.org/#/c/36626/ was applied too):

Wed Jan 7 14:27:52 2015    vdsmd_82vms_numacommented_fileSD_numUtils_patched_cpu.prof

49280718 function calls (51228879 primitive calls) in 17517.372 CPU seconds

Ordered by: internal time
List reduced from 2772 to 1000 due to restriction <1000>

     ncalls   tottime  percall   cumtime  percall filename:lineno(function)
19224/19358  1789.695    0.093  6271.835    0.324 threading.py:481(Thread.run)
    1890704  1734.238    0.001  1763.005    0.001 numaUtils.py:22(<module>)
      41892  1341.277    0.032  3766.584    0.090 vm.py:2849(Vm._getRunningVmStats)
2116685/2116903  590.606    0.000  2356.711    0.001 guestagent.py:80(<genexpr>)
17556/581334  534.025    0.030  3128.514    0.005 guestagent.py:87(filt)
     908550   451.645    0.000   451.645    0.000 utils.py:425(convertToStr)
17879/34159   436.846    0.024   982.969    0.029 decoder.py:162(JSONObject)
     149859   422.248    0.003   847.468    0.006 expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
9570/612060   268.384    0.028   358.737    0.001 minidom.py:305(_get_elements_by_tagName_helper)
      41892   267.465    0.006   347.702    0.008 vm.py:2935(Vm._getGraphicsStats)
Created attachment 977345 [details] 82 vms numa invoke commented
(In reply to Eldad Marciano from comment #36)
> By commenting out the NUMA stats invocation, the results are the same for
> 82 running vms (the following patch http://gerrit.ovirt.org/#/c/36626/ was
> applied too):
>
> Wed Jan 7 14:27:52 2015
> vdsmd_82vms_numacommented_fileSD_numUtils_patched_cpu.prof

This means that, as Nir pointed out in https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c27, this is probably a bug in the profiler.

Further evidence supporting this theory: the code in vdsm/numaUtils.py runs when getVmNumaNodeRuntimeInfo is invoked - which is what getAllVmStats eventually does. However, the second line of code in that function is

    if 'guestNumaNodes' in vm.conf:

and this check covers the whole function. So, unless Engine sends an explicit NUMA configuration, the cost of that function is negligible - it returns just an empty dict(). That configuration can only be sent when the VM is created.

I ran a quick grep on the affected host, and that configuration tunable is NOT present. I grep'd all the available logs, not only one, but Eldad, please confirm whether I'm wrong.

If the above is correct, the bottleneck is elsewhere. However, my gut feeling is that VDSM is simply overloaded, and to properly solve that we need deep changes (e.g. thread reduction, more C speedups).
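For reference, a simplified sketch (not the exact vdsm code) of the guard described above; _collect_numa_runtime_info is a hypothetical placeholder for the expensive supervdsm/XML path:

def getVmNumaNodeRuntimeInfo(vm):
    # Cheap no-op unless the engine sent an explicit NUMA topology when the
    # VM was created; only then is the expensive collection path taken.
    result = {}
    if 'guestNumaNodes' in vm.conf:
        result = _collect_numa_runtime_info(vm)  # hypothetical helper, costly path
    return result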
Hm, then the question is whether we can trust the profiling data at all... Nir suggested Python 2.7; maybe we can check whether we get different results?
The getAllVolumes issue is now tracked in bug 1179950.
(In reply to Nir Soffer from comment #40)
> The getAllVolumes issue is now tracked in bug 1179950.

Can you also please change the bug's title to reflect this?
If we assume a profiler problem in the NUMA-related code, then the next CPU bottleneck seems to be the guest agent filtering, so it makes sense to give it a try with the guest agent disabled inside the VMs. It's not relevant in the getAllVmStats call itself, but it can/should bring down the CPU usage significantly... so it would be worth reviewing the overall host load and vdsm total CPU usage then, and analyzing the profile further if the responsiveness is still bad.
Gil, I don't believe this is a blocker. It is not a recent regression, rather a scaling issue we've developed over time.
Michael, I've asked Eldad to try the same flow with 3.4, to make sure it is not a performance degradation. If we don't see a degradation when comparing 3.5 to 3.4, I'm fine with removing the blocker flag.
Doron, we need someone from SLA to start/keep looking at the NUMA issue. It turns out it may not be the actual problem on Eldad's setup, as NUMA is not in use (hinting at a problem with the profiler), but this will most certainly be a big issue once NUMA is enabled. So the relevant code needs to change ASAP.
Investigating 3.4.4 (av14), a regression was found.

Both 3.4 and 3.5 were tested under the following conditions:
- 82 vms
- 1800 sec CPU profiling
- same storage
- same hardware

3.4 vdsmd CPU% is ~500% (peaks 850%).
3.5 vdsmd CPU% is ~1600% (peaks 2100%).

Adding the profiler results.
Created attachment 978742 [details] 3.4 av14 82 vms profile
Based on comment #47, I suggest we move this BZ back to 3.5 GA and provide a fix for the performance degradation.
(In reply to Eldad Marciano from comment #47)
> Investigating 3.4.4 (av14), a regression was found.
>
> Both 3.4 and 3.5 were tested under the following conditions:
> - 82 vms
> - 1800 sec CPU profiling
> - same storage
> - same hardware
>
> 3.4 vdsmd CPU% is ~500% (peaks 850%).
> 3.5 vdsmd CPU% is ~1600% (peaks 2100%).
>
> Adding the profiler results.

Eldad, interesting results, but please read comment 28 and comment 45 again. What you tested is not a supported usage of vdsm; it is overloaded in both 3.4.4 and 3.5. For useful profiles, we need a profile where vdsm average cpu usage is less than 100%. For 3.5, we need both xmlrpc and jsonrpc profiles, to detect whether this is a jsonrpc-related issue.
(In reply to Eldad Marciano from comment #47)
> Investigating 3.4.4 (av14), a regression was found.
>
> Both 3.4 and 3.5 were tested under the following conditions:
> - 82 vms
> - 1800 sec CPU profiling
> - same storage
> - same hardware
>
> 3.4 vdsmd CPU% is ~500% (peaks 850%).
> 3.5 vdsmd CPU% is ~1600% (peaks 2100%).
>
> Adding the profiler results.

In addition to what Nir pointed out in https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c50: 3.5 vdsmd is running with NUMA disabled (code commented out), right?
adding back needinfo from comment #46
(In reply to Michal Skrivanek from comment #52)
> adding back needinfo from comment #46

Following one of the profiles:

$ profile-stats -c vdsmd_20vms_CPUTIME.prof
Mon Jan 5 21:44:50 2015    vdsmd_20vms_CPUTIME.prof

186697006 function calls (190034885 primitive calls) in 5019.501 seconds

Ordered by: internal time
List reduced from 2629 to 20 due to restriction <20>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     214840  866.080    0.004  866.080    0.004 fileSD.py:444(<genexpr>)
309450/309987  233.689    0.001  283.696    0.001 spark.py:211(Parser.buildState)
    2298459  190.133    0.000  203.422    0.000 numaUtils.py:22(<module>)

We can use the callers sections to tell who is calling these functions:

Ordered by: internal time
List reduced from 2629 to 20 due to restriction <20>

Function                          was called by...
                                      ncalls   tottime  cumtime
fileSD.py:444(<genexpr>)          <-  214840   866.080  866.080  fileSD.py:414(NfsStorageDomain.getAllVolumes)
spark.py:211(Parser.buildState)   <-  309450   232.895  283.696  spark.py:187(Parser.parse)
                                       537/0     0.793    4.649  spark.py:211(Parser.buildState)
numaUtils.py:22(<module>)         <- 2298458   190.133  203.422  guestagent.py:80(<genexpr>)
                                           1     0.000    0.000  vm.py:23(<module>)

This is guestagent.py:80(<genexpr>) - looking at line 80:

 80    return ''.join(filterXmlChar(c) for c in u)

This is part of guestagent._filterXMLChars(). Sorting by cumulative time, we can see that this is the guest agent channel listening code:

$ profile-stats -c -r50 -scumulative vdsmd_20vms_CPUTIME.prof
Mon Jan 5 21:44:50 2015    vdsmd_20vms_CPUTIME.prof

186697006 function calls (190034885 primitive calls) in 5019.501 seconds

Ordered by: cumulative time
List reduced from 2629 to 50 due to restriction <50>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
<snip>
      35457    6.519    0.000  566.485    0.016 guestagent.py:442(GuestAgent._onChannelRead)
19288/19290    5.414    0.000  557.347    0.029 guestagent.py:419(GuestAgent._handleData)
21218/21224    1.984    0.000  550.738    0.026 guestagent.py:411(GuestAgent._processMessage)
      21224    2.141    0.000  531.987    0.025 guestagent.py:461(GuestAgent._parseLine)
      75550    8.230    0.000  493.572    0.007 vm.py:2804(Vm.getStats)
14160/34416    0.268    0.000  488.231    0.014 Parser.py:423(Evaluator.c_if)
      80648   11.704    0.000  435.757    0.005 spark.py:187(Parser.parse)
      19356    6.237    0.000  434.263    0.022 __init__.py:461(JsonRpcServer._serveRequest)
      75550  132.304    0.002  420.172    0.006 vm.py:2849(Vm._getRunningVmStats)
20236/26978    1.953    0.000  415.520    0.015 Parser.py:389(Evaluator.c_let)
      19356    4.012    0.000  363.549    0.019 Bridge.py:249(DynamicBridge._dynamicMethod)
      21224    1.101    0.000  362.481    0.017 guestagent.py:83(_filterObject)
21224/709226   54.907    0.003  361.380    0.001 guestagent.py:87(filt)
     275340   21.559    0.000  298.404    0.001 guestagent.py:47(_filterXmlChars)
130234/130241    4.935    0.000  287.784    0.002 __init__.py:1034(Logger.debug)
645047/645119    9.435    0.000  286.679    0.000 __builtin__:0(<method 'join' of 'str' objects>)
309450/309987  233.689    0.001  283.696    0.001 spark.py:211(Parser.buildState)
     103302    6.133    0.000  283.099    0.003 __init__.py:1155(Logger._log)
2573777/2573798   65.788    0.000  269.210    0.000 guestagent.py:80(<genexpr>)

Looks like the numaUtils line is a bug in the profiler; probably generator expressions are not recorded correctly.
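As a side note on why <genexpr> rows can be misleading: a generator expression is a separate code object, and profilers attribute its time to the line where the expression is defined rather than to the caller consuming it. The tiny self-contained example below (not vdsm code, and it does not reproduce the suspected yappi misattribution itself) shows the effect with cProfile:

# Toy example: the genexpr's time is reported against build_filtered()'s line,
# not against consumer(), which can make profile output confusing.
import cProfile
import pstats

def build_filtered(text):
    return ''.join(c for c in text if c.isalnum())   # <genexpr> lives on this line

def consumer():
    return build_filtered("abc123!@#" * 100000)

profiler = cProfile.Profile()
profiler.enable()
consumer()
profiler.disable()
pstats.Stats(profiler).sort_stats("time").print_stats(5)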
(In reply to Nir Soffer from comment #53)
> Looks like the numaUtils line is a bug in the profiler; probably
> generator expressions are not recorded correctly.

Makes sense. In a former test the numaUtils code was disabled by commenting it out entirely, and VDSM stats (cpu usage) didn't improve significantly.
Created attachment 979147 [details] 25 vms low cpu version 3_4_4
(In reply to Eldad Marciano from comment #55)
> Created attachment 979147 [details]
> 25 vms low cpu version 3_4_4

Would you please always add rough CPU/load usage during the test? Also, I don't see the guest agent filtering in this test - was it there and working?
82 vms is irrelevant due to excessive overcommitment of vcpus/cores. The load on the system is ~60, but it is still working, so no real issue there. I want to see at most 20-25 vms and compare the total cpu usage of vdsm, under the same conditions, to be able to tell anything.
Created attachment 979602 [details] 25 vms pure xml rpc version 3.5
Created attachment 979614 [details] 25 vms pure json rpc version 3.5
Nir, regarding the json vs xml rpc comparison, I have added the profiler results. By a first, short analysis, the following modules are the heaviest (both xml and json):

dispatcher.py
HSM.prepareImage
API.py:355(VM.getStats)
clientIF.py:267(clientIF.prepareVolumePath)
logUtils.py:40(wrapper)

I think our next step is to tackle each one of those and easily raise the number of running vms.
On the affected host something's not looking right - I mean besides the VDSM slowness :)
Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from master: 100 VMs without GA (nor OS installed) - but only because of memory constraints: if I create VMs with enough RAM to be able to run an OS with GA, I can't run enough of them to stress the box, so it is a tradeoff. If needed, I can run fewer VMs, with GA and with 3.5.0 VDSM.

top - 14:33:04 up 1:03, 1 user, load average: 0.30, 1.41, 2.27
Tasks: 593 total, 2 running, 590 sleeping, 0 stopped, 1 zombie
Cpu(s): 10.0%us, 16.3%sy, 0.0%ni, 73.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16215968k total, 3376340k used, 12839628k free, 101440k buffers
Swap: 8175612k total, 0k used, 8175612k free, 364252k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 3679 vdsm      0 -20 6732m 214m 9892 S 94.9  1.4  4:23.01 vdsm
 2445 root     20   0 1145m  20m 5532 R 44.8  0.1  6:32.71 libvirtd

We know that VDSM eats CPU and burns libvirt: this very data is further confirmation. But the numbers on the affected box seem too bad to be true.
(In reply to Francesco Romani from comment #61)
> On the affected host something's not looking right. I mean besides VDSM
> slowness :)
> Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from
> master:

TOji> 14:33:04 root [~]$ rpm -qa | grep libvirt
libvirt-0.10.2-46.el6_6.2.x86_64

TOji> 14:37:24 root [~]$ rpm -qa | grep vdsm
vdsm-jsonrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-python-zombiereaper-4.16.8-8.gitf16901f.el6.noarch
vdsm-debuginfo-4.17.0-166.git773c9db.el6.x86_64
vdsm-cli-4.17.0-166.git773c9db.el6.noarch
vdsm-python-4.17.0-166.git773c9db.el6.noarch
vdsm-infra-4.17.0-166.git773c9db.el6.noarch
vdsm-yajsonrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-xmlrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-4.17.0-166.git773c9db.el6.x86_64

TOji> 14:37:29 root [~]$ rpm -qa | grep kernel
kernel-2.6.32-431.23.3.el6.x86_64

TOji> 14:37:34 root [~]$ rpm -qa | grep qemu
qemu-img-rhev-0.12.1.2-2.415.el6_5.14.x86_64
gpxe-roms-qemu-0.9.7-6.12.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.415.el6_5.14.x86_64
(In reply to Francesco Romani from comment #61)
> On the affected host something's not looking right. I mean besides VDSM
> slowness :)
> Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from
...
> We know that VDSM eats CPU and burns libvirt: this very data is further
> confirmation. But the numbers on the affected box seem too bad to be true.

You are testing block storage with 100 disks (not based on templates?), but Eldad is using NFS with 3000 disks based on a template (6000 volumes).
(In reply to Eldad Marciano from comment #59)
> Created attachment 979614 [details]
> 25 vms pure json rpc version 3.5

Thanks Eldad!
(In reply to Eldad Marciano from comment #58)
> Created attachment 979602 [details]
> 25 vms pure xml rpc version 3.5

Eldad, can you tell us how long each profile was running, and what was the average cpu usage for each version? We want to see not only which function is at the top of the profile, but also how many calls were made in each variant.
Both tests ran for ~900 sec.
The cpu average was ~5-10%.
(In reply to Eldad Marciano from comment #66)
> Both tests ran for ~900 sec.
> The cpu average was ~5-10%.

And it was all working +- ok, right? So this is not really a blocker, nor a regression between xmlrpc and jsonrpc.

You didn't add the average cpu usage on 3.4, but I would guess, based on the profiles, that it was in the same ballpark as 3.5...

(Retargeting to 3.5.1, as internal testing and examination of the running system doesn't show anything horrendous.)
Another interesting element: looking at the profile analysis, logUtils hits the performance hard.

task.py:1155(Task.prepare)                  <- 159/158  0.043  387.273  task.py:99(Task.wrapper)
task.py:869(Task._run)                      <- 159/158  0.006  383.882  task.py:1155(Task.prepare)
logUtils.py:40(wrapper)                     <- 159/158  0.089  383.862  task.py:869(Task._run)
vm.py:2339(Vm.preparePaths)                 <- 16/15    0.000  379.952  clientIF.py:491(clientIF._recoverVm)
clientIF.py:267(clientIF.prepareVolumePath) <- 48/47    0.003  379.950  vm.py:2339(Vm.preparePaths)
hsm.py:3188(HSM.prepareImage)               <- 16/15    0.000  379.709  logUtils.py:40(wrapper)
(In reply to Eldad Marciano from comment #60)
> Nir, regarding the json vs xml rpc comparison, I have added the profiler
> results.
> By a first, short analysis, the following modules are the heaviest (both
> xml and json):
>
> dispatcher.py
> HSM.prepareImage
> API.py:355(VM.getStats)

Here are some patches for review to improve getStats/getAllVmStats performance. These patches are intentionally kept small to make them backportable to the stable branch. We all know that we'll most likely need much deeper changes to improve significantly.

This improves Vm._getGraphicsStats a little: http://gerrit.ovirt.org/#/c/36791/
It is a straightforward and quite safe fix, albeit of little help - this method is very low in the profile.

getAllVmStats seems suspicious; here's an (admittedly debatable) attempt to make it faster: http://gerrit.ovirt.org/#/c/35140/1

Speaking of Vm._getRunningVmStats(), the problem here is that the code _has_ to do some expensive translation/sanitization. This is also the source of most (if not all) of the calls to convertToStr. Fixing this is feasible, but we'll probably need a microbenchmark/localized test suite, and I'm not sure it is OK for the stable branch - not anytime soon, anyway.

All that said, I'm still not convinced about the existence of the _major_ performance regression that this BZ claims.
(In reply to Michal Skrivanek from comment #67)
> (In reply to Eldad Marciano from comment #66)
> > Both tests ran for ~900 sec.
> > The cpu average was ~5-10%.
>
> And it was all working +- ok, right? So this is not really a blocker, nor a
> regression between xmlrpc and jsonrpc.
>
> You didn't add the average cpu usage on 3.4, but I would guess, based on the
> profiles, that it was in the same ballpark as 3.5...
>
> (Retargeting to 3.5.1, as internal testing and examination of the running
> system doesn't show anything horrendous.)

I asked Eldad to do new 3.5 profiles for 1800 seconds, so we can compare with the 3.4 profile.

Let's wait until we have enough data.
Created attachment 979928 [details] 25 vms 1800 sec 3_5 jsonrpc
Created attachment 979929 [details] 25 vms 1800 sec 3_4
Finally an apples-to-apples comparison. Please add a high-level observation of system load and vdsm CPU usage.

From the profiles it looks like there's no real difference between 3.4 and 3.5.
opened follow up bugs on SLA issues around MOM(bug 1182097) and NUMA (bug 1182094)
The cpu average was ~5-10% for both. But a difference and a regression can be seen when running vms at large scale:
https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c47
(In reply to Nir Soffer from comment #71)
> (In reply to Michal Skrivanek from comment #67)
> > (In reply to Eldad Marciano from comment #66)
> > > Both tests ran for ~900 sec.
> > > The cpu average was ~5-10%.
> >
> > And it was all working +- ok, right? So this is not really a blocker, nor
> > a regression between xmlrpc and jsonrpc.
> >
> > You didn't add the average cpu usage on 3.4, but I would guess, based on
> > the profiles, that it was in the same ballpark as 3.5...
> >
> > (Retargeting to 3.5.1, as internal testing and examination of the running
> > system doesn't show anything horrendous.)
>
> I asked Eldad to do new 3.5 profiles for 1800 seconds, so we can compare
> with the 3.4 profile.
>
> Let's wait until we have enough data.

When ignoring the calls that come from 'task.py Task._run', the results are much better.
(In reply to Eldad Marciano from comment #76)
> The cpu average was ~5-10% for both.

Then we're good ;-) Decreasing severity and removing the zstream request.

> But a difference and a regression can be seen when running vms at large scale.

As I said before, that's irrelevant on such a small host. Please get a bigger one, maybe Yuri's setup or something... but you should not have more vcpus than the physical ones.
(In reply to Michal Skrivanek from comment #74)
> Finally an apples-to-apples comparison. Please add a high-level observation
> of system load and vdsm CPU usage.
>
> From the profiles it looks like there's no real difference between 3.4 and
> 3.5.

There is a difference - assuming that both ran for the same time (1800 seconds) and processed the same amount of requests from engine, the 3.5 version is using 25% more cpu time.

Fri Jan 16 23:55:00 2015    vdsmd_25vms_3_4_1800sec_cpu.prof

16047048 function calls (17034423 primitive calls) in 1530.915 seconds

Ordered by: internal time
List reduced from 1945 to 10 due to restriction <10>

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 275100  512.984    0.002  512.984    0.002 fileSD.py:448(<genexpr>)
     25  121.375    8.670  121.499    4.860 sampling.py:358(VmStatsThread.run)
     44   67.283    4.205  145.745    3.312 threading.py:481(Thread.run)
 190183   40.683    0.014   40.689    0.000 minidom.py:305(_get_elements_by_tagName_helper)
  48659   40.489    0.001  108.048    0.002 expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
  83826   32.884    0.000   35.018    0.000 StringIO.py:208(StringIO.write)
  83943   23.057    0.000   23.207    0.000 minidom.py:349(Attr.__init__)
 117222   22.567    0.000   25.333    0.000 minidom.py:281(Document._append_child)
  62705   22.006    0.000   24.930    0.000 spark.py:103(Parser.addRule)
  68482   21.836    0.000   37.521    0.001 expatbuilder.py:274(ExpatBuilderNS.character_data_handler_cdata)

Fri Jan 16 23:54:48 2015    vdsmd_25vms_3_5_jsonrpc_1800sec_cpu.prof

80952438 function calls (82245529 primitive calls) in 1921.656 seconds

Ordered by: internal time
List reduced from 2637 to 10 due to restriction <10>

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 275075  611.045    0.002  611.045    0.002 fileSD.py:444(<genexpr>)
  36131   56.048    0.002  180.522    0.005 vm.py:2849(Vm._getRunningVmStats)
1673681   41.673    0.000   65.290    0.000 protocoldetector.py:94(MultiProtocolAcceptor._process_events)
 157109   40.947    0.000   91.926    0.001 spark.py:211(Parser.buildState)
 201940   32.269    0.000   38.662    0.000 spark.py:103(Parser.addRule)
  40388   24.992    0.001   29.322    0.001 spark.py:146(Parser.makeFIRST)
  16890   23.803    0.001   52.007    0.003 inspect.py:247(getmembers)
 926097   21.156    0.000   21.156    0.000 utils.py:425(convertToStr)
  80776   19.588    0.000   25.824    0.000 spark.py:28(_namelist)
1712725   19.368    0.000   19.368    0.000 __builtin__:0(poll)

1. FileStorageDomain.getAllVolumes
----------------------------------

fileSD.py:444(<genexpr>) is called by FileStorageDomain.getAllVolumes, which takes about 32% of cpu time. This time is spent only when starting and recovering vms. This issue is tracked in bug 1179950 for 3.5.1.

2. VM.getStats
--------------

vm.py:2849(Vm._getRunningVmStats) takes 9.3% of cpu time, and we don't see related code at the top of the 3.4 profile. If we follow the function calling this in 3.5, we see that it is called by API.py:(VM.getStats).

Comparing 3.4 and 3.5 results for this function:

3.4:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  19042    1.360    0.000  210.361    0.011 API.py:349(VM.getStats)

3.5:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  36131    4.779    0.000  271.401    0.008 API.py:355(VM.getStats)

3.5 is spending more time in this function, and calling it much more.
Looking in the callers:

3.4:

 ncalls  tottime  cumtime
   2400    0.108   15.787  BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
   8321    0.669   99.654  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
   8321    0.583   94.921  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)

3.5:

 ncalls  tottime  cumtime
   1976    0.130    8.933  API.py:1274(Global.getAllVmStats)
   7950    0.802   43.623  Bridge.py:249(DynamicBridge._dynamicMethod)
   8727    1.335   75.072  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
   8727    1.257   72.633  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
   8727    1.245   71.141  vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo)

vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo) is responsible for the difference - this is the cost of adding this feature.

Bridge.py:249(DynamicBridge._dynamicMethod) (jsonrpc) is calling this function 7950 times, while in 3.4 BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats) calls it only 2400 times - we should understand why.

3. MultiProtocolAcceptor._process_events
----------------------------------------

protocoldetector.py:94(MultiProtocolAcceptor._process_events) takes 3.3% of the cpu time in 3.5 - this is already fixed in 3.5.1 - see bug 1177533.

4. MOM
------

spark.py:211(Parser.buildState) is taking much more time on 3.5 - it is called much more and consumes 40 seconds *in* the function in 3.5, but 0 seconds in 3.4 - looks like a bad change in this function.

3.4:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  47369    0.000    0.000   52.608    0.001 spark.py:211(Parser.buildState)

3.5:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 157109   40.947    0.000   91.926    0.001 spark.py:211(Parser.buildState)

I suggest repeating this profiling on a much bigger machine so we can check the handling of many more vms.
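The caller breakdowns above can be reproduced with the standard pstats module as well - a small sketch, using the profile file names from the headers shown above:

# Sketch: pull the VM.getStats numbers and their callers out of the two
# profiles compared above (file names as in the profile headers).
import pstats

for prof in ("vdsmd_25vms_3_4_1800sec_cpu.prof",
             "vdsmd_25vms_3_5_jsonrpc_1800sec_cpu.prof"):
    stats = pstats.Stats(prof)
    stats.sort_stats("cumulative")
    stats.print_stats("API.py", 5)      # top API.py entries
    stats.print_callers("getStats")     # who calls VM.getStats, and how often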
Created attachment 981166 [details] vdsmd_82vms_pure_json_1800sec_cpu
(In reply to Nir Soffer from comment #81)
> 2. VM.getStats
> --------------
>
> vm.py:2849(Vm._getRunningVmStats) takes 9.3% of cpu time, and we don't see
> related code at the top of the 3.4 profile. If we follow the function
> calling this in 3.5, we see that it is called by API.py:(VM.getStats).
>
> Comparing 3.4 and 3.5 results for this function:
>
> 3.4:
>
>  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   19042    1.360    0.000  210.361    0.011 API.py:349(VM.getStats)
>
> 3.5:
>
>  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   36131    4.779    0.000  271.401    0.008 API.py:355(VM.getStats)
>
> 3.5 is spending more time in this function, and calling it much more.
>
> Looking in the callers:
>
> 3.4:
>
>  ncalls  tottime  cumtime
>    2400    0.108   15.787  BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
>    8321    0.669   99.654  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
>    8321    0.583   94.921  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
>
> 3.5:
>
>  ncalls  tottime  cumtime
>    1976    0.130    8.933  API.py:1274(Global.getAllVmStats)
>    7950    0.802   43.623  Bridge.py:249(DynamicBridge._dynamicMethod)
>    8727    1.335   75.072  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
>    8727    1.257   72.633  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
>    8727    1.245   71.141  vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo)
>
> vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo) is responsible for the
> difference - this is the cost of adding this feature.
>
> Bridge.py:249(DynamicBridge._dynamicMethod) (jsonrpc) is calling this
> function 7950 times, while in 3.4 BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
> calls it only 2400 times - we should understand why.

Note that vdsmInterface is MOM; the only part in vdsm calling getStats is actually more efficient in 3.5. getAllVmStats should be called exactly 900 times (every 15s); since it's not, it's either the monitoring script or MOM causing the extra calls.
Eldad, as I asked already: can we disable the MOM part (by removing the rules) to see the decrease in the parsing, and possibly in the calls to getStats? It should significantly lower the usage.
https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c36
(In reply to Eldad Marciano from comment #85)
> https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c36

I want to disable MOM, not NUMA. NUMA is already inactive on your system (and it would be interesting to see the effect when it is enabled).
Ignoring MOM dramatically reduces the CPU usage.
Created attachment 982953 [details] vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof
(In reply to Eldad Marciano from comment #91)
> Created attachment 982953 [details]
> vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

Unfortunately this is a different thing than advertised. It seems to have been running for 259 seconds and includes VM creation. Still, very helpful - I'm spinning off a new bug for host monitoring flaws ;-)
Eldad, since most of the child bugs are in, would it be possible to try to reproduce it in latest 3.6?
Sure thing. Will it be OK to test on top of these versions?
- engine 3.6.0-0.11.master.el6
- vdsm vdsm-4.17.3-1.el7ev.noarch
Yes, that's fine. Those fixes have been in for quite some time.
(pending results from scale run)
Yaniv, the storage side is tracked in bug 1179950; it was fixed a long time ago and backported to 3.5. Removing the storage patch to prevent confusion.
Eldad, any updates?
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
(In reply to Red Hat Bugzilla Rules Engine from comment #99)
> Bug tickets that are moved to testing must have target release set to make
> sure tester knows what to test. Please set the correct target release before
> moving to ON_QA.

Dear bot,
I'm sure the tester has enough information about where to test it. We have a Build ID field. I believe the status should have priority; messing with the status creates more chaos than a missing target release field.
(In reply to Michal Skrivanek from comment #100)
> (In reply to Red Hat Bugzilla Rules Engine from comment #99)
> > Bug tickets that are moved to testing must have target release set to make
> > sure tester knows what to test. Please set the correct target release before
> > moving to ON_QA.
>
> Dear bot,
> I'm sure the tester has enough information about where to test it. We have a
> Build ID field. I believe the status should have priority; messing with the
> status creates more chaos than a missing target release field.

Build ID is going away soon for milestones and releases. We have had many bugs bounce back due to the wrong package being tested when no package version was given.
(In reply to Michal Skrivanek from comment #98)
> Eldad, any updates?

Not yet, we have to deal with other priorities; will update ASAP.
Verified on top of vdsm-4.17.12-0.el7ev.noarch (rhevm 3.6.2), using default settings + CPU affinity on.
Response time is ~1.2 sec:

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.268s
user    0m0.952s
sys     0m0.075s

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.276s
user    0m0.972s
sys     0m0.055s

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.272s
user    0m0.945s
sys     0m0.041s
(In reply to Eldad Marciano from comment #103)
> Verified on top of vdsm-4.17.12-0.el7ev.noarch (rhevm 3.6.2), using default
> settings + CPU affinity on.
> Response time is ~1.2 sec:
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.268s
> user    0m0.952s
> sys     0m0.075s
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.276s
> user    0m0.972s
> sys     0m0.055s
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.272s
> user    0m0.945s
> sys     0m0.041s

Did you test with at least 66 vms running?

Thanks,
Sven