Description of problem:
The 'getAllVmStats' method is running very slowly (~22 sec), with wide impact: it hurts both vdsm and engine performance.

Version-Release number of selected component (if applicable): 3.5 VT 13.5

How reproducible: 100%

Steps to Reproduce:
1. run 66 vms
2. run vdsClient -s 0 getAllVmStats

Actual results: slow statistics scan.

Expected results: fast results.

Additional info:
Some findings related to this bug: the actual API call is delayed due to some resource starvation. In the following strace we can see a system wait call while using vdsClient:

access("/usr/bin/python", R_OK)         = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fedaff369d0) = 10817
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x43d2f0, [], SA_RESTORER, 0x3335c326a0}, {SIG_DFL, [], SA_RESTORER, 0x3335c326a0}, 8) = 0
wait4(-1,   # here we wait up to 4 sec, and sometimes even more, like 10 sec

Using cProfile, delays can be seen in the following:

412823 function calls (411408 primitive calls) in 20.331 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      6    0.000    0.000   19.535    3.256 socket.py:409(readline)
      1    0.000    0.000   20.166   20.166 vdsClient.py:466(do_getAllVmStats)
      1    0.000    0.000   20.157   20.157 xmlrpclib.py:1479(__request)
      1    0.036    0.036   20.337   20.337 {execfile}

Further investigation required.
Created attachment 974121 [details] python profiler results
Created attachment 974122 [details] strace log
Eldad,
When opening a bug please give all the relevant environmental details:
1 - system you are running it on (1 engine, 1 host? hosted engine?)
2 - the title implies that hooks delay the response; what hooks are installed? what is the time when no hooks are used?
3 - vdsm logs?
4 - is this a new test (only 3.5?) or do you have the numbers also on 3.4?
1. Setup distribution:
- 1 engine
- 1 cluster \ sd
- 37 hosts
- 3K vms.
Host spec (engine + hosts): 24 cores, 64GB RAM.

2. About the hooks: I meant the hooks.py module; it is used by the API module in some of the basic actions like 'getAllVmStats'.

3. There are no errors in the logs; that is why I didn't place them here.
4. It's not a new test; we are just using real (physical) hardware.
(In reply to Eldad Marciano from comment #6)
> 1. Setup distribution:
> - 1 engine
> - 1 cluster \ sd
> - 37 hosts
> - 3K vms.
>
And the issue occurs on all hosts?

> Host spec (engine + hosts): 24 cores, 64GB RAM.
>
> 2. About the hooks: I meant the hooks.py module; it is used by the API
> module in some of the basic actions like 'getAllVmStats'.
>
But how do you know that the reason for the performance hit is this specific module? Are there any hooks applied?

> 3. There are no errors in the logs; that is why I didn't place them here.

Logs are important anyhow.
I used JProfiler on the engine side to understand why starting vms gets slow during our ramp-up process. The vdsbroker.getallvmstat.VDSCommand runs ~20 sec, and so do other vds API methods. On the vdsm side I used vdsClient to trigger getAllVmStats while running the python profiler, and the problem can clearly be seen. Bottom line: both the engine and vdsm profilers point to long execution time on the vdsm side. I suggest we set up a debug session together since the logs are huge.
O.K., so this is an issue that was identified by you as a problem (66 vms with the above configuration on a single host) - getAllVmStats takes 22 sec. IIUC you have already used the python profiler to identify that hooking.py plays a significant role in the delay (assuming no hooks were actually deployed). Then there is some other info I would like to see/know:
1 - profile of a single VM (mem, disks, nics, OS, with guest agent?)
2 - can you share the entire single-flow profiler output (assuming the engine did not hammer that vdsm with additional API calls at the time)?
3 - what were the results for this test on 3.4?
4 - what was the load (let's start with load average & mem consumption) on the hypervisor at that time?
1. Single vm profile: 1 CPU, 256 MB RAM, 10GB disk (thin provisioned), OS RHEL 6.5 with guest agent.

2. Already done: strace of 'vdsClient -s 0 getAllVmStats', and the python profiler results are attached to the bug. As you suggested, the engine was shut down to avoid other API calls, and I get the same results. I have added some more logging in the code to understand the flow better.

Results for 'time vdsClient -s 0 getAllVmStats' (with no engine running):
...
real    0m22.298s  #total time
user    0m0.820s
sys     0m0.157s

By the logs, the actual API call time is 6 sec:
Thread-198511::DEBUG::2014-12-31 09:31:36,298::API::1296::vds::(getAllVmStats) STATLOG: total by 6.83962011337 sec
Thread-198511::DEBUG::2014-12-31 09:31:36,307::BindingXMLRPC::487::vds::(getAllVmStats) STATLOGFROMXML getAllVmStats time is 6.852

So ~16 more seconds are spent waiting (see the strace output I mentioned above). Another thing is that the results are inconsistent (we always have delays of anywhere between a few seconds and ~22 sec).

3. We don't have a 3.4 setup, so I can't tell what the status is there. We also tested 3.4 via a simulator; this bug shows up on real hardware.

4. Load statistics: load average ~20%, memfree ~40GB.
Nir, can you please take a look and tell us where the vdsm time was spent? Keep in mind that when using vdsClient, the protocol used is XML-RPC.
getAllVmStats always took some time to return when running 20+ vms, or even fewer. It gathers many details about the vms that are retrieved partially from libvirt, so running 50+ vms will obviously affect the time. We already tried to improve the timing for that verb, but this is not something new.
This is very bad. As part of the scale effort we should handle more than 100 vms per host in reasonable time.

See the following timing results for 66 vms:

]# time vdsClient -s 0 getAllVmStats &> /dev/null

real    0m43.798s
user    0m1.044s
sys     0m0.139s

By the logs it looks like this:

tail -f /var/log/vdsm/vdsm.log |grep 'TRACEHOOKS\|STATLOG' |grep -v 'GuestMonitor'
Thread-233::DEBUG::2015-01-01 18:10:29,461::API::1281::vds::(getAllVmStats) STATLOG: before_get_all_vm_stats takes 0.583 sec
Thread-233::DEBUG::2015-01-01 18:10:39,266::API::1290::vds::(getAllVmStats) STATLOG: collect vms statistics 1 by 1 takes 7.496 sec
Thread-233::DEBUG::2015-01-01 18:10:40,331::API::1294::vds::(getAllVmStats) STATLOG: getAllVmStats pars statistics runs by 0.802846908569 sec
Thread-233::DEBUG::2015-01-01 18:10:40,614::API::1296::vds::(getAllVmStats) STATLOG: total by 11.7333049774 sec
Thread-233::DEBUG::2015-01-01 18:10:40,896::BindingXMLRPC::487::vds::(getAllVmStats) STATLOGFROMXML getAllVmStats time is 12.023

As you can see, the actual vm data collection + parsing takes only 12 sec (which is not so fast), while the overall vdsClient time is 43 sec. We have a huge gap.
Eldad, what we need to do is profile vdsm - not vdsClient - because we want to understand vdsm performance, not vdsClient performance.

Profiling with cProfile is useless here because vdsm runs many threads, and cProfile profiles only the thread where the profiler was started. You need to install yappi on the host you want to profile. The best way is using pip (you may need to install pip first):

pip install yappi

Then you should enable profiling in the vdsm configuration:

# /etc/vdsm/vdsm.conf
[vars]
profile_enable = true
profile_builtins = true
profile_clock = cpu

And stop vdsm.

To create a profile:
1. start vdsm
2. start the vms you need for the tests
3. run vdsm under normal conditions from engine for 5-10 minutes
4. stop vdsm

The profile is saved in /var/run/vdsm/vdsmd.prof - copy this file elsewhere because it will be overwritten the next time vdsm stops.

Disable profiling before starting vdsm again:

[vars]
profile_enable = false

Please try to get this profile and upload the file here. From the profile we can extract useful statistics. The easiest way is using contrib/profile-stats from the vdsm source.
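If contrib/profile-stats is not at hand, the saved profile can also be inspected with Python's standard pstats module. This is a minimal sketch, assuming the file vdsm writes is pstats-compatible (which is how the reports later in this bug were produced); the file name comes from the path above:

# Minimal sketch: read a copied vdsm profile with the standard pstats module.
# Assumes the file saved by vdsm (/var/run/vdsm/vdsmd.prof) is in a
# pstats-compatible format.
import pstats

stats = pstats.Stats("vdsmd.prof")
stats.strip_dirs()               # drop long path prefixes
stats.sort_stats("cumulative")   # or "time" for internal time
stats.print_stats(20)            # top 20 entries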
From an initial profile, we have one issue: an O(N^2) search in NfsStorageDomain.getAllVolumes.

[root@host06-rack04 ~]# ./profile-stats -c -scumulative vdsmd.prof
Sun Jan 4 17:39:11 2015    vdsmd.prof

24867930 function calls (25536205 primitive calls) in 1450.530 CPU seconds

Ordered by: cumulative time
List reduced from 2723 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
89868/89869  361.385    0.004  361.385    0.004 fileSD.py:444(<genexpr>)
  117/118    0.026    0.000  359.611    3.048 dispatcher.py:65(wrapper)
  117/118    0.010    0.000  359.325    3.045 task.py:99(Task.wrapper)
  117/118    0.038    0.000  359.315    3.045 task.py:1155(Task.prepare)
  117/118    0.005    0.000  355.929    3.016 task.py:869(Task._run)
  117/118    0.061    0.001  355.902    3.016 logUtils.py:40(wrapper)
      8/9    0.000    0.000  353.487   39.276 vm.py:2339(Vm.preparePaths)
    24/25    0.001    0.000  353.485   14.139 clientIF.py:267(clientIF.prepareVolumePath)
      8/9    0.057    0.007  353.311   39.257 hsm.py:3188(HSM.prepareImage)
      8/9    0.000    0.000  352.837   39.204 fileSD.py:414(NfsStorageDomain.getAllVolumes)

Ordered by: cumulative time
List reduced from 2723 to 20 due to restriction <20>

Function                   was called by...
                               ncalls       tottime  cumtime
fileSD.py:444(<genexpr>) <-    89869/89868  361.385  361.385  fileSD.py:414(NfsStorageDomain.getAllVolumes)

In getAllVolumes, if all images are created from the same template, the hot generator expression is equivalent to this:

for metaPath in volMetaPaths:
    if len(tuple(vPath for vPath in volMetaPaths if imgUUID in vPath)) > 1:
        # Add template additional image
        volumes[volUUID]['imgs'].append(imgUUID)
    else:
        # Insert at head the template self image
        volumes[volUUID]['imgs'].insert(0, imgUUID)

Eldad:
- How many disks do we have on the same storage domain?
- Are all disks created from the same template?
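To make the O(N^2) pattern concrete, here is a hedged sketch of one way to avoid it - this is only an illustration, not the actual patch tracked later for this issue. It assumes, for the sake of the example, that the image UUID is the parent directory name of each volume metadata path:

# Sketch only: count how many volume metadata paths belong to each image in a
# single pass, so the per-volume "is this template shared?" test becomes an
# O(1) dict lookup instead of rescanning volMetaPaths for every volume.
# Assumption (for illustration): the image UUID is the parent directory of
# each metadata path.
import os
from collections import Counter

def build_image_path_counts(volMetaPaths):
    return Counter(os.path.basename(os.path.dirname(p)) for p in volMetaPaths)

# Later, inside the per-volume loop, instead of the quadratic genexpr:
#     if image_counts[imgUUID] > 1:
#         volumes[volUUID]['imgs'].append(imgUUID)     # template's additional image
#     else:
#         volumes[volUUID]['imgs'].insert(0, imgUUID)  # template self image first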
Yep
Created attachment 976456 [details] 20 vms CPU TIME
Created attachment 976457 [details] 64 vms CPU time
Created attachment 976458 [details] 20 vms WALL TIME
Created attachment 976459 [details] 64 vms WALL TIME
Adding some profiling results. There are two types of profiling: CPU and WALL time. In the WALL time profile, some wait/sleep methods can be seen, which probably cause trouble.
[Tentatively targeting 3.5.1 - once the patch is up for master, let's re-discuss whether it's worth the risk/effort for 3.5.1/2/3].
Eldad, can you test this patch on one of your hosts? I would like to see profiling results with this patch - I expect that getAllVolumes will disappear from the profile, and we will have less noise when checking why getAllVmStats is so slow.
Looking at the 20 VM profile I can say:
- Nir's getAllVolumes issue
- XML filtering in the guest agent is very inefficient (per character; see the sketch below)
- MOM XML parsing eats a lot
- no particular problem with getAllVmStats itself; threading synchronization seems to take its toll even at 20 VMs
- JSON-RPC parsing takes some time, but that's understandable

Nir, can you please help me understand the numaUtils.py:22(<module>) entry?
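On the per-character guest agent filtering mentioned above: for illustration only, a hedged sketch of how a single compiled regex pass could replace a per-character generator expression. The character ranges are the XML 1.0 valid set; this is not the actual fix (that is tracked separately below), just an example of the idea:

# Sketch only: drop invalid XML 1.0 characters in one regex pass instead of
# joining a per-character generator expression.
import re

_INVALID_XML_CHARS = re.compile(u'[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]')

def filter_xml_chars(u):
    # replace each invalid character with a placeholder in a single pass
    return _INVALID_XML_CHARS.sub(u'?', u)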
Identified an issue with _getNumaStats, which does XML parsing via superVdsm every single time we do a stats get() (similar to a balloonInfo problem not that long ago). Still, I'm curious why it is showing up like that in the profiling data :)
(In reply to Michal Skrivanek from comment #26)
> identified an issue with _getNumaStats which does xml parsing via superVdsm
> every single time we do stats get() (similar to a balloonInfo problem not
> that long time ago). Still I'm curious why is it showing like that in the
> profiling data:)

The "numaUtils.py:22(<module>)" entry may be a bug in yappi or the pstats module. We had similar issues on Python 2.6, see https://code.google.com/p/yappi/issues/detail?id=45

The best way to handle this is to profile also on Python 2.7, where we may see correct output.
Checking the scale setup with 64 vms and patch http://gerrit.ovirt.org/36593, we can see that vdsm is completely overloaded. In this state, running 150-175 threads, anything will be barely usable. We see vdsm cpu usage typically at 1000-1500% on 24-core machines, which consumes about 50% of system cpu time. With non-idle vms, vdsm response time will be much worse. getAllVmStats becoming slow is just a side effect of this overloading.

To continue with this, please do:

1. Find the number of vms that keeps vdsm using about 75% cpu time; looking at a machine running 20 vms, this should be around 40 vms.

2. Do a cpu and a wall time profile under this load for the same duration (e.g. 30 minutes). You can script this:

service vdsmd start
sleep 1800
service vdsmd stop
mv /var/run/vdsm/vdsmd.prof vdsm-test-name.prof

3. Upload both profiles here, or maybe open a new bug for each issue found, as the getAllVolumes issue is resolved.

Repeat this test with:
- rhev 3.4.4 - to detect regressions since the last release
- rhev 3.5 using xmlrpc - to detect regressions related to jsonrpc

Note that all profiles must use the same duration and workload, so we can compare them, for example: 40 vms, 30 minutes, rhev-3.4, rhev-3.5-jsonrpc, rhev-3.5-xmlrpc.

Please keep the needinfo flag until we have these profiles.
Created attachment 976919 [details] 82 vms CPU
Created attachment 976920 [details] 82 vms WALL
(In reply to Eldad Marciano from comment #29)
> Created attachment 976919 [details]
> 82 vms CPU

Is this with the NUMA stats commented out and cleaned MOM policies? What was the overall system load and vdsm CPU usage?
(In reply to Eldad Marciano from comment #29)
> Created attachment 976919 [details]
> 82 vms CPU

Eldad, did you read comment 28? Please keep the needinfo flag until we have *all* the information requested.
(In reply to Michal Skrivanek from comment #26)
> identified an issue with _getNumaStats which does xml parsing via superVdsm
> every single time we do stats get() (similar to a balloonInfo problem not
> that long time ago). Still I'm curious why is it showing like that in the
> profiling data:)

Indeed the NUMA code is, let's say, very near the top of the chart of code in need of cleanup and optimization (and yes, it is mostly my fault that it was accepted in its current form). I also wonder why it only started to show up in the profiles so late.

As a short-term fix, which I believe is also feasible for 3.5.x, I crafted http://gerrit.ovirt.org/#/c/36626/
Just to clarify the noise in here: this bug is about the getAllVmStats critical path, which means the only relevant issue for this flow now looks like the inefficient MOM (_getNumaStats) part. Eldad, we need to see whether getAllVmStats improves when _getNumaStats is commented out. Ignore the other issues for now; they will require separate bugs.
The guest agent filtering is now being tracked as bug 1179696. It is not relevant to the getAllVmStats responsiveness; also, there are no known reports from the field and the code is unchanged since about 3.3.z.
By commenting out the NUMA stats invocation, the results are the same for 82 running vms (the following patch http://gerrit.ovirt.org/#/c/36626/ was applied too):

Wed Jan 7 14:27:52 2015    vdsmd_82vms_numacommented_fileSD_numUtils_patched_cpu.prof

49280718 function calls (51228879 primitive calls) in 17517.372 CPU seconds

Ordered by: internal time
List reduced from 2772 to 1000 due to restriction <1000>

     ncalls   tottime  percall   cumtime  percall filename:lineno(function)
19224/19358  1789.695    0.093  6271.835    0.324 threading.py:481(Thread.run)
    1890704  1734.238    0.001  1763.005    0.001 numaUtils.py:22(<module>)
      41892  1341.277    0.032  3766.584    0.090 vm.py:2849(Vm._getRunningVmStats)
2116685/2116903  590.606    0.000  2356.711    0.001 guestagent.py:80(<genexpr>)
17556/581334  534.025    0.030  3128.514    0.005 guestagent.py:87(filt)
     908550   451.645    0.000   451.645    0.000 utils.py:425(convertToStr)
17879/34159   436.846    0.024   982.969    0.029 decoder.py:162(JSONObject)
     149859   422.248    0.003   847.468    0.006 expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
9570/612060   268.384    0.028   358.737    0.001 minidom.py:305(_get_elements_by_tagName_helper)
      41892   267.465    0.006   347.702    0.008 vm.py:2935(Vm._getGraphicsStats)
Created attachment 977345 [details] 82 vms numa invoke commented
(In reply to Eldad Marciano from comment #36)
> By commenting out the NUMA stats invocation, the results are the same for
> 82 running vms (the following patch http://gerrit.ovirt.org/#/c/36626/ was
> applied too):
>
> Wed Jan 7 14:27:52 2015
> vdsmd_82vms_numacommented_fileSD_numUtils_patched_cpu.prof

This means that, as Nir pointed out in https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c27, this is probably a bug in the profiler.

Further evidence supporting this theory: the code in vdsm/numaUtils.py runs when getVmNumaNodeRuntimeInfo is invoked - which is what getAllVmStats eventually does. However, the second line of code in that function is

    if 'guestNumaNodes' in vm.conf:

and this check covers the whole function. So, unless Engine sends an explicit NUMA configuration, the cost of that function is negligible - it returns just an empty dict(). That configuration can only be sent when the VM is created.

I ran a quick grep on the affected host, and that configuration tunable is NOT present. I grep'd all the available logs, not only one, but Eldad, please confirm whether I'm wrong.

If the above is correct, the bottleneck is elsewhere. However, my gut feeling is that VDSM is simply overloaded, and to properly solve that we need deep changes (e.g. thread reduction, more C speedups).
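For reference, a simplified sketch (not the exact vdsm code) of the guard described above; _collect_numa_runtime_info is a hypothetical placeholder for the expensive supervdsm/XML path:

def getVmNumaNodeRuntimeInfo(vm):
    # Cheap no-op unless the engine sent an explicit NUMA topology when the
    # VM was created; only then is the expensive collection path taken.
    result = {}
    if 'guestNumaNodes' in vm.conf:
        result = _collect_numa_runtime_info(vm)  # hypothetical helper, costly path
    return result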
Hm, then the question is whether we can trust the profiling data at all... Nir suggested Python 2.7; maybe we can check whether we get different results?
The getAllVolumes issue is now tracked in bug 1179950.
(In reply to Nir Soffer from comment #40)
> The getAllVolumes issue is now tracked in bug 1179950.

Can you also please change the bug's title to reflect this?
If we assume a profiler problem in the NUMA-related code, then the next CPU bottleneck seems to be the guest agent filtering, so it makes sense to give it a try with the guest agent disabled inside the VMs. It's not relevant in the getAllVmStats call itself, but it can/should bring down the CPU usage significantly... so it would be worth reviewing the overall host load and vdsm total CPU usage then, and analyzing the profile further if the responsiveness is still bad.
Gil, I don't believe this is a blocker. It is not a recent regression, rather a scaling issue we've developed over time.
Michael, I've asked Eldad to try the same flow with 3.4, to make sure it is not a performance degradation. If we don't see a degradation when comparing 3.5 to 3.4, I'm fine with removing the blocker flag.
Doron, we need someone from SLA to start/keep looking at the NUMA issue. It turns out it may not be the actual problem on Eldad's setup, as NUMA is not in use (hinting at a problem with the profiler), but this will most certainly be a big issue once NUMA is enabled. So the relevant code needs to change ASAP.
Investigating 3.4.4 (av14), a regression was found.

Both 3.4 and 3.5 were tested under the following conditions:
- 82 vms
- 1800 sec CPU profiling
- same storage
- same hardware

3.4 vdsmd CPU% is ~500% (peaks 850%).
3.5 vdsmd CPU% is ~1600% (peaks 2100%).

Adding the profiler results.
Created attachment 978742 [details] 3.4 av14 82 vms profile
Based on comment #47, I suggest we move this BZ back to 3.5 GA and provide a fix for the performance degradation.
(In reply to Eldad Marciano from comment #47)
> Investigating 3.4.4 (av14), a regression was found.
>
> Both 3.4 and 3.5 were tested under the following conditions:
> - 82 vms
> - 1800 sec CPU profiling
> - same storage
> - same hardware
>
> 3.4 vdsmd CPU% is ~500% (peaks 850%).
> 3.5 vdsmd CPU% is ~1600% (peaks 2100%).
>
> Adding the profiler results.

Eldad, interesting results, but please read comment 28 and comment 45 again. What you tested is not a supported usage of vdsm; it is overloaded in both 3.4.4 and 3.5. For useful profiles, we need a profile where vdsm average cpu usage is less than 100%. For 3.5, we need both xmlrpc and jsonrpc profiles, to detect whether this is a jsonrpc-related issue.
(In reply to Eldad Marciano from comment #47)
> Investigating 3.4.4 (av14), a regression was found.
>
> Both 3.4 and 3.5 were tested under the following conditions:
> - 82 vms
> - 1800 sec CPU profiling
> - same storage
> - same hardware
>
> 3.4 vdsmd CPU% is ~500% (peaks 850%).
> 3.5 vdsmd CPU% is ~1600% (peaks 2100%).
>
> Adding the profiler results.

In addition to what Nir pointed out in https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c50: 3.5 vdsmd is running with NUMA disabled (code commented out), right?
adding back needinfo from comment #46
(In reply to Michal Skrivanek from comment #52)
> adding back needinfo from comment #46

Following one of the profiles:

$ profile-stats -c vdsmd_20vms_CPUTIME.prof
Mon Jan 5 21:44:50 2015    vdsmd_20vms_CPUTIME.prof

186697006 function calls (190034885 primitive calls) in 5019.501 seconds

Ordered by: internal time
List reduced from 2629 to 20 due to restriction <20>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     214840  866.080    0.004  866.080    0.004 fileSD.py:444(<genexpr>)
309450/309987  233.689    0.001  283.696    0.001 spark.py:211(Parser.buildState)
    2298459  190.133    0.000  203.422    0.000 numaUtils.py:22(<module>)

We can use the callers sections to tell who is calling these functions:

Ordered by: internal time
List reduced from 2629 to 20 due to restriction <20>

Function                          was called by...
                                      ncalls   tottime  cumtime
fileSD.py:444(<genexpr>)          <-  214840   866.080  866.080  fileSD.py:414(NfsStorageDomain.getAllVolumes)
spark.py:211(Parser.buildState)   <-  309450   232.895  283.696  spark.py:187(Parser.parse)
                                       537/0     0.793    4.649  spark.py:211(Parser.buildState)
numaUtils.py:22(<module>)         <- 2298458   190.133  203.422  guestagent.py:80(<genexpr>)
                                           1     0.000    0.000  vm.py:23(<module>)

This is guestagent.py:80(<genexpr>) - looking at line 80:

 80    return ''.join(filterXmlChar(c) for c in u)

This is part of guestagent._filterXMLChars(). Sorting by cumulative time, we can see that this is the guest agent channel listening code:

$ profile-stats -c -r50 -scumulative vdsmd_20vms_CPUTIME.prof
Mon Jan 5 21:44:50 2015    vdsmd_20vms_CPUTIME.prof

186697006 function calls (190034885 primitive calls) in 5019.501 seconds

Ordered by: cumulative time
List reduced from 2629 to 50 due to restriction <50>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
<snip>
      35457    6.519    0.000  566.485    0.016 guestagent.py:442(GuestAgent._onChannelRead)
19288/19290    5.414    0.000  557.347    0.029 guestagent.py:419(GuestAgent._handleData)
21218/21224    1.984    0.000  550.738    0.026 guestagent.py:411(GuestAgent._processMessage)
      21224    2.141    0.000  531.987    0.025 guestagent.py:461(GuestAgent._parseLine)
      75550    8.230    0.000  493.572    0.007 vm.py:2804(Vm.getStats)
14160/34416    0.268    0.000  488.231    0.014 Parser.py:423(Evaluator.c_if)
      80648   11.704    0.000  435.757    0.005 spark.py:187(Parser.parse)
      19356    6.237    0.000  434.263    0.022 __init__.py:461(JsonRpcServer._serveRequest)
      75550  132.304    0.002  420.172    0.006 vm.py:2849(Vm._getRunningVmStats)
20236/26978    1.953    0.000  415.520    0.015 Parser.py:389(Evaluator.c_let)
      19356    4.012    0.000  363.549    0.019 Bridge.py:249(DynamicBridge._dynamicMethod)
      21224    1.101    0.000  362.481    0.017 guestagent.py:83(_filterObject)
21224/709226   54.907    0.003  361.380    0.001 guestagent.py:87(filt)
     275340   21.559    0.000  298.404    0.001 guestagent.py:47(_filterXmlChars)
130234/130241    4.935    0.000  287.784    0.002 __init__.py:1034(Logger.debug)
645047/645119    9.435    0.000  286.679    0.000 __builtin__:0(<method 'join' of 'str' objects>)
309450/309987  233.689    0.001  283.696    0.001 spark.py:211(Parser.buildState)
     103302    6.133    0.000  283.099    0.003 __init__.py:1155(Logger._log)
2573777/2573798   65.788    0.000  269.210    0.000 guestagent.py:80(<genexpr>)

Looks like the numaUtils line is a bug in the profiler; probably generator expressions are not recorded correctly.
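As a side note on why <genexpr> rows can be misleading: a generator expression is a separate code object, and profilers attribute its time to the line where the expression is defined rather than to the caller consuming it. The tiny self-contained example below (not vdsm code, and it does not reproduce the suspected yappi misattribution itself) shows the effect with cProfile:

# Toy example: the genexpr's time is reported against build_filtered()'s line,
# not against consumer(), which can make profile output confusing.
import cProfile
import pstats

def build_filtered(text):
    return ''.join(c for c in text if c.isalnum())   # <genexpr> lives on this line

def consumer():
    return build_filtered("abc123!@#" * 100000)

profiler = cProfile.Profile()
profiler.enable()
consumer()
profiler.disable()
pstats.Stats(profiler).sort_stats("time").print_stats(5)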
(In reply to Nir Soffer from comment #53)
> Looks like the numaUtils line is a bug in the profiler; probably
> generator expressions are not recorded correctly.

Makes sense. In a former test the numaUtils code was disabled by commenting it out entirely, and VDSM stats (cpu usage) didn't improve significantly.
Created attachment 979147 [details] 25 vms low cpu version 3_4_4
(In reply to Eldad Marciano from comment #55)
> Created attachment 979147 [details]
> 25 vms low cpu version 3_4_4

Would you please always add rough CPU/load usage during the test? Also, I don't see the guest agent filtering in this test - was it there and working?
82 vms is irrelevant due to excessive overcommitment of vcpus/cores. The load on the system is ~60, but it is still working, so no real issue there. I want to see at most 20-25 vms and compare the total cpu usage of vdsm, under the same conditions, to be able to tell anything.
Created attachment 979602 [details] 25 vms pure xml rpc version 3.5
Created attachment 979614 [details] 25 vms pure json rpc version 3.5
Nir, regarding the json vs xml rpc comparison, I have added the profiler results. By a first, short analysis, the following modules are the heaviest (both xml and json):

dispatcher.py
HSM.prepareImage
API.py:355(VM.getStats)
clientIF.py:267(clientIF.prepareVolumePath)
logUtils.py:40(wrapper)

I think our next step is to tackle each one of those and easily raise the number of running vms.
On the affected host something's not looking right - I mean besides the VDSM slowness :)
Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from master: 100 VMs without GA (nor OS installed) - but only because of memory constraints: if I create VMs with enough RAM to be able to run an OS with GA, I can't run enough of them to stress the box, so it is a tradeoff. If needed, I can run fewer VMs, with GA and with 3.5.0 VDSM.

top - 14:33:04 up 1:03, 1 user, load average: 0.30, 1.41, 2.27
Tasks: 593 total, 2 running, 590 sleeping, 0 stopped, 1 zombie
Cpu(s): 10.0%us, 16.3%sy, 0.0%ni, 73.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16215968k total, 3376340k used, 12839628k free, 101440k buffers
Swap: 8175612k total, 0k used, 8175612k free, 364252k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 3679 vdsm      0 -20 6732m 214m 9892 S 94.9  1.4  4:23.01 vdsm
 2445 root     20   0 1145m  20m 5532 R 44.8  0.1  6:32.71 libvirtd

We know that VDSM eats CPU and burns libvirt: this very data is further confirmation. But the numbers on the affected box seem too bad to be true.
(In reply to Francesco Romani from comment #61)
> On the affected host something's not looking right. I mean besides VDSM
> slowness :)
> Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from
> master:

TOji> 14:33:04 root [~]$ rpm -qa | grep libvirt
libvirt-0.10.2-46.el6_6.2.x86_64

TOji> 14:37:24 root [~]$ rpm -qa | grep vdsm
vdsm-jsonrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-python-zombiereaper-4.16.8-8.gitf16901f.el6.noarch
vdsm-debuginfo-4.17.0-166.git773c9db.el6.x86_64
vdsm-cli-4.17.0-166.git773c9db.el6.noarch
vdsm-python-4.17.0-166.git773c9db.el6.noarch
vdsm-infra-4.17.0-166.git773c9db.el6.noarch
vdsm-yajsonrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-xmlrpc-4.17.0-166.git773c9db.el6.noarch
vdsm-4.17.0-166.git773c9db.el6.x86_64

TOji> 14:37:29 root [~]$ rpm -qa | grep kernel
kernel-2.6.32-431.23.3.el6.x86_64

TOji> 14:37:34 root [~]$ rpm -qa | grep qemu
qemu-img-rhev-0.12.1.2-2.415.el6_5.14.x86_64
gpxe-roms-qemu-0.9.7-6.12.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.415.el6_5.14.x86_64
(In reply to Francesco Romani from comment #61)
> On the affected host something's not looking right. I mean besides VDSM
> slowness :)
> Here's a sample from a box of mine running RHEL 6.6 and VDSM 4.17 from
...
> We know that VDSM eats CPU and burns libvirt: this very data is further
> confirmation. But the numbers on the affected box seem too bad to be true.

You are testing block storage with 100 disks (not based on templates?), but Eldad is using NFS with 3000 disks based on a template (6000 volumes).
(In reply to Eldad Marciano from comment #59)
> Created attachment 979614 [details]
> 25 vms pure json rpc version 3.5

Thanks Eldad!
(In reply to Eldad Marciano from comment #58)
> Created attachment 979602 [details]
> 25 vms pure xml rpc version 3.5

Eldad, can you tell us how long each profile was running, and what was the average cpu usage for each version? We want to see not only which function is at the top of the profile, but also how many calls were made in each variant.
Both tests ran for ~900 sec.
The cpu average was ~5-10%.
(In reply to Eldad Marciano from comment #66)
> Both tests ran for ~900 sec.
> The cpu average was ~5-10%.

And it was all working +- ok, right? So this is not really a blocker, nor a regression between xmlrpc and jsonrpc.

You didn't add the average cpu usage on 3.4, but I would guess, based on the profiles, that it was in the same ballpark as 3.5...

(Retargeting to 3.5.1, as internal testing and examination of the running system doesn't show anything horrendous.)
Another interesting element: looking at the profile analysis, logUtils hits the performance hard.

task.py:1155(Task.prepare)                  <- 159/158  0.043  387.273  task.py:99(Task.wrapper)
task.py:869(Task._run)                      <- 159/158  0.006  383.882  task.py:1155(Task.prepare)
logUtils.py:40(wrapper)                     <- 159/158  0.089  383.862  task.py:869(Task._run)
vm.py:2339(Vm.preparePaths)                 <- 16/15    0.000  379.952  clientIF.py:491(clientIF._recoverVm)
clientIF.py:267(clientIF.prepareVolumePath) <- 48/47    0.003  379.950  vm.py:2339(Vm.preparePaths)
hsm.py:3188(HSM.prepareImage)               <- 16/15    0.000  379.709  logUtils.py:40(wrapper)
(In reply to Eldad Marciano from comment #60)
> Nir, regarding the json vs xml rpc comparison, I have added the profiler
> results.
> By a first, short analysis, the following modules are the heaviest (both
> xml and json):
>
> dispatcher.py
> HSM.prepareImage
> API.py:355(VM.getStats)

Here are some patches for review to improve getStats/getAllVmStats performance. These patches are intentionally kept small to make them backportable to the stable branch. We all know that we'll most likely need much deeper changes to improve significantly.

This improves Vm._getGraphicsStats a little: http://gerrit.ovirt.org/#/c/36791/
It is a straightforward and quite safe fix, albeit of little help - this method is very low in the profile.

getAllVmStats seems suspicious; here's an (admittedly debatable) attempt to make it faster: http://gerrit.ovirt.org/#/c/35140/1

Speaking of Vm._getRunningVmStats(), the problem here is that the code _has_ to do some expensive translation/sanitization. This is also the source of most (if not all) of the calls to convertToStr. Fixing this is feasible, but we'll probably need a microbenchmark/localized test suite, and I'm not sure it is OK for the stable branch - not anytime soon, anyway.

All that said, I'm still not convinced about the existence of the _major_ performance regression that this BZ claims.
(In reply to Michal Skrivanek from comment #67)
> (In reply to Eldad Marciano from comment #66)
> > Both tests ran for ~900 sec.
> > The cpu average was ~5-10%.
>
> And it was all working +- ok, right? So this is not really a blocker, nor a
> regression between xmlrpc and jsonrpc.
>
> You didn't add the average cpu usage on 3.4, but I would guess, based on the
> profiles, that it was in the same ballpark as 3.5...
>
> (Retargeting to 3.5.1, as internal testing and examination of the running
> system doesn't show anything horrendous.)

I asked Eldad to do new 3.5 profiles for 1800 seconds, so we can compare with the 3.4 profile.

Let's wait until we have enough data.
Created attachment 979928 [details] 25 vms 1800 sec 3_5 jsonrpc
Created attachment 979929 [details] 25 vms 1800 sec 3_4
Finally an apples-to-apples comparison. Please add a high-level observation of system load and vdsm CPU usage.

From the profiles it looks like there's no real difference between 3.4 and 3.5.
opened follow up bugs on SLA issues around MOM(bug 1182097) and NUMA (bug 1182094)
The cpu average was ~5-10% for both. But a difference and a regression can be seen when running vms at large scale:
https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c47
(In reply to Nir Soffer from comment #71)
> (In reply to Michal Skrivanek from comment #67)
> > (In reply to Eldad Marciano from comment #66)
> > > Both tests ran for ~900 sec.
> > > The cpu average was ~5-10%.
> >
> > And it was all working +- ok, right? So this is not really a blocker, nor
> > a regression between xmlrpc and jsonrpc.
> >
> > You didn't add the average cpu usage on 3.4, but I would guess, based on
> > the profiles, that it was in the same ballpark as 3.5...
> >
> > (Retargeting to 3.5.1, as internal testing and examination of the running
> > system doesn't show anything horrendous.)
>
> I asked Eldad to do new 3.5 profiles for 1800 seconds, so we can compare
> with the 3.4 profile.
>
> Let's wait until we have enough data.

When ignoring the calls that come from 'task.py Task._run', the results are much better.
(In reply to Eldad Marciano from comment #76)
> The cpu average was ~5-10% for both.

Then we're good ;-) Decreasing severity and removing the zstream request.

> But a difference and a regression can be seen when running vms at large scale.

As I said before, that's irrelevant on such a small host. Please get a bigger one, maybe Yuri's setup or something... but you should not have more vcpus than the physical ones.
(In reply to Michal Skrivanek from comment #74)
> Finally an apples-to-apples comparison. Please add a high-level observation
> of system load and vdsm CPU usage.
>
> From the profiles it looks like there's no real difference between 3.4 and
> 3.5.

There is a difference - assuming that both ran for the same time (1800 seconds) and processed the same amount of requests from engine, the 3.5 version is using 25% more cpu time.

Fri Jan 16 23:55:00 2015    vdsmd_25vms_3_4_1800sec_cpu.prof

16047048 function calls (17034423 primitive calls) in 1530.915 seconds

Ordered by: internal time
List reduced from 1945 to 10 due to restriction <10>

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 275100  512.984    0.002  512.984    0.002 fileSD.py:448(<genexpr>)
     25  121.375    8.670  121.499    4.860 sampling.py:358(VmStatsThread.run)
     44   67.283    4.205  145.745    3.312 threading.py:481(Thread.run)
 190183   40.683    0.014   40.689    0.000 minidom.py:305(_get_elements_by_tagName_helper)
  48659   40.489    0.001  108.048    0.002 expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
  83826   32.884    0.000   35.018    0.000 StringIO.py:208(StringIO.write)
  83943   23.057    0.000   23.207    0.000 minidom.py:349(Attr.__init__)
 117222   22.567    0.000   25.333    0.000 minidom.py:281(Document._append_child)
  62705   22.006    0.000   24.930    0.000 spark.py:103(Parser.addRule)
  68482   21.836    0.000   37.521    0.001 expatbuilder.py:274(ExpatBuilderNS.character_data_handler_cdata)

Fri Jan 16 23:54:48 2015    vdsmd_25vms_3_5_jsonrpc_1800sec_cpu.prof

80952438 function calls (82245529 primitive calls) in 1921.656 seconds

Ordered by: internal time
List reduced from 2637 to 10 due to restriction <10>

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 275075  611.045    0.002  611.045    0.002 fileSD.py:444(<genexpr>)
  36131   56.048    0.002  180.522    0.005 vm.py:2849(Vm._getRunningVmStats)
1673681   41.673    0.000   65.290    0.000 protocoldetector.py:94(MultiProtocolAcceptor._process_events)
 157109   40.947    0.000   91.926    0.001 spark.py:211(Parser.buildState)
 201940   32.269    0.000   38.662    0.000 spark.py:103(Parser.addRule)
  40388   24.992    0.001   29.322    0.001 spark.py:146(Parser.makeFIRST)
  16890   23.803    0.001   52.007    0.003 inspect.py:247(getmembers)
 926097   21.156    0.000   21.156    0.000 utils.py:425(convertToStr)
  80776   19.588    0.000   25.824    0.000 spark.py:28(_namelist)
1712725   19.368    0.000   19.368    0.000 __builtin__:0(poll)

1. FileStorageDomain.getAllVolumes
----------------------------------

fileSD.py:444(<genexpr>) is called by FileStorageDomain.getAllVolumes, which takes about 32% of cpu time. This time is spent only when starting and recovering vms. This issue is tracked in bug 1179950 for 3.5.1.

2. VM.getStats
--------------

vm.py:2849(Vm._getRunningVmStats) takes 9.3% of cpu time, and we don't see related code at the top of the 3.4 profile. If we follow the function calling this in 3.5, we see that it is called by API.py:(VM.getStats).

Comparing 3.4 and 3.5 results for this function:

3.4:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  19042    1.360    0.000  210.361    0.011 API.py:349(VM.getStats)

3.5:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  36131    4.779    0.000  271.401    0.008 API.py:355(VM.getStats)

3.5 is spending more time in this function, and calling it much more.
Looking in the callers:

3.4:

 ncalls  tottime  cumtime
   2400    0.108   15.787  BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
   8321    0.669   99.654  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
   8321    0.583   94.921  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)

3.5:

 ncalls  tottime  cumtime
   1976    0.130    8.933  API.py:1274(Global.getAllVmStats)
   7950    0.802   43.623  Bridge.py:249(DynamicBridge._dynamicMethod)
   8727    1.335   75.072  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
   8727    1.257   72.633  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
   8727    1.245   71.141  vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo)

vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo) is responsible for the difference - this is the cost of adding this feature.

Bridge.py:249(DynamicBridge._dynamicMethod) (jsonrpc) is calling this function 7950 times, while in 3.4 BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats) calls it only 2400 times - we should understand why.

3. MultiProtocolAcceptor._process_events
----------------------------------------

protocoldetector.py:94(MultiProtocolAcceptor._process_events) takes 3.3% of the cpu time in 3.5 - this is already fixed in 3.5.1 - see bug 1177533.

4. MOM
------

spark.py:211(Parser.buildState) is taking much more time on 3.5 - it is called much more and consumes 40 seconds *in* the function in 3.5, but 0 seconds in 3.4 - looks like a bad change in this function.

3.4:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  47369    0.000    0.000   52.608    0.001 spark.py:211(Parser.buildState)

3.5:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 157109   40.947    0.000   91.926    0.001 spark.py:211(Parser.buildState)

I suggest repeating this profiling on a much bigger machine so we can check the handling of many more vms.
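The caller breakdowns above can be reproduced with the standard pstats module as well - a small sketch, using the profile file names from the headers shown above:

# Sketch: pull the VM.getStats numbers and their callers out of the two
# profiles compared above (file names as in the profile headers).
import pstats

for prof in ("vdsmd_25vms_3_4_1800sec_cpu.prof",
             "vdsmd_25vms_3_5_jsonrpc_1800sec_cpu.prof"):
    stats = pstats.Stats(prof)
    stats.sort_stats("cumulative")
    stats.print_stats("API.py", 5)      # top API.py entries
    stats.print_callers("getStats")     # who calls VM.getStats, and how often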
Created attachment 981166 [details] vdsmd_82vms_pure_json_1800sec_cpu
(In reply to Nir Soffer from comment #81)
> 2. VM.getStats
> --------------
>
> vm.py:2849(Vm._getRunningVmStats) takes 9.3% of cpu time, and we don't see
> related code at the top of the 3.4 profile. If we follow the function
> calling this in 3.5, we see that it is called by API.py:(VM.getStats).
>
> Comparing 3.4 and 3.5 results for this function:
>
> 3.4:
>
>  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   19042    1.360    0.000  210.361    0.011 API.py:349(VM.getStats)
>
> 3.5:
>
>  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>   36131    4.779    0.000  271.401    0.008 API.py:355(VM.getStats)
>
> 3.5 is spending more time in this function, and calling it much more.
>
> Looking in the callers:
>
> 3.4:
>
>  ncalls  tottime  cumtime
>    2400    0.108   15.787  BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
>    8321    0.669   99.654  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
>    8321    0.583   94.921  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
>
> 3.5:
>
>  ncalls  tottime  cumtime
>    1976    0.130    8.933  API.py:1274(Global.getAllVmStats)
>    7950    0.802   43.623  Bridge.py:249(DynamicBridge._dynamicMethod)
>    8727    1.335   75.072  vdsmInterface.py:87(vdsmInterface.getVmMemoryStats)
>    8727    1.257   72.633  vdsmInterface.py:142(vdsmInterface.getVmBalloonInfo)
>    8727    1.245   71.141  vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo)
>
> vdsmInterface.py:166(vdsmInterface.getVmCpuTuneInfo) is responsible for the
> difference - this is the cost of adding this feature.
>
> Bridge.py:249(DynamicBridge._dynamicMethod) (jsonrpc) is calling this
> function 7950 times, while in 3.4 BindingXMLRPC.py:408(BindingXMLRPC.vmGetStats)
> calls it only 2400 times - we should understand why.

Note that vdsmInterface is MOM; the only part in vdsm calling getStats is actually more efficient in 3.5. getAllVmStats should be called exactly 900 times (every 15s); since it's not, it's either the monitoring script or MOM causing the extra calls.
Eldad, as I asked already: can we disable the MOM part (by removing the rules) to see the decrease in the parsing, and possibly in the calls to getStats? It should significantly lower the usage.
https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c36
(In reply to Eldad Marciano from comment #85)
> https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c36

I want to disable MOM, not NUMA. NUMA is already inactive on your system (and it would be interesting to see the effect when it is enabled).
Ignoring MOM dramatically reduces the CPU usage.
Created attachment 982953 [details] vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof
(In reply to Eldad Marciano from comment #91)
> Created attachment 982953 [details]
> vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

Unfortunately this is a different thing than advertised. It seems to have been running for 259 seconds and includes VM creation. Still, very helpful - I'm spinning off a new bug for host monitoring flaws ;-)
Eldad, since most of the child bugs are in, would it be possible to try to reproduce it in latest 3.6?
Sure thing. Will it be OK to test on top of these versions?
- engine 3.6.0-0.11.master.el6
- vdsm vdsm-4.17.3-1.el7ev.noarch
Yes, that's fine. Those fixes have been in for quite some time.
(pending results from scale run)
Yaniv, the storage side is tracked in bug 1179950; it was fixed a long time ago and backported to 3.5. Removing the storage patch to prevent confusion.
Eldad, any updates?
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
(In reply to Red Hat Bugzilla Rules Engine from comment #99)
> Bug tickets that are moved to testing must have target release set to make
> sure tester knows what to test. Please set the correct target release before
> moving to ON_QA.

Dear bot,
I'm sure the tester has enough information about where to test it. We have a Build ID field. I believe the status should have priority; messing with the status creates more chaos than a missing target release field.
(In reply to Michal Skrivanek from comment #100)
> (In reply to Red Hat Bugzilla Rules Engine from comment #99)
> > Bug tickets that are moved to testing must have target release set to make
> > sure tester knows what to test. Please set the correct target release before
> > moving to ON_QA.
>
> Dear bot,
> I'm sure the tester has enough information about where to test it. We have a
> Build ID field. I believe the status should have priority; messing with the
> status creates more chaos than a missing target release field.

Build ID is going away soon for milestones and releases. We have had many bugs bounce back due to the wrong package being tested when no package version was given.
(In reply to Michal Skrivanek from comment #98)
> Eldad, any updates?

Not yet, we have to deal with other priorities; will update ASAP.
Verified on top of vdsm-4.17.12-0.el7ev.noarch (rhevm 3.6.2), using default settings + CPU affinity on.
Response time is ~1.2 sec:

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.268s
user    0m0.952s
sys     0m0.075s

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.276s
user    0m0.972s
sys     0m0.055s

[root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null

real    0m1.272s
user    0m0.945s
sys     0m0.041s
(In reply to Eldad Marciano from comment #103)
> Verified on top of vdsm-4.17.12-0.el7ev.noarch (rhevm 3.6.2), using default
> settings + CPU affinity on.
> Response time is ~1.2 sec:
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.268s
> user    0m0.952s
> sys     0m0.075s
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.276s
> user    0m0.972s
> sys     0m0.055s
>
> [root@host04-rack04 ~]# time vdsClient -s 0 getAllVmStats > /dev/null
>
> real    0m1.272s
> user    0m0.945s
> sys     0m0.041s

Did you test with at least 66 vms running?

Thanks,
Sven