Description of problem:
During REST API performance sanity test execution on RHEV-M 3.5 beta, critical performance degradation was detected compared to RHEV-M 3.4. An additional test without the JSON-RPC protocol showed better results, close to the RHEV-M 3.4 baseline results.

Summary comparison table (response times in ms; "JSON" = 3.5 with JSON-RPC, "NOT" = 3.5 without JSON-RPC):

                      Average                 90% Line                  Max
  Transaction       3.4    JSON    NOT      3.4    JSON    NOT      3.4    JSON    NOT
  Login to VM       312    1058    570      401    2318    807     2163    4456   2043
  VM Application     24      32     19       42      28     24      294     656    265
  VM Disks           27      26     24       38      33     31      163     150     99
  VM Nics            35      27     25       54      35     32      151     234     97
  VM Cdroms          23      19     18       41      24     23       58      43     71
  VM Snapshots        8      10      9       10      14     12      199     152     43
  VM Permissions      8      11     10       11      15     13      221      40     42
  VM Statistics      24      21     20       42      27     24      136      44     55
  VM Devices          8      10      9       11      14     12       71      42     81
  VM Shutdown       743   46221   1556     2249  147839   5145    11969  364358  15654
  VM Status          28      28     26       48      32     31      192    3571   1080
  VM Start         2206  106513   3617     5749  241912   8732    18389  599149  25040

Detailed test reports can be found on Mojo:
- RHEV-M 3.5 beta User Portal REST API's Performance Test Results - 50 VMs: https://mojo.redhat.com/docs/DOC-994846
- RHEV-M 3.5 beta User Portal REST API's Performance Test Results - 50 VMs - without JSON protocol: https://mojo.redhat.com/docs/DOC-995332

Version-Release number of selected component (if applicable):
RHEV-M: 3.5.0-0.14.beta
RHEL: 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.29.2.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.10
LIBVIRT Version: libvirt-0.10.2-29.el6_5.11
VDSM Version: vdsm-4.16.6-1.el6ev
Can you please provide JProfiler snapshots after each operation that you performed? I would like to identify bottlenecks. From the REST API perspective, jsonrpc emulates synchronous behavior, so there should be little or no difference.
This test is a 50-concurrent-thread load. Please take a look at the detailed test reports on Mojo. What do you mean by each operation? How do you think I can do it? Yuri
I already looked at your detailed reports, but I would like to understand how different parts of the code behave during/after the test, to identify bottlenecks. Can you please provide a few JProfiler snapshots?
I can create a JProfiler snapshot while the full test is running. I can decrease the number of threads, let's say to 1 or 5 threads. Please confirm whether that is OK.
I confirm.
Status update: it seems like a DB issue. Still being examined by the scale team and infra.
Francesco, please take a look at the 2 operations from the table above, VM Shutdown & VM Start (vdsm profiling). The above results are the totals from the engine side; I'm wondering whether there is a problem (a big difference in vdsm) between using XML-RPC & JSON-RPC. It could be one call for each operation + profiling (in the same env), just to get a clearer picture.
This is very interesting information. From virt's perspective there is no difference between JSONRPC and XMLRPC, but surely there are bugs.

Timeouts/errors can cause spikes in the response time, so we need these patches in before repeating the tests, to have meaningful information:

http://gerrit.ovirt.org/#/c/34147/
http://gerrit.ovirt.org/#/c/34813/
and most likely also
http://gerrit.ovirt.org/#/c/34879/

Meanwhile, I'll set up some profiling to see where the time is spent when everything goes well, and whether there are some actual differences I'm not aware of.
I see that only http://gerrit.ovirt.org/#/c/34879/ hasn't been accepted yet (and all should be back-ported to 3.5). I'm leaving the needinfo to make sure we get the profiling info once it's ready.
(In reply to Francesco Romani from comment #8)
> This is very interesting information. From virt's perspective there is no
> difference between JSONRPC and XMLRPC, but surely there are bugs.
>
> Timeouts/errors can cause spikes in the response time, so we need these
> patches in before repeating the tests, to have meaningful information:
>
> http://gerrit.ovirt.org/#/c/34147/
> http://gerrit.ovirt.org/#/c/34813/
> and most likely also
> http://gerrit.ovirt.org/#/c/34879/
>
> Meanwhile, I'll set up some profiling to see where the time is spent when
> everything goes well, and whether there are some actual differences I'm not
> aware of.

The above are fixes for correctness. Here I'm adding performance improvement patches:

http://gerrit.ovirt.org/#/c/35212/
http://gerrit.ovirt.org/#/c/34989/
http://gerrit.ovirt.org/#/c/35024/

All of the above are simple and quite safe. Profiling is ongoing.
I managed to get a decent profiling of the create/shutdown flow. The problem is to minimize the noise introduced by sampling, without actually changing (too much of) the VDSM code. So I used a script like:

  #!/bin/bash
  # remove any stale profile, then profile 100 create/destroy cycles
  [ -f /var/run/vdsm/vdsmd.prof ] && rm -f /var/run/vdsm/vdsmd.prof
  vdsClient -s 0 startProfile &> /dev/null
  for i in `seq 1 100`; do
      <create VM> && <destroy VM>
  done
  vdsClient -s 0 stopProfile &> /dev/null
  mv /var/run/vdsm/vdsmd.prof $1.prof

and I checked VDSM master (at f40345d8e4) - the relevant patches are being backported.

Patches needed to run the profiles:
http://gerrit.ovirt.org/#/c/35016/
http://gerrit.ovirt.org/#/c/35181/
http://gerrit.ovirt.org/#/c/35292

Patches used in the measurement, in addition to master at the SHA1 above:
http://gerrit.ovirt.org/#/c/34989/

Results for XMLRPC:

  2192340 function calls (2229132 primitive calls) in 51.597 CPU seconds

  Ordered by: internal time
  List reduced from 1126 to 20 due to restriction <20>

       ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
            3   36.125   12.042   36.204   12.068  PolicyEngine.py:108(PolicyEngine.do_controls)
        24436    2.304    0.000    0.114    0.000  pthread.py:98(Lock.unlock)
        11650    2.042    0.000    2.156    0.000  __init__.py:423(Formatter.format)
           17    0.668    0.039    0.672    0.040  storage_mailbox.py:621(SPM_MailMonitor._handleRequests)
         7812    0.366    0.000    0.580    0.000  __init__.py:226(LogRecord.__init__)
        88446    0.309    0.000    0.420    0.000  StringIO.py:208(StringIO.write)
        21800    0.259    0.000    0.436    0.000  lvm.py:217(makeLV)
          100    0.248    0.002    1.650    0.016  lvm.py:411(LVMCache._reloadlvs)
          100    0.247    0.002    2.008    0.020  blockSD.py:146(_getVolsTree)
          100    0.239    0.002    0.243    0.002  lvm.py:446(<genexpr>)
        69912    0.230    0.000    0.280    0.000  string.py:511(replace)
     201/2502    0.211    0.001    0.827    0.000  xmlrpclib.py:765(Marshaller.dump_struct)
       107032    0.199    0.000    0.255    0.000  <string>:8(__new__)
    802/10024    0.193    0.000    1.048    0.000  vmxml.py:473(Element.__hacked_writexml)
          100    0.191    0.002    2.360    0.024  blockSD.py:1012(BlockStorageDomain.getAllVolumesImages)
        24436    0.178    0.000    0.292    0.000  pthreading.py:77(Lock.release)
        87200    0.163    0.000    0.638    0.000  codecs.py:347(StreamWriter.write)
         7812    0.162    0.000    1.632    0.000  __init__.py:1204(Logger.callHandlers)
         5586    0.134    0.000    0.262    0.000  expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
  22362/22363    0.131    0.000    1.231    0.000  protocoldetector.py:108(MultiProtocolAcceptor._process_events)

Results for JSONRPC:

  1788411 function calls (1813944 primitive calls) in 13.510 CPU seconds

  Ordered by: internal time
  List reduced from 1161 to 20 due to restriction <20>

       ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         8768    1.831    0.000    0.712    0.000  __init__.py:1142(Logger.makeRecord)
        10074    0.585    0.000    0.009    0.000  minidom.py:473(NamedNodeMap.__init__)
           15    0.521    0.035    0.525    0.035  storage_mailbox.py:621(SPM_MailMonitor._handleRequests)
  21829/22018    0.403    0.000    0.414    0.000  lvm.py:217(makeLV)
        89079    0.267    0.000    0.379    0.000  StringIO.py:208(StringIO.write)
          101    0.234    0.002    0.238    0.002  lvm.py:446(<genexpr>)
          101    0.225    0.002    1.614    0.016  lvm.py:411(LVMCache._reloadlvs)
         8312    0.201    0.000    0.159    0.000  __init__.py:1121(Logger.findCaller)
      322/326    0.195    0.001    9.402    0.029  threading.py:481(Thread.run)
         8768    0.190    0.000    1.956    0.000  __init__.py:1204(Logger.callHandlers)
        13802    0.185    0.000    0.569    0.000  __init__.py:423(Formatter.format)
  23951/23952    0.177    0.000    1.574    0.000  protocoldetector.py:108(MultiProtocolAcceptor._process_events)
        86587    0.172    0.000    0.595    0.000  codecs.py:347(StreamWriter.write)
          423    0.171    0.000    0.036    0.000  stompReactor.py:266(_AsyncoreEvent.set)
        13323    0.169    0.000    0.932    0.000  __init__.py:764(UserGroupEnforcingHandler.emit)
        13802    0.159    0.000    0.175    0.000  __init__.py:289(LogRecord.getMessage)
         8090    0.152    0.000    0.259    0.000  minidom.py:111(Document.appendChild)
        13800    0.151    0.000    1.767    0.000  __init__.py:666(UserGroupEnforcingHandler.handle)
        27175    0.150    0.000    0.150    0.000  pthread.py:95(RLock.lock)
        22795    0.148    0.000    0.291    0.000  minidom.py:299(_write_data)

There is still a bit too much noise in the profiles, but nothing suggests JSONRPC is slower than XMLRPC in VDSM. There are instead suggestions that the contrary is true. Of course, this holds with these patches applied:
http://gerrit.ovirt.org/#/c/35024/ (-> backport to 3.5: http://gerrit.ovirt.org/#/c/35290/)
http://gerrit.ovirt.org/#/c/35212/
Created attachment 958655 [details]
VDSM profiles

Content of the archive:
NS_C000_vdscli.json - VM definition to be used with vdsClient
NS_C000_jsonrpc.json - VM definition to be used with jsonrpc
NS_C000_xmlrpc.prof - profile gathered with 'startxml.sh'
NS_C000_jsonrpc.prof - profile gathered with 'startjson.sh'
startjson.sh - JSONRPC test runner
startxml.sh - XMLRPC test runner
jsonrpc - JSONRPC simple client
profile-stats - python script to inspect .prof files (like in vdsm/contrib)
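For reference, a minimal sketch of such an inspection with the standard library's pstats module (assuming the .prof files are in pstat format; the file name below is one from this archive; the bundled profile-stats script does the same kind of thing):

  # inspect_prof.py - print the top 20 entries by internal time,
  # matching the listings in the comment above
  import pstats

  stats = pstats.Stats('NS_C000_jsonrpc.prof')
  stats.strip_dirs()           # drop leading directory paths for readability
  stats.sort_stats('time')     # sort by tottime (internal time)
  stats.print_stats(20)        # top 20 rows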
The bug was verified on RHEV-M 3.5.22 and vdsm-4.16.7.6-1.el6ev. The performance problem with the JSON-RPC protocol compared to the RHEV-M 3.5.17 version has been resolved. Response times of the major REST APIs decreased dramatically (90% line response time):
- GET VM details: from 2318 to 615 ms
- POST VM Shutdown: from 147839 to 3040 ms
- POST VM Start: from 241912 to 3014 ms

But those values are still larger than what we measured in RHEV-M 3.4 (except for the POST VM Start REST API):
- GET VM details: 615 vs 401 ms (~30%)
- POST VM Shutdown: 3040 vs 2249 ms (~25%)

All processes are even and stable; no critical exceptions were detected. We cannot close this bug until we find a solution for the remaining performance degradation of the 2 REST APIs, GET VM Details and POST VM Shutdown.

Detailed report here: https://mojo.redhat.com/docs/DOC-1004044
Moving back to ASSIGNED based on comment #13. From a QE perspective, if 3.5 does not show any improvement in performance, we would at least expect to see the same results we had with 3.4.
Updated the bug title accordingly. In addition, I don't think this gap on two requests compared to 3.4 should be considered a blocker for the version.

Scott - please input on that.

Barak - should this be moved to virt, as JSONRPC isn't the root cause here and both requests are virt-related?
(In reply to Oved Ourfali from comment #15)
> Updated the bug title accordingly. In addition, I don't think this gap on
> two requests compared to 3.4 should be considered a blocker for the version.
>
> Scott - please input on that.
>
> Barak - should this be moved to virt, as JSONRPC isn't the root cause here
> and both requests are virt-related?

After discussing offline with Barak, I'm removing the blocker flag and targeting to 3.5.Z. Scott - leaving the needinfo on you, in case you think otherwise.
I'd be happy to review specific items, but IMHO we need to replay the same scenarios at the same speed, and I'm not sure this is the case here. I.e. when your test creates 50 VMs...does it try to do it "as fast as it can"? If so, then we can't look at it from the response-time perspective only... Can you point me to the specific cases, what they were doing, and how fast the parallel requests were sent? I suppose we can talk only about the problematic VM create, VM shutdown and VM login cases...
Please see the details in the report
https://mojo.redhat.com/docs/DOC-1004044
The test didn't create VMs. It is a load test. The same scenarios, the same hardware.
(In reply to Yuri Obshansky from comment #18)
> Please see the details in the report
> https://mojo.redhat.com/docs/DOC-1004044
> The test didn't create VMs.

I don't understand then what the Create VM and Shutdown VM calls were/are supposed to do. Are they not creating and shutting down VMs?

> It is a load test.

You can't consistently measure responses of synchronous actions in a load test, as the load directly affects the responses. So the faster you get in one call, the more load you create, slowing down other calls.

> The same scenarios, the same hardware.
(In reply to Michal Skrivanek from comment #20)
> (In reply to Yuri Obshansky from comment #18)
> > Please see the details in the report
> > https://mojo.redhat.com/docs/DOC-1004044
> > The test didn't create VMs.
>
> I don't understand then what the Create VM and Shutdown VM calls were/are
> supposed to do. Are they not creating and shutting down VMs?

They start and shut down VMs in a loop during the test run: 50 concurrent threads vs. 50 VMs, using user security credentials.

> > It is a load test.
>
> You can't consistently measure responses of synchronous actions in a load
> test, as the load directly affects the responses. So the faster you get in
> one call, the more load you create, slowing down other calls.

We are not talking about synchronous actions; we don't need them. JMeter measures each API call's time (from request to response) and calculates the min, average, 90% line, and max response times. This is a performance regression test which performs the same scenario on the same hardware to detect degradation or improvement of RHEV-M REST API performance. Nothing needs to be synchronized in this scenario. Let me know if you need an additional explanation of the regression REST API performance test scenario and structure.

> > The same scenarios, the same hardware.
(In reply to Yuri Obshansky from comment #23)
> We are not talking about synchronous actions; we don't need them. JMeter
> measures each API call's time (from request to response) and calculates the
> min, average, 90% line, and max response times. This is a performance
> regression test which performs the same scenario on the same hardware to
> detect degradation or improvement of RHEV-M REST API performance. Nothing
> needs to be synchronized in this scenario.

That doesn't matter. As I said, you are measuring synchronous actions. Shutdown returns only once the VM properly shuts down, and that depends on the overall load on the system. If you happen to stuff more parallel requests onto the host (due to other improvements) and load the host more, this particular call will take longer. The same applies to the other two calls (albeit only partially for start VM).

What would be interesting to see is the duration of the test, or, if you're running for a fixed time, the number of API calls.
(In reply to Michal Skrivanek from comment #24)
> That doesn't matter. As I said, you are measuring synchronous actions.
> Shutdown returns only once the VM properly shuts down, and that depends on
> the overall load on the system. If you happen to stuff more parallel
> requests onto the host (due to other improvements) and load the host more,
> this particular call will take longer. The same applies to the other two
> calls (albeit only partially for start VM).

It is not exactly as you think. The VM state changes to "powering_down" when I execute the REST API /api/vms/${vm_id}/shutdown; the REST API doesn't wait for the VM state to change to "down" before sending me the response. I measure this time, which doesn't depend on other resources/overload/etc. My script has synchronization, i.e. it doesn't start a VM until its state = "down", and vice versa for the Shutdown API. See the sketch below.

> What would be interesting to see is the duration of the test, or, if you're
> running for a fixed time, the number of API calls.

Yes, I have that information. Count of performed API calls during the 2-hour test:
- 3.4 - 750
- 3.5.17 - 450
- 3.5.22 - 714
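To illustrate the measurement/synchronization logic described above (not the actual JMeter test plan, just a sketch in Python with the requests library; the engine URL, credentials, polling interval and XML state check are placeholders/simplifications):

  import time
  import requests

  BASE = 'https://engine.example.com/api'
  AUTH = ('user@domain', 'password')

  def wait_for_state(vm_id, state):
      # synchronization only - this wait is NOT part of the measured time
      while True:
          r = requests.get('%s/vms/%s' % (BASE, vm_id), auth=AUTH, verify=False)
          if '<state>%s</state>' % state in r.text:
              return
          time.sleep(1)

  def measured_shutdown(vm_id):
      # only the POST itself is timed, request to response, as JMeter does
      start = time.time()
      requests.post('%s/vms/%s/shutdown' % (BASE, vm_id),
                    data='<action/>',
                    headers={'Content-Type': 'application/xml'},
                    auth=AUTH, verify=False)
      elapsed = time.time() - start
      wait_for_state(vm_id, 'down')  # don't start again until fully down
      return elapsed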
OK, so start VM is clear - I just looked at the wrong column; there's a 50% improvement. Then we're left with shutdown and login. Both do invoke a VM action synchronously as part of your call, and both invoke the guest agent. Since the 3.4 vs 3.5.22 numbers are comparable, it should +- match, indeed. Can you please check whether the guests are also exactly the same? Can you reproduce the difference when you run just one single API call...and if so, get the logs from both versions. Only one call is enough.
I think it was the same version: rhevm-guest-agent-common-1.0.9-1.el6ev
Current thinking is that the problem is related to the guest agent communication. The difference doesn't seem to be related to any code change, but rather to differences between the various runs.
Moving this one to virt, per the latest comments.
Yuri, what exactly is the "Login to VM" API call in your tests?
Actually, "Login to VM" is not a login; it is the logical name of the transaction. I send an HTTP request to {engine_url}/api/vms using Authorization in the HTTP header, with the encoded string ${user}@nott04.eng.lab.tlv.redhat.com:{password} related to a specific user defined in RHEV-M. This call returns the specific VM's details (XML). That's all.
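In other words, something like this (a sketch with python-requests, not the actual JMeter sampler; the engine host and credentials are placeholders):

  import requests

  # every request carries HTTP Basic credentials - no session reuse
  r = requests.get('https://engine.example.com/api/vms',
                   auth=('user@nott04.eng.lab.tlv.redhat.com', 'password'),
                   verify=False)
  print(r.status_code, len(r.text))  # XML body with the VM details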
OK, interesting. Can you reproduce the difference when you run just one single API call...and if so, get the logs from both versions? Only one call is enough. Anyway, the difference in the fixed version is not that big (~20%) and affects only 2 calls, shutdown being the slowest but also probably the least interesting - it is a long operation and it is the end of that particular VM, so you don't do it often.
It is a little bit complicated right now; I'm in the middle of another test execution. It requires installing 3.4, running the single call, and reinstalling 3.5, which requires downgrading the vdsm version on the host as well. It could take several days of work. If you need a single API call, why don't you run it in any development environment?
Sure, but that's not high priority, so I can't tell when we'll get to that. There were no changes in this area between 3.4 and 3.5 which wouldn't show up in other calls. There's nothing specific to "shutdown" other than it being a synchronous call to the host, so unless you have the complete picture of the performance of the host, the different results do not really mean anything. As I said, it's probably not interesting to do that. For now I don't see anything to fix, and I would suggest closing this.
(In reply to Yuri Obshansky from comment #31)
> Actually, "Login to VM" is not a login; it is the logical name of the
> transaction. I send an HTTP request to {engine_url}/api/vms using
> Authorization in the HTTP header, with the encoded string
> ${user}@nott04.eng.lab.tlv.redhat.com:{password} related to a specific user
> defined in RHEV-M. This call returns the specific VM's details (XML).

So, is it 50 concurrent queries to all VMs, or to a specific random VM? There were a few more features added in 3.5, so the response is larger than in 3.4.
(In reply to Michal Skrivanek from comment #35)
> So, is it 50 concurrent queries to all VMs, or to a specific random VM?

The RHEV-M setup has only 50 VMs, so each thread works with a specific VM (user). The threads run concurrently, but the calls are performed at random times: each thread starts with a 0.5 sec delay and sends its query to the VM at a different time. This is a real-life simulation method.

> There were a few more features added in 3.5, so the response is larger than
> in 3.4.

I'm sure.
Alright, let's summarize the remaining issues...

shutdown - unless this is reproducible on another setup and/or with a single VM, my theory is that the difference is due to a different load during the two runs. It's a sync call which can block on the vdsm side for the time the qemu process needs to exit.

vm get - you're authorizing/logging in with each request. When you care about performance, you're supposed to update the scripts to use the jsessionid to avoid this (see the sketch below). So if an individual authentication is a bit slower, it's not a big deal, IMHO. It also might depend on the performance of your LDAP server.

Overall, I don't see much to fix anymore, so I propose closing as currentrelease (due to the previous real fix in jsonrpc).
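To spell out the jsessionid suggestion (a minimal sketch with python-requests, assuming the engine honors the Prefer: persistent-auth header as in the oVirt REST API; the URL and credentials are placeholders):

  import requests

  s = requests.Session()
  s.verify = False

  # first request authenticates and asks the engine to keep the session;
  # the response sets a JSESSIONID cookie stored in the Session object
  s.get('https://engine.example.com/api',
        auth=('user@domain', 'password'),
        headers={'Prefer': 'persistent-auth'})

  # subsequent requests ride on the cookie - no per-request authentication
  r = s.get('https://engine.example.com/api/vms')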
(In reply to Michal Skrivanek from comment #37)
> Alright, let's summarize the remaining issues...
>
> shutdown - unless this is reproducible on another setup and/or with a
> single VM, my theory is that the difference is due to a different load
> during the two runs. It's a sync call which can block on the vdsm side for
> the time the qemu process needs to exit.
>
> vm get - you're authorizing/logging in with each request. When you care
> about performance, you're supposed to update the scripts to use the
> jsessionid to avoid this. So if an individual authentication is a bit
> slower, it's not a big deal, IMHO. It also might depend on the performance
> of your LDAP server.
>
> Overall, I don't see much to fix anymore, so I propose closing as
> currentrelease (due to the previous real fix in jsonrpc)

I agree with you. It's a minor difference (3 sec vs 2 sec) under the current load schema. I suggest closing it and moving ahead to run new performance tests.