Easy fix addressing a significant slowdown in a frequent API call.

+++ This bug was initially created as a clone of Bug #1102147 +++

Description of problem:
Host state is not stable after population of 200 VMs.
Every 4-5 minutes the host switches to Connecting state.

2014-05-28 14:01:55,464 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-32) Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-28 14:01:55,542 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-32) Failed to refresh VDS , vds = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc : HOST-REAL, VDS Network Error, continuing.java.util.concurrent.TimeoutException

occurs in engine.log, and

Thread-886::ERROR::2014-05-28 13:22:28,627::SecureXMLRPCServer::168::root::(handle_error) client ('10.35.3.13', 59944)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 331, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 223, in do_POST
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: bad write retry

occurs in vdsm.log.

Version-Release number of selected component (if applicable):
RHEVM: 3.4.0-0.16.rc.el6ev
OS Version: RHEV Hypervisor - 6.5 - 20140313.1.el6ev
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.2-0.4.el6ev

How reproducible:

Steps to Reproduce:
1. Create 1 DC, 1 Cluster, 1 Host
2. Connect to iSCSI SD
3. Create 200 VMs

Actual results:
See above.

Expected results:
Host is stable.

Additional info:

--- Additional comment from Yuri Obshansky on 2014-05-28 16:36:45 CEST ---

--- Additional comment from Eldad Marciano on 2014-05-28 16:49:11 CEST ---

+1 reproduced on 3.3 too

--- Additional comment from Barak on 2014-05-29 14:28:01 CEST ---

Yuri:
- At what rate were the VMs created?
- In bulks? All together?

--- Additional comment from Barak on 2014-05-29 14:30:46 CEST ---

Danken - does it smell like M2C?

--- Additional comment from Piotr Kliczewski on 2014-05-29 15:15:10 CEST ---

It looks like it is similar to [1]. Can we get the size of the data to be sent as a response to GetAllVmStatsVDSCommand?

[1] http://rt.openssl.org/Ticket/Display.html?id=598

--- Additional comment from Yuri Obshansky on 2014-05-29 15:54:26 CEST ---

Barak - one by one, no bulks.
Piotr - how can I measure the size of the data?

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:10:54 CEST ---

The easiest way to get the response size is to log or print len(response). You can modify [1] and log the length just before the write in the do_POST method.

[1] /usr/lib64/python2.6/site-packages/vdsm/utils.py
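[Editor's note] For illustration, a minimal standalone sketch of that kind of instrumentation: a do_POST handler that logs the response size just before writing it. This is a toy Python 2 BaseHTTPServer handler, not the actual vdsm/utils.py code; the dispatch method and port are made up.

import logging
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("request.size")


class SizeLoggingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.getheader('content-length') or 0)
        body = self.rfile.read(length)
        response = self.dispatch(body)  # placeholder for the real XML-RPC dispatch
        # The suggested instrumentation: record the payload size right
        # before it is written to the (possibly SSL-wrapped) socket.
        log.debug("response size: %d bytes", len(response))
        self.send_response(200)
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

    def dispatch(self, body):
        return body  # echo back; only here to make the example runnable


if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8099), SizeLoggingHandler).serve_forever()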
--- Additional comment from Dan Kenigsberg on 2014-05-29 16:18:26 CEST ---

It might be an m2crypto issue, but I doubt that, as it happens every 4 minutes plus 5 or 10 seconds. It's a bit too periodic. It might be a matter of Engine choking up the connection periodically - can you check if there's a spike in traffic during the time of the errors?

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:28:55 CEST ---

It looks like a Quartz job on the engine tries to get all VM stats every couple of minutes and for some reason never receives an answer. After some time it recovers and tries again. I think we need to understand why writing the response fails on the vdsm side.

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:48:27 CEST ---

There is no source on this host machine, so we can't modify it to see the size. I will try to recreate the bug by playing with the size of the response.

--- Additional comment from Yaniv Bronhaim on 2014-06-02 11:53:08 CEST ---

It's a matter of load. Running "vdsClient getAllVmStats" on that host (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is one issue that must be solved, and it sounds like a regression. From the engine's perspective, 4 minutes will always hit the timeout; we can't do much about it. VDSM needs to respond faster.

--- Additional comment from Yaniv Bronhaim on 2014-06-02 12:12:13 CEST ---

  PID USER  PR  NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
23221 vdsm   0 -20 32.2g 900m   9m S 1066.6  0.2 110:59.49 vdsm
61074 root  20   0 7290m  31m 8112 S  114.9  0.0  12899:21 libvirtd
29402 root  20   0 17092 3356 1004 R   21.0  0.0   0:30.96 top
29582 qemu  20   0 1726m 1.2g 6364 S    8.3  0.3 439:08.29 qemu-kvm
...

Not sure it's supposed to work... have we ever checked 200*3 Python threads at the same time?

--- Additional comment from Piotr Kliczewski on 2014-06-02 12:37:57 CEST ---

The response size is over 370k, so following the issue from the link, it seems that while this huge response is being sent, new messages are received and sent, so the write retry fails. There are similar issues [1][2] that people see with openssl.

[1] http://stackoverflow.com/questions/20715408/solved-pyopenssl-ssl3-write-pendingbad-write-retry-return-self-sslobj-write
[2] http://stackoverflow.com/questions/14695247/in-php-openssl-error-messages-error-1409f07f-ssl-routines-ssl3-write-pendin
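[Editor's note] For background on the "bad write retry" failure: OpenSSL raises SSL3_WRITE_PENDING / bad write retry when a partial or blocked SSL write is retried with a different buffer or length. A minimal sketch of the retry discipline a writer has to follow, using the standard-library ssl module on a modern Python purely for illustration; the endpoint is hypothetical and this is not the M2Crypto/VDSM code path.

import socket
import ssl


def send_all(sock, payload):
    """Write payload over an SSL socket, retrying short writes correctly.

    After a write fails with SSL_ERROR_WANT_WRITE, OpenSSL requires the
    retry to pass the *same* buffer and length (unless
    SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER is set); handing it a different
    buffer is what produces "bad write retry".
    """
    view = memoryview(payload)
    while len(view):
        try:
            sent = sock.send(view)
            view = view[sent:]  # advance only after a successful write
        except ssl.SSLWantWriteError:
            # Not writable yet: loop and retry with the SAME remaining buffer.
            # Do not rebuild, reorder or interleave other pending data here.
            continue


if __name__ == '__main__':
    # Tiny demo against a public HTTPS endpoint (hypothetical choice); with a
    # blocking socket the except branch is rarely hit, but the discipline holds.
    ctx = ssl.create_default_context()
    conn = ctx.wrap_socket(socket.create_connection(('example.org', 443)),
                           server_hostname='example.org')
    send_all(conn, b'HEAD / HTTP/1.0\r\nHost: example.org\r\n\r\n')
    print(conn.recv(200))
    conn.close()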
--- Additional comment from Piotr Kliczewski on 2014-06-02 14:30:00 CEST ---

--- Additional comment from Yaniv Bronhaim on 2014-06-02 15:13:34 CEST ---

The host in question has 400 GB RAM and 32 CPU cores. What limitation do we have on such hosts? Do we allow running 400 VMs, each with 1 GB RAM? Don't we need a limitation based on CPUs as well? We are currently testing what the maximum number of VMs is that this host can run before the response becomes so slow that the engine throws a timeout.

--- Additional comment from Michal Skrivanek on 2014-06-02 15:40:25 CEST ---

Adding Francesco. Could getBalloonInfo be of help? I suppose with 200 VMs eliminating the dom access may help, at least with the responsiveness of the getstats call. Can we try VMs without balloon? Worth a try...
@Yaniv, 32 GB virtual size is just ill, any idea why?

--- Additional comment from Yuri Obshansky on 2014-06-03 11:20:29 CEST ---

We started to reproduce the issue and find the limit. We cleaned up the RHEVM, reinstalled vdsm on the host machine and started populating VMs from 0 until we got the exception. It happened on the 182nd VM.

2014-06-03 12:13:42,876 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-7) [17f97fa8] Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

--- Additional comment from Dan Kenigsberg on 2014-06-03 11:51:00 CEST ---

(In reply to Yaniv Bronhaim from comment #15)
> don't we need to have a limitation based on cpus also?

I don't know if such a limitation exists on the Engine side. In my opinion, it should not, as CPU over-commitment is one of the benefits of virtualization. If the guests do nothing, and Vdsm (and the rest of the management stack) is responsive - we should accept more VMs.

--- Additional comment from Francesco Romani on 2014-06-03 14:33:33 CEST ---

(In reply to Yaniv Bronhaim from comment #11)
> It's a matter of load. Running "vdsClient getAllVmStats" on that host
> (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is
> one issue that must be solved, and it sounds like a regression. From the
> engine's perspective, 4 minutes will always hit the timeout; we can't do
> much about it. VDSM needs to respond faster.

For the response time, this patch may provide some relief: http://gerrit.ovirt.org/#/c/27696/5

I'm all for making things faster, but we'll need some profiling data for this, and this will in turn require VDSM >= 4.15 (yappi support was merged well after 4.14).

Lastly, the amount of memory consumed is indeed worrisome. I'm thinking of adding optional support for memory profiling as well.
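[Editor's note] Since yappi keeps coming up: a minimal, generic sketch of per-thread CPU profiling with yappi. This is not the VDSM integration; the worker function, thread count and output path are made up.

import threading
import time

import yappi  # pip install yappi


def worker(n):
    # Hypothetical stand-in for a stats-gathering thread.
    total = 0
    for i in range(n):
        total += i * i
    time.sleep(0.1)
    return total


if __name__ == '__main__':
    yappi.start()  # profiles all threads, not just the main one

    threads = [threading.Thread(target=worker, args=(200000,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    yappi.stop()
    # Per-function and per-thread views; save in pstat format for offline analysis.
    yappi.get_func_stats().print_all()
    yappi.get_thread_stats().print_all()
    yappi.get_func_stats().save('vdsm-profile.pstat', type='pstat')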
--- Additional comment from Dima Kuznetsov on 2014-06-11 16:05:13 CEST ---

Yuri, I've tried Francesco's patch on my humble i7 quad-core Dell with 125 faqemu VMs, and it got the vdsClient getAllVmStats response time from 6 minutes down to 1 second (!). Can you try it?

--- Additional comment from Yuri Obshansky on 2014-06-11 16:15:05 CEST ---

Yes, sure. I'm preparing the environment for bug verification right now. Send me the patch and a guide on how to install it by e-mail. I think we should first reproduce the bug, and only after that apply the patch and try to reproduce again.

--- Additional comment from Francesco Romani on 2014-06-13 09:59:12 CEST ---

VDSM patch backported here: http://gerrit.ovirt.org/#/c/28686/1

--- Additional comment from Michal Skrivanek on 2014-06-13 15:23:17 CEST ---

I don't think the patch was meant to steal the bug completely.
Also @Yuri, be careful when answering on a bug with multiple needinfos :)

(In reply to Dan Kenigsberg from comment #18)
> to have a limitation based on cpus also?
There's no such limitation. +1 on being able to absorb more.

(In reply to Yaniv Bronhaim from comment #12)
> have we ever checked 200*3 python's threads at same time?
Why *3? Also please see comment #16.

--- Additional comment from Yaniv Bronhaim on 2014-06-15 08:58:41 CEST ---

I don't know the reason for this huge virtual memory footprint, but it still looks like Francesco's patch has a big improvement impact on it. Dima will share profiling output and usage status soon for further investigation.

--- Additional comment from Michal Skrivanek on 2014-06-17 09:23:53 CEST ---

Do we want to do anything else here except the balloon? If not, I'd take it under virt and put Francesco in charge of backports...

--- Additional comment from Yaniv Bronhaim on 2014-06-17 10:12:45 CEST ---

The backport is already pushed but not verified [1]. Dima is working with Yuri to find the limitations and see whether the patch helps in his environment. If we find out that Francesco's improvement is enough to fix the communication issue, I'll forward you the bug to continue.

[1] http://gerrit.ovirt.org/#/c/28686

--- Additional comment from Yaniv Bronhaim on 2014-06-17 14:00:00 CEST ---

Looks like the patch fixes the issue for this number of VMs on the same environment; it brings the getAllVmStats response down to 30 seconds.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:06:01 CEST ---

The applied patch has resolved our problem with 200 VMs. The host is stable, no exceptions.
But: we tried, unsuccessfully, to create an additional 100 VMs (increasing the number of VMs up to 300). The engine (or vdsm) started to lose VMs; every minute a VM changed state from Up to Not Responding and vice versa. I found only this info message in engine.log:

2014-06-18 11:17:47,316 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-46) [a94764c] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from Up --> NotResponding
2014-06-18 11:17:47,324 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-46) [a94764c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM scale_108 is not responding.
............
2014-06-18 11:18:10,845 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [2a9e7042] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from NotResponding --> Up

Looks like 200 VMs is the limit.
See attached new log files.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:06:58 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:07:32 CEST ---

--- Additional comment from Francesco Romani on 2014-06-18 14:28:41 CEST ---

Hi, the VDSM log is too tiny, there isn't much to see here (it covers ~18 seconds!). Is it possible to have more VDSM logs?

We'd probably greatly benefit from profiling data as Dan suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c19

--- Additional comment from Yuri Obshansky on 2014-06-18 14:45:21 CEST ---

Yes, sure. Uploaded 5 VDSM log files.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:45:56 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:46:28 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:46:59 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:47:34 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:48:03 CEST ---

--- Additional comment from Michal Skrivanek on 2014-06-18 16:41:50 CEST ---

It looks like a nicely overloaded system. Can we get a sar file, or at least a load value? It may just be that we're simply over the limit the host can handle...
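[Editor's note] Where sar data isn't already being collected, a small sampler like the following could record the load average alongside the test run. The interval and output path are made up; this is only a sketch, not part of VDSM.

import os
import time

INTERVAL = 10            # seconds between samples; hypothetical value
OUTPUT = 'loadavg.log'   # hypothetical output path


def sample_load(interval=INTERVAL, output=OUTPUT):
    """Append timestamped 1/5/15-minute load averages, a poor man's `sar -q`."""
    with open(output, 'a') as f:
        while True:
            one, five, fifteen = os.getloadavg()
            f.write('%s %.2f %.2f %.2f\n' % (
                time.strftime('%Y-%m-%d %H:%M:%S'), one, five, fifteen))
            f.flush()
            time.sleep(interval)


if __name__ == '__main__':
    sample_load()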
*** Bug 1102814 has been marked as a duplicate of this bug. ***
Bug verification failed. I ran a simple script which performs the getAllVmStats command and measures the response time. See the results:

getAllVmStats response time w/o patch:
  min     37.319 sec
  average 46.398 sec
  max     52.005 sec

getAllVmStats response time with patch:
  min     33.058 sec
  average 48.268 sec
  max     56.701 sec
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev
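[Editor's note] The measurement script itself isn't attached; a minimal sketch of how such a measurement could be taken, shelling out to vdsClient. The host address and sample count are made up, and a working SSL client setup for vdsClient -s is assumed.

import os
import subprocess
import time

HOST = '10.35.3.13'   # hypothetical hypervisor address
RUNS = 20             # hypothetical number of samples


def time_get_all_vm_stats(host=HOST, runs=RUNS):
    """Time repeated `vdsClient -s <host> getAllVmStats` calls and print min/avg/max."""
    samples = []
    with open(os.devnull, 'w') as devnull:
        for _ in range(runs):
            start = time.time()
            subprocess.check_call(['vdsClient', '-s', host, 'getAllVmStats'],
                                  stdout=devnull)
            samples.append(time.time() - start)
    print('min     %.3f sec' % min(samples))
    print('average %.3f sec' % (sum(samples) / len(samples)))
    print('max     %.3f sec' % max(samples))
    return samples


if __name__ == '__main__':
    time_get_all_vm_stats()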
(In reply to Yuri Obshansky from comment #7)
> Bug verification failed.
> I ran a simple script which performs the getAllVmStats command and
> measures the response time. See the results:
>
> getAllVmStats response time w/o patch:
>   min     37.319 sec
>   average 46.398 sec
>   max     52.005 sec
>
> getAllVmStats response time with patch:
>   min     33.058 sec
>   average 48.268 sec
>   max     56.701 sec

Initial observation suggests that something else is eating the gains obtained with this patch. The difference is likely small enough to get lost in run-to-run variation. The times are too high in any case.
Not failed; it is still under investigation whether it improved things or not, so moving the bug back to the appropriate state. We need to figure out a better verification test or find the source of the high latency in general. It may need an artificial test... but I'm not sure it's worth the effort. There are likely other reasons for the ~45 s average, and that's something we should understand...
Michal, could you please suggest how we should test it, or maybe point us to other info we can provide to help you figure this out?
As I said, perhaps only by an artificial test: run getAllVmStats in a loop as fast as you can, with and without the patch; you *should* see a real difference. Again, it's not that important to prove it, as the benefit is obvious from the flow perspective... it may just be negligible/hidden by other stuff.
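[Editor's note] To judge whether such a loop shows a real difference rather than noise (the concern raised about comment #7), a small helper like this could compare the two timing samples. The per-call timings below are purely hypothetical placeholders.

import math


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    return mean, math.sqrt(var)


def compare(before, after):
    """Print both summaries; a difference well inside one standard deviation
    is indistinguishable from run-to-run variation."""
    mb, sb = summarize(before)
    ma, sa = summarize(after)
    print('without patch: mean %.3f s, stddev %.3f s' % (mb, sb))
    print('with patch:    mean %.3f s, stddev %.3f s' % (ma, sa))
    print('difference:    %.3f s' % (mb - ma))


if __name__ == '__main__':
    # Hypothetical per-call timings collected by the loop described above.
    without_patch = [46.2, 44.9, 48.1, 45.6, 47.3]
    with_patch = [45.8, 47.0, 44.1, 46.9, 45.2]
    compare(without_patch, with_patch)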
Michal, based on comment #7 we already tested your suggested flow, and I can't say we see a real difference. How do you suggest we act on this BZ?
It was not really representative; the conditions are specific, and I suggested an artificial test. But IMHO it's not worth it and you can close the bug.
Closing based on comment #14