During the verification of Bug #1102147, which was about preventing timeouts and network exceptions caused by the system being slow to respond to ballooning queries, concerns were raised about the performance side. This bug is to track and investigate these concerns.

Relevant history from bz 1102147:

+++ This bug was initially created as a clone of Bug #1102147 +++

--- Additional comment from Yuri Obshansky on 2014-11-10 02:13:30 EST ---

Bug verification failed.
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev
I ran a simple script which performs the getAllVmStats command and measures the response time.
See results:

getAllVmStats response time w/o patch
min 37.319 sec
average 46.398 sec
max 52.005 sec

getAllVmStats response time with patch
min 33.058 sec
average 48.268 sec
max 56.701 sec

--- Additional comment from Yaniv Bronhaim on 2014-11-11 06:30:02 EST ---

I suggest to open new bug. the fix here refers to virt by polling the vms status and it decreases the penalty of it - this was merged to 3.5 as the bug is targeted. although performance issues regarding running large scale of vms can lead to many bottle necks this bottle neck is solved in the bugzilla (fromani - please change the bug title according to the fix)

Yuri, 5 second better is also something meaningful. but we need to understand more deeply what takes long and where to investigate. please open new bug with the latest logs and we'll try to investigate more and close this one

--- Additional comment from Francesco Romani on 2014-11-11 06:40:26 EST ---

(In reply to Yaniv Bronhaim from comment #59)
> I suggest to open new bug. the fix here refers to virt by polling the vms
> status and it decreases the penalty of it - this was merged to 3.5 as the
> bug is targeted.
> although performance issues regarding running large scale
> of vms can lead to many bottle necks
> this bottle neck is solved in the bugzilla (fromani - please change the bug
> title according to the fix)

Done

--- Additional comment from Yuri Obshansky on 2014-11-16 09:00:41 EST ---

I ran the script with getAllVmStats to test Francesco's patch, which is the fix for bug 1110835 (it is a clone of the current bug).
Anyway, I reran the test using the original flow with only one difference - I repeated that flow (create 200 VMs) twice:
1. host created without jsonrpc protocol (using old xmlrpc)
2. host created with jsonrpc protocol.

Executive summary:
- Test with jsonrpc protocol failed.
- VM creation process started to be slower and slower after we've created more than 80 VMs.
- DC and Storage Domain crashed each 1/2 hours of running.
- Last 111 VMs creation process took 7 minutes.
- Environment was not functional from my point of view.

Exceptions raised in engine log:
2014-11-16 14:22:45,120 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [69a1a5c] Correlation ID: 69a1a5c, Job ID: b1d0760c-c346-4037-84c5-30bc8b73c58e, Call Stack: null, Custom Event ID: -1, Message: Failed to Reconstruct Master Domain for Data Center DC-REAL.
2014-11-16 14:25:42,260 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-11-16 14:25:42,275 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-60) [409d1fa0] IRS failover failed - cant allocate vds server
2014-11-16 14:25:42,520 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-7) Operation Failed: [Cannot run VM. The relevant Storage Domain's status is Inactive.]

Error raised in vdsm log:
GuestMonitor-scale-107::DEBUG::2014-11-16 14:27:08,414::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`535fb4ef-a64b-4754-b700-87a8e32d31b5`::Domain Metadata is not set
Thread-140622::DEBUG::2014-11-16 14:27:08,419::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Host details:
vdsm-4.16.7.1-1.el6ev
libvirt-0.10.2-46.el6_6.1
32 CPU Cores
400 G RAM

--- Additional comment from Francesco Romani on 2014-11-24 04:07:14 EST ---

This bug is becoming too complex to follow. I don't think the issues found in ReconstructMasterDomainCommand flow with json rpc, although of course important, belong in this bug. I suggest to open a new one to better track them.
I also believe that it is best to file a new one addressing the performance characteristics of this patch, as concerns were raised in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58

This bug was about avoiding timeouts and exceptions, and verification *did not* prove it faulty (although of course more verification is welcome).

Barak, Michal, what do you think about the above?

This is no longer relevant because the sampling subsystem in VDSM was completely overhauled in 3.6.0. All libvirt calls now happen on a thread pool, so they are always asynchronous with respect to the server threads, that is, the ones that answer queries from Engine/clients. The queries, be they event notifications or poll requests, always fetch cached data.
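
For readers unfamiliar with the pattern, here is a rough sketch of the idea, not VDSM's actual code: slow hypervisor calls run on pool workers and write into a cache, while the query path only reads that cache. All names here (StatsCache, slow_hypervisor_query, sample_vm, get_all_vm_stats) are hypothetical, and the sleep merely stands in for a blocking libvirt call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class StatsCache:
    """Thread-safe store of the most recent sample per VM (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = {}

    def put(self, vm_id, sample):
        with self._lock:
            self._stats[vm_id] = sample

    def get_all(self):
        # Return a copy so callers never see a dict that is being updated.
        with self._lock:
            return dict(self._stats)


def slow_hypervisor_query(vm_id, delay=0.2):
    """Hypothetical stand-in for a blocking libvirt call."""
    time.sleep(delay)
    return {"vm_id": vm_id, "sampled_at": time.monotonic()}


def sample_vm(cache, vm_id):
    # Runs on a pool worker; only this worker blocks on the slow call.
    cache.put(vm_id, slow_hypervisor_query(vm_id))


def get_all_vm_stats(cache):
    """What a server thread would do: read the cache, never the hypervisor."""
    return cache.get_all()


cache = StatsCache()
vm_ids = ["vm-%d" % i for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(sample_vm, cache, vm_id) for vm_id in vm_ids]

    # While sampling is still in flight, a query returns without waiting.
    start = time.monotonic()
    early = get_all_vm_stats(cache)
    query_seconds = time.monotonic() - start

    for future in futures:
        future.result()  # propagate any sampling error

final = get_all_vm_stats(cache)
print("query took %.4f s, %d VMs cached at the end" % (query_seconds, len(final)))
```

With this split, the response time of a query depends on the cache lookup only, not on how slowly the hypervisor answers; the trade-off is that the data returned may be as old as the last completed sample.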