Description of problem:
When running parallel disk migrations of one VM, vdsm gets errors in getVmIoTune. The VM state is declared as unknown in the oVirt web interface, and a question mark is shown next to the VM.

Version-Release number of selected component (if applicable):
oVirt 4.1.2
vdsm 4.19.15

How reproducible:
100%

Steps to Reproduce:
1. Start a VM with multiple disks (on NFS storage)
2. Start multiple disk migrations (5 in our case)

Actual results:
Migration gives errors. The state of the VM is set to unknown.

Expected results:
Migrations should succeed.

Additional info:
VM disks are attached via NFS protocol version 4.2. No cross-checks with NFS 4.0/4.1 were executed.
Created attachment 1294492 [details] vdsm log of hypervisor
The log boils down to:

<<<< START VM
2017-07-04 21:23:31,459+0200 INFO (vm/af68904d) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt="http://ovirt.org/vm/tune/1.0" type="kvm">
    <name>xxx_test</name>
    ...

<<<< Log info about disk migrations
2017-07-04 21:38:12,714+0200 INFO (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:34,359+0200 INFO (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:54,653+0200 INFO (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:15,289+0200 INFO (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:30,138+0200 INFO (jsonrpc/5) [vdsm.api] START diskReplicateStart
2017-07-04 21:39:38,552+0200 INFO (jsonrpc/4) [vdsm.api] START snapshot
2017-07-04 21:39:53,699+0200 INFO (jsonrpc/7) [vdsm.api] START diskReplicateStart
2017-07-04 21:40:08,717+0200 INFO (jsonrpc/5) [vdsm.api] START diskReplicateStart

<<<< VDSM error
2017-07-04 21:40:28,939+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') getVmIoTune failed (vm:2833)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2818, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
2017-07-04 21:40:28,947+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 30.02 seconds (__init__:533)
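For context, here is a minimal sketch of the guard pattern the traceback points at (an assumption about the mechanism, not vdsm's actual code): vdsm wraps its libvirt calls and converts libvirt's operation-timeout error into its own TimeoutError, so a state change lock held by another job (here remoteDispatchConnectGetAllDomainStats) surfaces as the getVmIoTune failure above.

import libvirt

class TimeoutError(libvirt.libvirtError):
    """Raised when libvirt cannot acquire the domain's state change lock."""

def timeout_guarded(libvirt_call):
    # Wrap a libvirt method so a lock-acquisition timeout is reported
    # as TimeoutError instead of a generic libvirtError.
    def f(*args, **kwargs):
        try:
            return libvirt_call(*args, **kwargs)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                toe = TimeoutError(e.get_error_message())
                raise toe
            raise
    return f

# Hypothetical usage, querying the live I/O tuning of a disk as
# getIoTuneResponse does in the traceback above:
# conn = libvirt.open('qemu:///system')
# dom = conn.lookupByUUIDString('af68904d-d140-4d02-a1d7-118696d6ade7')
# io_tune = timeout_guarded(dom.blockIoTune)('vda', libvirt.VIR_DOMAIN_AFFECT_LIVE)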
We received those errors during storage timeouts. The cause is not on the oVirt side; the errors are only the evidence of the problem.

Nevertheless one question remains: can a blocked storage domain lead to a long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I guess)? This should be a KVM-domain-only check command.
(In reply to Markus Stockhausen from comment #3)
> We received those errors during storage timeouts. The cause is not on the
> oVirt side; the errors are only the evidence of the problem.
>
> Nevertheless one question remains: can a blocked storage domain lead to a
> long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I
> guess)? This should be a KVM-domain-only check command.

Hi Markus. Unfortunately, the answer is "yes". We depend on libvirt to get the state of the system, and if one domain is busy performing expensive state changes, like disk replication, other operations may experience long delays. This is because the qemu/kvm command protocol is strictly sequential. So it boils down to the lower layers and their behaviour.

What oVirt can do, however, is minimize the queries to libvirt. We added, and are still adding, caches where it makes sense, leveraging the fact that oVirt is the owner of the nodes - no one can make changes outside oVirt's control.

The work we did to fix https://bugzilla.redhat.com/show_bug.cgi?id=1443654 should help in this case as well. Please try vdsm 4.19.16 and/or oVirt 4.1.3.
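To illustrate the caching idea described above, here is a minimal sketch (under assumptions; not oVirt's implementation): serve repeated stat queries from a short-lived cache so that a slow getAllDomainStats call stuck behind a busy domain is not re-issued for every incoming RPC.

import time
import threading
import libvirt

class CachedStats(object):
    def __init__(self, conn, ttl=15.0):
        self._conn = conn    # an open libvirt connection
        self._ttl = ttl      # seconds a sample stays fresh
        self._lock = threading.Lock()
        self._stamp = 0.0
        self._stats = None

    def get(self):
        with self._lock:
            if self._stats is None or time.time() - self._stamp > self._ttl:
                # Only one thread refreshes; concurrent callers reuse the
                # last sample instead of piling up on a busy libvirt.
                self._stats = self._conn.getAllDomainStats()
                self._stamp = time.time()
            return self._stats

# Hypothetical usage:
# conn = libvirt.open('qemu:///system')
# stats = CachedStats(conn)
# per_domain = stats.get()  # repeated calls within the TTL hit the cache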
Taking the bug and keeping it open for some more time, but given comments 1, 2, and 3 there is not much we can do here: there is no evidence of an oVirt issue.
Actually, it is better to close this. Please reopen if there is new evidence.