Red Hat Bugzilla – Bug 1467806
getVmIoTune failed during parallel disk migration
Last modified: 2017-07-12 04:33:04 EDT
Description of problem:
When running parallel disk migrations for a single VM, vdsm reports errors in getVmIoTune. The VM state is declared unknown in the oVirt web interface, and a question mark is shown next to the VM.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start VM with multiple disks (on NFS storage)
2. Start multiple disk migrations (5 in our case)
Actual results:
The migration produces errors and the VM state is set to unknown.
Expected results:
Migrations should succeed.
VM disks are attached via NFS protocol version 4.2. No cross-checks with NFS 4.0/4.1 were executed.
Created attachment 1294492 [details]
vdsm log of hypervisor
Log boils down to:
<<<< START VM
2017-07-04 21:23:31,459+0200 INFO (vm/af68904d) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt="http://ovirt.org/vm/tune/1.0" type="kvm">
<<<< Log infos about disk migrations
2017-07-04 21:38:12,714+0200 INFO (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:34,359+0200 INFO (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:54,653+0200 INFO (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:15,289+0200 INFO (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:30,138+0200 INFO (jsonrpc/5) [vdsm.api] START diskReplicateStart
2017-07-04 21:39:38,552+0200 INFO (jsonrpc/4) [vdsm.api] START snapshot
2017-07-04 21:39:53,699+0200 INFO (jsonrpc/7) [vdsm.api] START diskReplicateStart
2017-07-04 21:40:08,717+0200 INFO (jsonrpc/5) [vdsm.api] START diskReplicateStart
<<<< VDSM error
2017-07-04 21:40:28,939+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') getVmIoTune failed (vm:2833)
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 2818, in getIoTuneResponse
File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
2017-07-04 21:40:28,947+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 30.02 seconds (__init__:533)
We received those errors during storage timeouts. The root cause is not on the oVirt side; the error is only a symptom.
Nevertheless, one question remains: can a blocked storage domain lead to a long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I guess)? This should be a KVM-domain-only check command.
(In reply to Markus Stockhausen from comment #3)
> We received those errors during storage timeouts. The root cause is not on
> the oVirt side; the error is only a symptom.
> Nevertheless, one question remains: can a blocked storage domain lead to a
> long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I
> guess)? This should be a KVM-domain-only check command.
Unfortunately, the answer is "yes". We depend on libvirt to get the state of the system, and if one domain is busy performing expensive state changes, like disk replication, other operations may experience long delays. This is because the qemu/kvm command protocol is strictly sequential.
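The effect of a strictly sequential command channel can be shown with a toy model (purely illustrative; the command names are only borrowed from QEMU's QMP vocabulary): a cheap stats query issued right after an expensive operation must wait for it to finish.

```python
import threading
import time

class Monitor:
    """Toy model of a strictly sequential command channel: every
    command, cheap or expensive, goes through one lock in issue order."""

    def __init__(self):
        self._lock = threading.Lock()

    def command(self, name, duration=0.0):
        with self._lock:          # only one command in flight at a time
            time.sleep(duration)  # stand-in for qemu doing the work
            return name

mon = Monitor()
waited = {}

def stats_query():
    start = time.time()
    mon.command("query-blockstats")  # cheap query
    waited["stats"] = time.time() - start

# An expensive operation (think: starting disk replication) holds the
# channel; the stats query issued just after it blocks until it is done.
slow = threading.Thread(target=mon.command, args=("drive-mirror", 0.5))
slow.start()
time.sleep(0.05)  # ensure the slow command grabs the channel first
fast = threading.Thread(target=stats_query)
fast.start()
slow.join()
fast.join()
```

Scaled up to real timeouts, this is how a 30-second stats call on one busy domain makes an unrelated getVmIoTune query time out.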
So it boils down to the lower layers and their behaviour.
What oVirt can do, however, is minimize the queries to libvirt. We have added, and are still adding, caches where it makes sense, leveraging the fact that oVirt owns the nodes: no one can make changes outside oVirt's control.
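The caching idea can be sketched as a minimal time-based cache (an illustration of the approach, not vdsm's actual cache code): serve a stored answer while it is fresh, so repeated stats queries do not all hit libvirt.

```python
import time

class TTLCache:
    """Minimal time-based cache around an expensive fetch function."""

    def __init__(self, fetch, ttl=5.0, clock=time.monotonic):
        self._fetch = fetch   # the expensive call, e.g. a libvirt query
        self._ttl = ttl       # how long a cached answer stays fresh
        self._clock = clock
        self._value = None
        self._stamp = None    # time of the last real fetch, None = never

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()  # refresh from the real source
            self._stamp = now
        return self._value
```

Because oVirt is the only writer on the node, a slightly stale cached answer is safe, and callers like getVmIoTune never have to wait on a busy domain for data that has not changed.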
The work we did to fix https://bugzilla.redhat.com/show_bug.cgi?id=1443654 should help in this case as well. Please try Vdsm 4.19.16 and/or oVirt 4.1.3.
Taking the bug and keeping it open for some more time, but given comments 1, 2, and 3 there is not much we can do here; there is no evidence of an oVirt issue.
Actually, it is better to close this. Please reopen if there is new evidence.