Bug 1467806 - getVmIoTune failed during parallel disk migration
Status: CLOSED NOTABUG
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.15
Severity: medium
Assigned To: Francesco Romani
Raz Tamir
Reported: 2017-07-05 04:18 EDT by Markus Stockhausen
Modified: 2017-07-12 04:33 EDT (History)

Last Closed: 2017-07-12 04:33:04 EDT
Type: Bug
oVirt Team: Virt


Attachments
vdsm log of hypervisor (545.40 KB, application/zip)
2017-07-05 04:20 EDT, Markus Stockhausen

Description Markus Stockhausen 2017-07-05 04:18:58 EDT
Description of problem:

When running parallel disk migrations of one VM, vdsm reports errors in getVmIoTune. The VM state is declared unknown in the oVirt web interface, and a question mark is shown next to the VM.

Version-Release number of selected component (if applicable):

oVirt 4.1.2
vdsm 4.19.15

How reproducible:

100%

Steps to Reproduce:
1. Start VM with multiple disks (on NFS storage)
2. Start multiple disk migrations (5 in our case)

Actual results:

Migration gives errors. The state of the VM is set to unknown.

Expected results:

Migrations should succeed.

Additional info:

VM disks are attached via NFS protocol version 4.2. No cross-checks with NFS 4.0/4.1 were executed.
Comment 1 Markus Stockhausen 2017-07-05 04:20 EDT
Created attachment 1294492 [details]
vdsm log of hypervisor
Comment 2 Markus Stockhausen 2017-07-05 04:21:45 EDT
Log boils down to:

<<<< START VM
2017-07-04 21:23:31,459+0200 INFO  (vm/af68904d) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt="http://ovirt.org/vm/tune/1.0" type="kvm">
    <name>xxx_test</name>
...
<<<< Log infos about disk migrations
2017-07-04 21:38:12,714+0200 INFO  (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:34,359+0200 INFO  (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:54,653+0200 INFO  (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:15,289+0200 INFO  (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:30,138+0200 INFO  (jsonrpc/5) [vdsm.api] START diskReplicateStart
2017-07-04 21:39:38,552+0200 INFO  (jsonrpc/4) [vdsm.api] START snapshot
2017-07-04 21:39:53,699+0200 INFO  (jsonrpc/7) [vdsm.api] START diskReplicateStart
2017-07-04 21:40:08,717+0200 INFO  (jsonrpc/5) [vdsm.api] START diskReplicateStart

<<<< VDSM error 
2017-07-04 21:40:28,939+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') getVmIoTune failed (vm:2833)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2818, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
2017-07-04 21:40:28,947+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 30.02 seconds (__init__:533)
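The TimeoutError in the traceback comes from vdsm's wrapper around the libvirt domain object: when libvirt reports that it timed out waiting for the per-domain state change lock (held here by the long-running getAllDomainStats call), the wrapper re-raises the failure as a vdsm-level TimeoutError. The sketch below is illustrative only, with hypothetical names (wrap_timeout, get_io_tune) and RuntimeError standing in for libvirt.libvirtError; it is not vdsm's actual code.

```python
class TimeoutError(Exception):
    """Raised when a libvirt call cannot acquire the domain lock in time."""


def wrap_timeout(func):
    # Hypothetical decorator, loosely modeled on the pattern in
    # vdsm/virt/virdomain.py: translate a libvirt "Timed out during
    # operation" failure into a vdsm-level TimeoutError so that callers
    # can handle it uniformly.
    def f(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:  # stand-in for libvirt.libvirtError
            if "Timed out during operation" in str(e):
                raise TimeoutError(str(e))
            raise
    return f


@wrap_timeout
def get_io_tune(domain):
    # Stand-in for dom.blockIoTune(..., libvirt.VIR_DOMAIN_AFFECT_LIVE);
    # here we simulate the lock contention seen in the log above.
    raise RuntimeError(
        "Timed out during operation: cannot acquire state change lock")
```

With this translation in place, the RPC layer can log the failure and return an error response instead of crashing, which matches the 30-second getAllVmIoTunePolicies response seen in the log.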
Comment 3 Markus Stockhausen 2017-07-07 07:19:25 EDT
We received these errors during storage timeouts. The root cause is not on the oVirt side; the error here is only a symptom.

Nevertheless, one question remains: can a blocked storage domain lead to a long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I guess)? This should be a KVM domain-only query command.
Comment 4 Francesco Romani 2017-07-12 04:21:28 EDT
(In reply to Markus Stockhausen from comment #3)
> We received these errors during storage timeouts. The root cause is not on
> the oVirt side; the error here is only a symptom.
> 
> Nevertheless, one question remains: can a blocked storage domain lead to a
> long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I
> guess)? This should be a KVM domain-only query command.

Hi Markus.

Unfortunately, the answer is "yes". We depend on libvirt to get the state of the system, and if one domain is busy performing expensive state changes, like disk replication, other operations may experience long delays. This is because the qemu/kvm command protocol is strictly sequential.

So it boils down to the lower layers and their behaviour.
What oVirt can do, however, is minimize the queries to libvirt. We added, and are still adding, caches where it makes sense, leveraging the fact that oVirt is the owner of the nodes: no one can make changes outside oVirt's control.

The work we did to fix https://bugzilla.redhat.com/show_bug.cgi?id=1443654 should help in this case as well. Please try Vdsm 4.19.16 and/or oVirt 4.1.3.
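The caching idea mentioned above can be sketched as a small TTL cache: repeated pollers are served a recent in-memory snapshot instead of each triggering a libvirt getAllDomainStats call, so a slow domain cannot stall every stats query. The class and names below are illustrative, not vdsm's actual API.

```python
import time


class CachedStats:
    """Minimal TTL-cache sketch. `fetch` is the expensive call (e.g. a
    libvirt stats query); results are reused until `ttl` seconds pass."""

    def __init__(self, fetch, ttl=15.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock  # injectable for testing
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp > self._ttl:
            # Cache miss or entry expired: refresh from the backend.
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

A caller polling every few seconds then hits the backend at most once per TTL window; the trade-off is that reported stats may lag reality by up to `ttl` seconds.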
Comment 5 Francesco Romani 2017-07-12 04:22:25 EDT
Taking the bug and keeping it open for some more time, but given comments 1-3 there is not much we can do here; there is no evidence of an oVirt issue.
Comment 6 Francesco Romani 2017-07-12 04:33:04 EDT
Actually, it is better to close this. Please reopen if there is new evidence.
