Bug 1467806 - getVmIoTune failed during parallel disk migration
Status: CLOSED NOTABUG
Product: vdsm
Classification: oVirt
Component: General
Version: 4.19.15
Severity: medium
Assigned To: Francesco Romani
Raz Tamir
Reported: 2017-07-05 04:18 EDT by Markus Stockhausen
Modified: 2017-07-12 04:33 EDT (History)

Last Closed: 2017-07-12 04:33:04 EDT
Type: Bug
oVirt Team: Virt


Attachments
vdsm log of hypervisor (545.40 KB, application/zip)
2017-07-05 04:20 EDT, Markus Stockhausen

Description Markus Stockhausen 2017-07-05 04:18:58 EDT
Description of problem:

When running parallel disk migrations of one VM, vdsm reports errors in getVmIoTune. The VM state is declared unknown in the oVirt web interface, and a question mark is shown next to the VM.

Version-Release number of selected component (if applicable):

oVirt 4.1.2
vdsm 4.19.15

How reproducible:

100%

Steps to Reproduce:
1. Start VM with multiple disks (on NFS storage)
2. Start multiple disk migrations (5 in our case)

Actual results:

Migration gives errors. The state of the VM is set to unknown.

Expected results:

Migrations should succeed.

Additional info:

VM disks are attached via NFS protocol version 4.2. No cross-checks with NFS 4.0/4.1 were executed.
Comment 1 Markus Stockhausen 2017-07-05 04:20 EDT
Created attachment 1294492 [details]
vdsm log of hypervisor
Comment 2 Markus Stockhausen 2017-07-05 04:21:45 EDT
Log boils down to:

<<<< START VM
2017-07-04 21:23:31,459+0200 INFO  (vm/af68904d) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') <?xml version='1.0' encoding='UTF-8'?>
<domain xmlns:ovirt="http://ovirt.org/vm/tune/1.0" type="kvm">
    <name>xxx_test</name>
...
<<<< Log infos about disk migrations
2017-07-04 21:38:12,714+0200 INFO  (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:34,359+0200 INFO  (jsonrpc/1) [vdsm.api] START snapshot
2017-07-04 21:38:54,653+0200 INFO  (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:15,289+0200 INFO  (jsonrpc/7) [vdsm.api] START snapshot
2017-07-04 21:39:30,138+0200 INFO  (jsonrpc/5) [vdsm.api] START diskReplicateStart
2017-07-04 21:39:38,552+0200 INFO  (jsonrpc/4) [vdsm.api] START snapshot
2017-07-04 21:39:53,699+0200 INFO  (jsonrpc/7) [vdsm.api] START diskReplicateStart
2017-07-04 21:40:08,717+0200 INFO  (jsonrpc/5) [vdsm.api] START diskReplicateStart

<<<< VDSM error 
2017-07-04 21:40:28,939+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='af68904d-d140-4d02-a1d7-118696d6ade7') getVmIoTune failed (vm:2833)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2818, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
2017-07-04 21:40:28,947+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 30.02 seconds (__init__:533)
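The TimeoutError in the traceback comes from vdsm's wrapper around the libvirt domain object: when libvirt reports that it timed out waiting for the per-domain state change lock (held here by the long-running getAllDomainStats call), the wrapper re-raises the failure as a vdsm-level TimeoutError. The sketch below is illustrative only, with hypothetical names (wrap_timeout, get_io_tune) and RuntimeError standing in for libvirt.libvirtError; it is not vdsm's actual code.

```python
class TimeoutError(Exception):
    """Raised when a libvirt call cannot acquire the domain lock in time."""


def wrap_timeout(func):
    # Hypothetical decorator, loosely modeled on the pattern in
    # vdsm/virt/virdomain.py: translate a libvirt "Timed out during
    # operation" failure into a vdsm-level TimeoutError so that callers
    # can handle it uniformly.
    def f(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError as e:  # stand-in for libvirt.libvirtError
            if "Timed out during operation" in str(e):
                raise TimeoutError(str(e))
            raise
    return f


@wrap_timeout
def get_io_tune(domain):
    # Stand-in for dom.blockIoTune(..., libvirt.VIR_DOMAIN_AFFECT_LIVE);
    # here we simulate the lock contention seen in the log above.
    raise RuntimeError(
        "Timed out during operation: cannot acquire state change lock")
```

With this translation in place, the RPC layer can log the failure and return an error response instead of crashing, which matches the 30-second getAllVmIoTunePolicies response seen in the log.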
Comment 3 Markus Stockhausen 2017-07-07 07:19:25 EDT
We received these errors during storage timeouts. The root cause is not on the oVirt side; the error here is only a symptom.

Nevertheless, one question remains: can a blocked storage domain lead to a long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I guess)? This should be a KVM domain-only query command.
Comment 4 Francesco Romani 2017-07-12 04:21:28 EDT
(In reply to Markus Stockhausen from comment #3)
> We received these errors during storage timeouts. The root cause is not on
> the oVirt side; the error here is only a symptom.
> 
> Nevertheless, one question remains: can a blocked storage domain lead to a
> long-held lock in remoteDispatchConnectGetAllDomainStats (libvirt, I
> guess)? This should be a KVM domain-only query command.

Hi Markus.

Unfortunately, the answer is "yes". We depend on libvirt to get the state of the system, and if one domain is busy performing expensive state changes, like disk replication, other operations may experience long delays. This is because the qemu/kvm command protocol is strictly sequential.

So it boils down to the lower layers and their behaviour.
What oVirt can do, however, is minimize the queries to libvirt. We added, and are still adding, caches where it makes sense, leveraging the fact that oVirt is the owner of the nodes: no one can make changes outside oVirt's control.

The work we did to fix https://bugzilla.redhat.com/show_bug.cgi?id=1443654 should help in this case as well. Please try Vdsm 4.19.16 and/or oVirt 4.1.3.
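The caching idea mentioned above can be sketched as a small TTL cache: repeated pollers are served a recent in-memory snapshot instead of each triggering a libvirt getAllDomainStats call, so a slow domain cannot stall every stats query. The class and names below are illustrative, not vdsm's actual API.

```python
import time


class CachedStats:
    """Minimal TTL-cache sketch. `fetch` is the expensive call (e.g. a
    libvirt stats query); results are reused until `ttl` seconds pass."""

    def __init__(self, fetch, ttl=15.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock  # injectable for testing
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp > self._ttl:
            # Cache miss or entry expired: refresh from the backend.
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

A caller polling every few seconds then hits the backend at most once per TTL window; the trade-off is that reported stats may lag reality by up to `ttl` seconds.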
Comment 5 Francesco Romani 2017-07-12 04:22:25 EDT
Taking the bug and keeping it open for some more time, but given comments 1-3 there is not much we can do here; there is no evidence of an oVirt issue.
Comment 6 Francesco Romani 2017-07-12 04:33:04 EDT
Actually, it is better to close this. Please reopen if there is new evidence.
