Bug 867412

Summary:	libvirt fails to clear async job when p2p migration fails early
Product:	Red Hat Enterprise Linux 6
Component:	libvirt
Version:	6.4
Hardware:	All
OS:	Linux
Status:	CLOSED ERRATA
Severity:	medium
Priority:	low
Target Milestone:	rc
Reporter:	Jiri Denemark <jdenemar>
Assignee:	Jiri Denemark <jdenemar>
QA Contact:	Virtualization Bugs <virt-bugs>
CC:	acathrow, dyasny, dyuan, mzhan, rwu, weizhan, zhpeng
Fixed In Version:	libvirt-0.10.2-5.el6
Doc Type:	Bug Fix
Type:	Bug
Last Closed:	2013-02-21 07:10:28 UTC

Description Jiri Denemark 2012-10-17 13:16:23 UTC

Description of problem:

When p2p migration fails early because qemuMigrationIsAllowed or
qemuMigrationIsSafe reports that the migration must not proceed, we fail
to clear the migration-out async job. As a result, further APIs called
for the same domain may fail with "Timed out during operation: cannot
acquire state change lock".
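
The failure mode can be illustrated with a small toy model. This is only a sketch in Python, not libvirt's C code: the Domain class, the exception types, and every function name below are hypothetical, and the model raises immediately where the real daemon would wait for the lock before timing out. It assumes the async job is started before the safety check runs, which is what the description above implies.

```python
class UnsafeMigration(Exception):
    """Stands in for the 'Unsafe migration' error from the safety check."""


class StateChangeLockTimeout(Exception):
    """Stands in for 'cannot acquire state change lock'."""


class Domain:
    """Hypothetical model of a domain that allows one async job at a time."""

    def __init__(self, cache="writeback"):
        self.cache = cache
        self.async_job = None  # None means no async job is active

    def begin_async_job(self, name):
        # The real daemon waits before giving up; this model fails at once.
        if self.async_job is not None:
            raise StateChangeLockTimeout("cannot acquire state change lock")
        self.async_job = name

    def end_async_job(self):
        self.async_job = None


def migration_is_safe(dom):
    # Simplified stand-in for the safety check: only cache=none passes.
    return dom.cache == "none"


def migrate_p2p_buggy(dom):
    dom.begin_async_job("migration-out")
    if not migration_is_safe(dom):
        # Early exit: the async job is never cleared.
        raise UnsafeMigration("disks use cache != none")
    # ... the migration itself would run here ...
    dom.end_async_job()


def migrate_p2p_fixed(dom):
    dom.begin_async_job("migration-out")
    try:
        if not migration_is_safe(dom):
            raise UnsafeMigration("disks use cache != none")
        # ... the migration itself would run here ...
    finally:
        # Every exit path, including the early failure, clears the job.
        dom.end_async_job()


def attempt(migrate, dom):
    """Run one migration attempt and return the name of the outcome."""
    try:
        migrate(dom)
    except (UnsafeMigration, StateChangeLockTimeout) as exc:
        return type(exc).__name__
    return "ok"
```

In this model, two calls to attempt(migrate_p2p_buggy, dom) on the same Domain end in UnsafeMigration and then StateChangeLockTimeout, while migrate_p2p_fixed ends in UnsafeMigration every time.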

Version-Release number of selected component (if applicable):

libvirt-0.10.2-4.el6, introduced upstream in 0.9.5

How reproducible:

100%

Steps to Reproduce:
1. create and start a domain with disk on NFS and cache != none
2. virsh migrate --p2p $URI $DOM
3. virsh migrate --p2p $URI $DOM
  
Actual results:

Step 2 correctly results in:

error: Unsafe migration: Migration may lead to data corruption if disks use cache != none

Step 3 will time out after 30 seconds and report:

error: Timed out during operation: cannot acquire state change lock


Expected results:

No matter how many times we try to migrate the domain, it should still report error: Unsafe migration: Migration may lead to data corruption if disks use cache != none

Additional info:

Comment 1 Jiri Denemark 2012-10-17 13:17:47 UTC

Patch sent upstream: https://www.redhat.com/archives/libvir-list/2012-October/msg00891.html

Comment 2 zhpeng 2012-10-18 02:48:50 UTC

I can reproduce this with: libvirt-0.10.2-4.el6.x86_64

virsh # migrate aaa --p2p qemu+ssh://10.66.7.161/system --unsafe 
error: Timed out during operation: cannot acquire state change lock

Comment 3 Jiri Denemark 2012-10-18 09:17:58 UTC

Fixed upstream by v0.10.2-191-g837993d:

commit 837993d845a32bb222959a84d1c03a0c47f785be
Author: Jiri Denemark <jdenemar>
Date:   Wed Oct 17 14:08:17 2012 +0200

    qemu: Clear async job when p2p migration fails early
    
    When p2p migration fails early because qemuMigrationIsAllowed or
    qemuMigrationIsSafe say migration should be cancelled, we fail to clear
    the migration-out async job. As a result of that, further APIs called
    for the same domain may fail with Timed out during operation: cannot
    acquire state change lock.
    
    Reported by Guido Winkelmann.

Comment 4 Jiri Denemark 2012-10-18 09:18:58 UTC

In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-October/msg00920.html

Comment 6 zhpeng 2012-10-24 06:23:37 UTC

Test with:


# virsh dumpxml rhel63q
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/virt/rhel63q.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
...

# virsh migrate rhel63q --p2p qemu+ssh://10.66.7.161/system --unsafe

The results are the same with cache set to "default", "writethrough", and "directsync".


So it's verified.

Comment 7 zhpeng 2012-11-22 03:41:26 UTC

Correction:

"default", "none", "writethrough", "writeback", and "unsafe" work well.
Our qemu does not support "directsync" yet.

Comment 8 errata-xmlrpc 2013-02-21 07:10:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html