Bug 867412

Summary: libvirt fails to clear async job when p2p migration fails early
Product: Red Hat Enterprise Linux 6 Reporter: Jiri Denemark <jdenemar>
Component: libvirt Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 6.4 CC: acathrow, dyasny, dyuan, mzhan, rwu, weizhan, zhpeng
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-0.10.2-5.el6 Doc Type: Bug Fix
Last Closed: 2013-02-21 07:10:28 UTC Type: Bug

Description Jiri Denemark 2012-10-17 13:16:23 UTC
Description of problem:

When p2p migration fails early because qemuMigrationIsAllowed or
qemuMigrationIsSafe say migration should be cancelled, we fail to clear
the migration-out async job. As a result of that, further APIs called
for the same domain may fail with Timed out during operation: cannot
acquire state change lock.
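
The failure mode described above can be sketched with a small self-contained model. This is only an illustration: the Domain class, the exception types, and the migrate_p2p_buggy / migrate_p2p_fixed functions are hypothetical names invented here, not libvirt's C code, and the real daemon waits for the lock before giving up instead of failing immediately.

```python
# Hypothetical model of a per-domain async job slot. None of these names
# come from libvirt; they only illustrate the cleanup problem.
class JobBusy(Exception):
    """Stands in for 'cannot acquire state change lock'."""


class UnsafeMigration(Exception):
    """Stands in for the early 'Unsafe migration' failure."""


class Domain:
    def __init__(self, cache):
        self.cache = cache
        self.async_job = None  # None means no async job is active

    def begin_async_job(self, name):
        if self.async_job is not None:
            # The real daemon would wait and then time out; the model
            # fails at once to keep the example short.
            raise JobBusy("cannot acquire state change lock")
        self.async_job = name

    def end_async_job(self):
        self.async_job = None


def migrate_p2p_buggy(dom):
    dom.begin_async_job("migration-out")
    if dom.cache != "none":
        # Early failure path: raises without clearing the async job.
        raise UnsafeMigration("disks use cache != none")
    dom.end_async_job()


def migrate_p2p_fixed(dom):
    dom.begin_async_job("migration-out")
    try:
        if dom.cache != "none":
            raise UnsafeMigration("disks use cache != none")
    finally:
        # The job is cleared on every exit path, including early failure.
        dom.end_async_job()


def attempt(migrate, dom):
    """Run one migration attempt and report which error, if any, occurred."""
    try:
        migrate(dom)
        return "ok"
    except (UnsafeMigration, JobBusy) as exc:
        return type(exc).__name__
```

In this model, two attempts with migrate_p2p_buggy on the same Domain produce UnsafeMigration followed by JobBusy, while migrate_p2p_fixed reports UnsafeMigration every time.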

Version-Release number of selected component (if applicable):

libvirt-0.10.2-4.el6, introduced upstream in 0.9.5

How reproducible:

100%

Steps to Reproduce:
1. create and start a domain with disk on NFS and cache != none
2. virsh migrate --p2p $URI $DOM
3. virsh migrate --p2p $URI $DOM
  
Actual results:

Step 2 correctly results in:

error: Unsafe migration: Migration may lead to data corruption if disks use cache != none

Step 3 times out after 30 seconds and reports:

error: Timed out during operation: cannot acquire state change lock


Expected results:

No matter how many times we try to migrate the domain, it should still report error: Unsafe migration: Migration may lead to data corruption if disks use cache != none
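
When checking the actual and expected results by script, the two error messages quoted in this report are enough to tell the cases apart. The helper below is a rough sketch: the function names and the "stale-job" / "unsafe" labels are made up for illustration, and it assumes the caller has already captured the stderr text of each virsh attempt.

```python
# Substrings taken from the two error messages quoted in this report.
UNSAFE_MSG = "Unsafe migration"
LOCK_MSG = "cannot acquire state change lock"


def classify(stderr_text):
    """Map one attempt's captured stderr to a short label (labels are illustrative)."""
    if LOCK_MSG in stderr_text:
        return "stale-job"
    if UNSAFE_MSG in stderr_text:
        return "unsafe"
    return "other"


def bug_reproduced(attempt_outputs):
    """True if any attempt after the first one hit the lock timeout."""
    kinds = [classify(text) for text in attempt_outputs]
    return "stale-job" in kinds[1:]
```

Feeding it the output of steps 2 and 3 should return True on an affected build, and False when every attempt reports the unsafe-migration error.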

Additional info:

Comment 1 Jiri Denemark 2012-10-17 13:17:47 UTC
Patch sent upstream: https://www.redhat.com/archives/libvir-list/2012-October/msg00891.html

Comment 2 zhpeng 2012-10-18 02:48:50 UTC
I can reproduce this with: libvirt-0.10.2-4.el6.x86_64

virsh # migrate aaa --p2p qemu+ssh://10.66.7.161/system --unsafe 
error: Timed out during operation: cannot acquire state change lock

Comment 3 Jiri Denemark 2012-10-18 09:17:58 UTC
Fixed upstream by v0.10.2-191-g837993d:

commit 837993d845a32bb222959a84d1c03a0c47f785be
Author: Jiri Denemark <jdenemar>
Date:   Wed Oct 17 14:08:17 2012 +0200

    qemu: Clear async job when p2p migration fails early
    
    When p2p migration fails early because qemuMigrationIsAllowed or
    qemuMigrationIsSafe say migration should be cancelled, we fail to clear
    the migration-out async job. As a result of that, further APIs called
    for the same domain may fail with Timed out during operation: cannot
    acquire state change lock.
    
    Reported by Guido Winkelmann.

Comment 6 zhpeng 2012-10-24 06:23:37 UTC
Test with:


# virsh dumpxml rhel63q
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/virt/rhel63q.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
...

# virsh migrate rhel63q --p2p qemu+ssh://10.66.7.161/system --unsafe

The results are the same with cache set to "default", "writethrough", and "directsync".


So it's verified.
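
To confirm which cache mode a test domain is actually using, the cache attribute can be read from the disk definition in the dumpxml output. The sketch below is an assumption-laden helper: the function names are invented here, a missing cache attribute is reported as "default", and the "cache != none" rule is taken only from the wording of the error message in this report (libvirt's real safety check may consider more than this).

```python
import xml.etree.ElementTree as ET


def disk_cache_modes(domain_xml):
    """Return {target dev: cache mode} for every <disk> in the given XML."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.iter("disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        dev = target.get("dev", "unknown") if target is not None else "unknown"
        # A missing cache attribute is reported as "default" (an assumption).
        cache = driver.get("cache", "default") if driver is not None else "default"
        modes[dev] = cache
    return modes


def flagged_by_cache_rule(modes):
    """Disks whose cache mode differs from 'none', per the error message wording."""
    return sorted(dev for dev, cache in modes.items() if cache != "none")
```

The input would be the full text printed by virsh dumpxml for the domain; a lone <disk> element also works.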

Comment 7 zhpeng 2012-11-22 03:41:26 UTC
correction:

"default", "none", "writethrough", "writeback", and "unsafe" work well;
our qemu does not support "directsync" yet.

Comment 8 errata-xmlrpc 2013-02-21 07:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html