Bug 1141159

Summary: libvirtd crash while resume the migrating guest in the target host
Product: Red Hat Enterprise Linux 7
Reporter: zhenfeng wang <zhwang>
Component: libvirt
Assignee: Martin Kletzander <mkletzan>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 7.1
CC: dyuan, lhuang, mkletzan, mzhan, rbalakri, ydu, zhwang, zpeng
Target Milestone: rc
Keywords: Upstream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-1.2.13-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-19 05:47:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                    Flags
The libvirtd crash dump info   none
The log about libvirtd         none

Description zhenfeng wang 2014-09-12 11:00:15 UTC
Description of problem:
libvirtd crashes while resuming the migrating guest on the target host

Version-Release number of selected component (if applicable):
host:
qemu-kvm-rhev-2.1.0-3.el7.x86_64
libvirt-1.2.8-2.el7.x86_64
kernel-3.10.0-155.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a guest on the source host whose image is on a local disk (not shared with the target host)
# virsh list
 Id    Name                           State
----------------------------------------------------
 5     rhel62                          running

2. Create an empty image on the target host with the same size, directory and name as on the source host

# qemu-img create -f qcow2 /var/lib/libvirt/images/rhel62.img 8G

3. Resume the guest on the target while the migration is starting on the source. Meanwhile, repeatedly
cancel and restart the migration on the source host and resume the guest on the target host. libvirtd
crashes after a few rounds of these operations.

T1 means Terminal 1
T2 means Terminal 2

T1# virsh migrate --live --copy-storage-all rhel62 qemu+ssh://$target_ip/system --verbose
root@$target_ip's password:

T2# virsh resume rhel62
error: Failed to resume domain rhel62
error: Timed out during operation: cannot acquire state change lock

T2# virsh resume rhel62

T1# virsh migrate --live --copy-storage-all rhel62 qemu+ssh://$target_ip/system --verbose
root@$target_ip's password:
^Cerror: operation aborted: migration out: canceled by client

T1# virsh migrate --live --copy-storage-all rhel62 qemu+ssh://$target_ip/system --verbose
root@$target_ip's password:

T2# virsh resume rhel62
error: Failed to resume domain rhel62
error: Timed out during operation: cannot acquire state change lock

T1# virsh resume rhel62
error: Failed to resume domain rhel62
error: End of file while reading data: Input/output error
error: Failed to reconnect to the hypervisor

Actual results:
libvirtd crashes

Expected results:
libvirtd should not crash

Additional info:
The core dump info is in the attachment

Comment 1 zhenfeng wang 2014-09-12 11:01:10 UTC
Created attachment 936921 [details]
The libvirtd crash dump info

Comment 3 Martin Kletzander 2014-10-31 13:15:56 UTC
Could you please describe step 3 in more detail?  Do you mean something like this?

T2# while :; do virsh resume rhel62; done
T1# while :; do timeout -s TERM 2s virsh migrate --live --copy-storage-all rhel62 qemu+ssh://$target_ip/system --verbose; done

Comment 4 zhenfeng wang 2014-11-18 02:17:30 UTC
Sorry for the late reply, I missed this needinfo earlier. What I did in step 3 was the following:

1. Do a storage migration from the source to the target
2. Resume the guest on the target before the migration finishes
3. Cancel the migration on the source host once the resume on the
target host fails
4. Repeat steps 1~3
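
One round of the steps above could be scripted roughly as follows. This is only a sketch, not the exact procedure used for this report. It assumes both commands are run from the source host (the resume is sent to the target through virsh's -c connection URI), that key-based SSH is set up so virsh does not prompt for a password, and that the migration is cancelled with SIGTERM, as in the timeout loop suggested in comment 3, rather than with Ctrl-C. The function name repro_round and the REPRO_DELAY variable are hypothetical.

```shell
# Hypothetical helper: one round of steps 1-3.
# Usage: repro_round DOMAIN TARGET_HOST
repro_round() {
    local dom=$1 target=$2
    local uri="qemu+ssh://${target}/system"

    # Step 1: start the storage migration in the background.
    virsh migrate --live --copy-storage-all "$dom" "$uri" --verbose &
    local mig_pid=$!

    # Assumed delay so that the incoming domain exists on the target.
    sleep "${REPRO_DELAY:-2}"

    # Step 2: try to resume the guest on the target before the migration finishes.
    if ! virsh -c "$uri" resume "$dom"; then
        # Step 3: the resume failed, so cancel the migration on the source.
        kill -TERM "$mig_pid" 2>/dev/null || true
    fi

    # Reap the migration client whether it was cancelled or ran to completion.
    wait "$mig_pid" 2>/dev/null || true
}
```

Step 4 is then a loop such as "while :; do repro_round rhel62 $target_ip; done", stopped by hand once libvirtd on the target stops answering.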

Comment 5 zhenfeng wang 2014-11-18 07:50:52 UTC
Created attachment 958455 [details]
The log about libvirtd

Comment 6 Martin Kletzander 2014-11-21 12:07:28 UTC
I managed to find the root cause.  The qemu driver code in libvirt assumes that the one implicit reference held by the domain list is enough to do basic things while the domain is locked, and that whenever the domain is unlocked during an API call, the API should first get a job, which increases the reference counter.  However, if waiting for the job times out, the reference is dropped, and when the object is then being unlocked, it might not exist any more (it might have been removed while the API was waiting for the job to be acquired).

It is a rare scenario, but definitely possible.  I'm reworking the internals to fix the reference counting issue.

Comment 8 Martin Kletzander 2014-12-23 06:17:22 UTC
Fixed upstream with v1.2.11-68-g540c339:

commit 540c339a2535ec30d79e5ef84d8f50a17bc60723
Author: Martin Kletzander <mkletzan>
Date:   Thu Dec 4 14:41:36 2014 +0100

    qemu: completely rework reference counting

Comment 11 vivian zhang 2015-05-22 05:04:20 UTC
I can reproduce this on build libvirt-1.2.8-2.el7.x86_64

Verified on build
libvirt-1.2.15-2.el7.x86_64


Steps:
1. Prepare a guest with a local image on the source host
# virsh list
 Id    Name                           State
----------------------------------------------------
 16    rh7                            running


# virsh domblklist rh7
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/rhel7.0-3.qcow2

2. Create a blank image on the target host
# qemu-img create -f qcow2 rhel7.0-3.qcow2 8G
Formatting 'rhel7.0-3.qcow2', fmt=qcow2 size=8589934592 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16


3. Repeat the following actions several times
Check the libvirtd process ID on the source and target hosts before the migration
1. Migrate the guest to the target host
on source:# virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

2. Run the virsh resume command on the target host
on target:# virsh resume rh7
error: Failed to resume domain rh7
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)


3. After the virsh resume command finishes, cancel the migration on the source host
on source:# virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose
root@10.66.106.26's password: 
^Cerror: operation aborted: migration out: canceled by client

4. Restart the migration to the target host
on source:# virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

5. Run the virsh resume command on the target host
on target:# virsh resume rh7

6. Before the virsh resume command finishes, cancel the migration on the source host
on source:# virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose
root@10.66.106.26's password: 
^Cerror: operation aborted: migration out: canceled by client

on target:
# virsh resume rh7
error: Failed to resume domain rh7
error: Requested operation is not valid: domain is not running

7. Restart the migration to the target host

on source:# virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

Repeat steps 1-7 for 5-10 times, then recheck the libvirtd process ID on the source and target hosts. It does not change, so I am moving this bug to verified.
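
The process ID check can be done with a pair of small helpers like the ones below. This is a sketch only: the function names daemon_pid and pid_unchanged are hypothetical, and it assumes pidof is available on both hosts. A changed or missing PID means libvirtd crashed or was restarted during the test.

```shell
# Hypothetical helpers for the "check libvirtd process ID" step.

# daemon_pid [NAME]: print the PID of the named daemon (default: libvirtd),
# or nothing if it is not running.
daemon_pid() {
    pidof "${1:-libvirtd}" || true
}

# pid_unchanged BEFORE AFTER: succeed only if the daemon was running
# before the test and still has the same PID afterwards.
pid_unchanged() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage sketch, to be run on each host:
#   before=$(daemon_pid)
#   ... repeat steps 1-7 ...
#   after=$(daemon_pid)
#   if pid_unchanged "$before" "$after"; then
#       echo "libvirtd PID unchanged"
#   else
#       echo "libvirtd crashed or was restarted"
#   fi
```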

Comment 13 errata-xmlrpc 2015-11-19 05:47:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html