Bug 1141159
Summary: libvirtd crash while resuming the migrating guest on the target host

| Field | Value |
| --- | --- |
| Product | Red Hat Enterprise Linux 7 |
| Component | libvirt |
| Version | 7.1 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | zhenfeng wang <zhwang> |
| Assignee | Martin Kletzander <mkletzan> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | dyuan, lhuang, mkletzan, mzhan, rbalakri, ydu, zhwang, zpeng |
| Keywords | Upstream |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | libvirt-1.2.13-1.el7 |
| Doc Type | Bug Fix |
| Type | Bug |
| Regression | --- |
| Last Closed | 2015-11-19 05:47:06 UTC |
Description (zhenfeng wang, 2014-09-12 11:00:15 UTC)

Created attachment 936921 [details]: The libvirtd crash dump info
Could you please describe step 3 more closely? Do you mean something like this?

    T2# while :; do virsh resume rhel62; done
    T1# while :; do timeout -s TERM 2s virsh migrate --live --copy-storage-all rhel62 qemu+ssh://$target_ip/system --verbose; done

Sorry for seeing this needinfo so late; I missed it earlier. What I did in step 3 was the following:

1. Do a storage migration from the source to the target.
2. Resume the guest on the target before the migration finishes.
3. Cancel the migration on the source host when resuming the guest on the target host fails.
4. Repeat steps 1-3.

Created attachment 958455 [details]: The log about libvirtd
I managed to find the root cause. The libvirt codebase in the qemu driver relies on the fact that the one implicit reference held by the domain list is enough to do basic things while the domain is locked, and whenever the domain is unlocked during an API call, that API should acquire a job, which increases the reference counter. However, if waiting for the job times out, the reference is dropped, and by the time the object is being unlocked it might not exist any more (it might have been removed while the API was waiting for the job to be acquired). It is a rare scenario, but definitely possible. I'm reworking the internals to fix the reference counting issue.

Fixed upstream with v1.2.11-68-g540c339:

    commit 540c339a2535ec30d79e5ef84d8f50a17bc60723
    Author: Martin Kletzander <mkletzan>
    Date:   Thu Dec 4 14:41:36 2014 +0100

        qemu: completely rework reference counting

I can reproduce this on build libvirt-1.2.8-2.el7.x86_64 and verified it on build libvirt-1.2.15-2.el7.x86_64.

Steps:

1. Prepare a guest with a local image on the source host:

        # virsh list
         Id    Name    State
        ----------------------------------------------------
         16    rh7     running

        # virsh domblklist rh7
        Target     Source
        ------------------------------------------------
        vda        /var/lib/libvirt/images/rhel7.0-3.qcow2

2. Create a blank image on the target host:

        # qemu-img create -f qcow2 rhel7.0-3.qcow2 8G
        Formatting 'rhel7.0-3.qcow2', fmt=qcow2 size=8589934592 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Repeat the following actions several times; check the libvirtd process ID on both the source and target hosts before migrating:

    1. Migrate the guest to the target host. On the source:

            # virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

    2. Run the virsh resume command on the target host. On the target:

            # virsh resume rh7
            error: Failed to resume domain rh7
            error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)

    3. After the virsh resume command finishes, cancel the migration on the source host. On the source:

            # virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose
            root@10.66.106.26's password:
            ^Cerror: operation aborted: migration out: canceled by client

    4. Restart the migration to the target host. On the source:

            # virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

    5. Run the virsh resume command on the target host. On the target:

            # virsh resume rh7

    6. Before the virsh resume command finishes, cancel the migration on the source host. On the source:

            # virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose
            root@10.66.106.26's password:
            ^Cerror: operation aborted: migration out: canceled by client

        On the target:

            # virsh resume rh7
            error: Failed to resume domain rh7
            error: Requested operation is not valid: domain is not running

    7. Restart the migration to the target host. On the source:

            # virsh migrate --live --copy-storage-all rh7 qemu+ssh://10.66.106.26/system --verbose

Repeating steps 1-7 for 5-10 times and then rechecking the libvirtd process ID on the source and target hosts shows that it does not change, so this bug is moved to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html