Bug 659310
| Summary: | [libvirt] deadlock on concurrent multiple bidirectional migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | bazulay, berrange, cpelland, dallan, danken, dyuan, eblake, hateya, iheim, jdenemar, juzhang, mgoldboi, mjenner, mkenneth, plyons, riek, vbian, weizhan, xen-maint, yeylon, yimwang, ykaul |
| Target Milestone: | rc | Keywords: | TestBlocker, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.8.6-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 13:24:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 629625, 662043 | | |
| Attachments: | | | |
Thread 2 shows the problem area:
#3 0x000000353688e370 in call (conn=0x7f087809eb40, priv=0x7f0878045240, flags=0, proc_nr=<value optimized out>,
args_filter=0x353689fd50 <xdr_remote_supports_feature_args>, args=<value optimized out>, ret_filter=0x353689fd30 <xdr_remote_supports_feature_ret>,
ret=0x7f0895ff58f0 "") at remote/remote_driver.c:9938
#4 0x0000003536898ca0 in remoteSupportsFeature (conn=0x7f087809eb40, feature=4) at remote/remote_driver.c:1513
#5 0x0000000000451de7 in doPeer2PeerMigrate (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968,
uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system", flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11547
#6 qemudDomainMigratePerform (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968, uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system",
flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11631
It is making a call into the remote libvirtd while still holding the driver lock. The virConnectOpen and VIR_DRV_SUPPORTS_FEATURE API calls need to be surrounded with calls to qemuDomainObjEnterRemoteWithDriver and qemuDomainObjExitRemoteWithDriver, as per commit f0c8e1cb3774d6f09e2681ca1988bf235a343007.
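To make the locking problem concrete, here is a minimal standalone sketch using plain pthreads. It is not libvirt's actual code: the feature number and the function names remote_supports_feature/migrate_perform_* are made up for illustration. It only shows the shape of the bug (blocking remote call issued with the driver lock held) and of the fix (drop the lock around the remote call, loosely mirroring what qemuDomainObjEnterRemoteWithDriver/qemuDomainObjExitRemoteWithDriver do).

/* Standalone sketch, NOT libvirt code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for an RPC to the peer libvirtd; in the real bug this was
 * remoteSupportsFeature()/virConnectOpen(), which can block until the peer
 * answers -- something the peer cannot do if it is itself waiting on our
 * driver lock. */
static int remote_supports_feature(int feature)
{
    sleep(1);               /* pretend the round trip takes a while */
    return feature == 4;    /* hypothetical: peer supports feature 4 */
}

/* Deadlock-prone shape: the RPC is issued while driver_lock is held. */
static int migrate_perform_buggy(void)
{
    int supported;

    pthread_mutex_lock(&driver_lock);
    supported = remote_supports_feature(4);   /* blocks with the lock held */
    pthread_mutex_unlock(&driver_lock);
    return supported;
}

/* Fixed shape: release the lock around the RPC so the two daemons can
 * never wait on each other's driver locks. */
static int migrate_perform_fixed(void)
{
    int supported;

    pthread_mutex_lock(&driver_lock);
    /* ... work that genuinely needs the driver lock ... */

    pthread_mutex_unlock(&driver_lock);       /* "enter remote" */
    supported = remote_supports_feature(4);   /* peer can call back in */
    pthread_mutex_lock(&driver_lock);         /* "exit remote" */

    /* ... continue with the lock held ... */
    pthread_mutex_unlock(&driver_lock);
    return supported;
}

int main(void)
{
    printf("buggy shape: %d, fixed shape: %d\n",
           migrate_perform_buggy(), migrate_perform_fixed());
    return 0;
}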
This is fixed upstream by v0.8.6-38-g4186f92 and v0.8.6-39-g584c13f:
commit 4186f92935e6bb5057b2db14f47dfd817ab0ab84
Author: Jiri Denemark <jdenemar>
Date: Fri Dec 3 09:31:48 2010 +0100
Change return value of VIR_DRV_SUPPORTS_FEATURE to bool
virDrvSupportsFeature API is allowed to return -1 on error while all but
one uses of VIR_DRV_SUPPORTS_FEATURE only check for (non)zero return
value. Let's make this macro return zero on error, which is what
everyone expects anyway.
commit 584c13f3560fca894c568db39b81a856db1387cb
Author: Jiri Denemark <jdenemar>
Date: Fri Dec 3 10:48:31 2010 +0100
qemu: Fix a possible deadlock in p2p migration
Two more calls to remote libvirtd have to be surrounded by
qemuDomainObjEnterRemoteWithDriver() and
qemuDomainObjExitRemoteWithDriver() to prevent possible deadlock between
two communicating libvirt daemons.
See commit f0c8e1cb3774d6f09e2681ca1988bf235a343007 for further details.
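To illustrate the first of the two commits above, here is a small standalone C sketch (not libvirt's actual macro or driver code; drv_supports_feature and the feature number are made up). With a raw int return where -1 means error, a plain truth test treats the error as "feature supported", which is why collapsing the result to a boolean is safer.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical driver callback: 1 = supported, 0 = not supported, -1 = error. */
static int drv_supports_feature(int feature)
{
    if (feature < 0)
        return -1;          /* simulated error path */
    return feature == 4;
}

/* Pre-fix style: passes the raw tri-state value through. */
static int supports_feature_raw(int feature)
{
    return drv_supports_feature(feature);
}

/* Post-fix style: collapse errors to "not supported". */
static bool supports_feature_bool(int feature)
{
    return drv_supports_feature(feature) > 0;
}

int main(void)
{
    /* With the raw variant, the -1 error is truthy and mistaken for support. */
    if (supports_feature_raw(-1))
        printf("raw check: error mistaken for support\n");

    if (!supports_feature_bool(-1))
        printf("bool check: error treated as not supported\n");
    return 0;
}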
Verified as PASSED with the following builds:
libvirt-0.8.1-29.el6.x86_64
libvirt-client-0.8.1-29.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
qemu-img-0.12.1.2-2.128.el6.x86_64
kernel-2.6.32-93.el6.x86_64
Steps:
1. Create 10 VMs on each side.
#iptables -F
# setsebool -P virt_use_nfs 1
2. Distribute the ssh public key of the source host to the target host.
#ssh-keygen -t rsa
#ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostIP
3. Start VMs on each side.
#for i in {11..20};do virsh start vm$i;done
or
#for i in {1..10};do virsh start vm$i;done
4. Run concurrent bidirectional migration.
On server 1:
# for i in `seq 11 20`;do time virsh migrate --live vm$i qemu+ssh://10.66.93.59/system ; virsh list --all; done
On server 2:
# for i in `seq 1 10`;do time virsh migrate --live vm$i qemu+ssh://10.66.93.206/system ; virsh list --all; done
5. Check the output of step 4; virsh list works fine.
Verified on
kernel-2.6.32-92.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
Migration with 10 guests on each side running concurrently can finish successfully, but during the first several seconds of the migration, running "virsh list" produces some errors:
# virsh list
error: cannot send data: Broken pipe
error: failed to connect to the hypervisor
# virsh list
error: cannot recv data: : Connection reset by peer
error: failed to connect to the hypervisor
But after several seconds it lists normally, which means libvirtd is still alive:
# virsh list
Id Name State
----------------------------------
176 mig1 running
177 mig2 running
178 mig3 running
179 mig4 running
180 mig5 running
181 mig6 running
182 mig7 running
183 mig8 running
184 mig9 running
185 mig18 paused
186 mig13 paused
187 mig10 paused
188 mig11 paused
189 mig16 paused
190 mig14 paused
191 mig15 paused
192 mig19 paused
193 mig12 paused
Each guest is set to 256M of memory and both of my machines have 8G of memory.
I use the following command:
#!/bin/sh
for i in {0..9};
do
ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
./migrate-cmd.sh mig$i
done
where migrate-cmd.sh on both sides contains:
#!/bin/sh
virsh migrate --live $1 qemu+ssh://{ip addr}/system &
So I want to ask: is this a bug?
You're likely hitting the connection limit in libvirtd. Each virsh migrate results in one connection to the source and one connection to the target. So you end up with 20 open connections to each libvirtd, which is the default limit on the number of connections. Try increasing max_clients in /etc/libvirt/libvirtd.conf on both hosts.

(In reply to comment #9)
> You're likely hitting the connection limit in libvirtd. Each virsh migrate results
> in one connection to the source and one connection to the target. So you end up with
> 20 open connections to each libvirtd, which is the default limit on the number of
> connections. Try increasing max_clients in /etc/libvirt/libvirtd.conf on both
> hosts.

I think you are right; after I changed max_clients to 40, the error disappeared.

According to comment 8 and comment 9, this bug is verified as passed on:
kernel-2.6.32-94.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
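For reference, a minimal sketch of the change suggested in comment 9 above (40 is simply the value used in this report; pick a limit that covers the expected number of concurrent connections).

In /etc/libvirt/libvirtd.conf on both hosts:
max_clients = 40

Then restart the daemon so the new limit takes effect:
# service libvirtd restart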
Verified on
kernel-2.6.32-113.el6.x86_64
libvirt-0.8.7-6.el6.x86_64
qemu-kvm-0.12.1.2-2.144.el6.x86_64
Migration with 10 guests on each side running concurrently can finish successfully, and virsh list works normally, which means libvirtd is still alive:
# virsh list --all
Id Name State
----------------------------------
11 test1 running
12 test2 running
13 test3 running
14 test4 running
15 test5 running
16 test6 running
17 test7 running
18 test8 running
19 test9 running
20 test10 running
21 test19 paused
22 test14 paused
23 test18 paused
24 test15 paused
25 test13 paused
26 test12 paused
27 test11 paused
28 test16 paused
29 test20 paused
30 test17 paused
Using the command:
#!/bin/sh
for i in {0..9};
do
ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
./migrate-cmd.sh mig$i
done
where migrate-cmd.sh on both sides contains:
#!/bin/sh
virsh migrate --live $1 qemu+ssh://{ip addr}/system &
So, setting the bug status to VERIFIED.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0596.html
Created attachment 464253 [details]
gdb

Description of problem:
The libvirt service is in a deadlock when running concurrent bidirectional migration. When I try to run 'virsh list', I get no response from the shell. gdb output attached.

Repro steps:
1) Run with 2 hosts.
2) Run the migrate command from both hosts concurrently, so VMs running on server 1 will migrate to server 2 and VMs running on server 2 will migrate to server 1.

libvirt-0.8.1-28.el6.x86_64
2.6.32-71.7.1.el6.x86_64
Red Hat Enterprise Linux Server release 6.0 (Santiago)
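For completeness, a per-thread backtrace like the attached gdb output can typically be captured from the running daemon with a command along these lines (a general example, not necessarily the exact command used for the attachment):

# gdb -batch -ex 'thread apply all bt' -p $(pidof libvirtd) > libvirtd-backtrace.txt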