Bug 659310
Summary: [libvirt] deadlock on concurrent multiple bidirectional migration

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Haim <hateya>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Virtualization Bugs <virt-bugs>
CC: bazulay, berrange, cpelland, dallan, danken, dyuan, eblake, hateya, iheim, jdenemar, juzhang, mgoldboi, mjenner, mkenneth, plyons, riek, vbian, weizhan, xen-maint, yeylon, yimwang, ykaul
Target Milestone: rc
Keywords: TestBlocker, ZStream
Fixed In Version: libvirt-0.8.6-1.el6
Doc Type: Bug Fix
Doc Text: A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons.
Last Closed: 2011-05-19 13:24:37 UTC
Bug Blocks: 629625, 662043

Thread 2 is showing the problem area:

    #3  0x000000353688e370 in call (conn=0x7f087809eb40, priv=0x7f0878045240, flags=0, proc_nr=<value optimized out>, args_filter=0x353689fd50 <xdr_remote_supports_feature_args>, args=<value optimized out>, ret_filter=0x353689fd30 <xdr_remote_supports_feature_ret>, ret=0x7f0895ff58f0 "") at remote/remote_driver.c:9938
    #4  0x0000003536898ca0 in remoteSupportsFeature (conn=0x7f087809eb40, feature=4) at remote/remote_driver.c:1513
    #5  0x0000000000451de7 in doPeer2PeerMigrate (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968, uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system", flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11547
    #6  qemudDomainMigratePerform (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968, uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system", flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11631

It is making a call into the remote libvirtd while still holding the driver lock. The virConnectOpen and VIR_DRV_SUPPORTS_FEATURE API calls need to be surrounded with calls to qemuDomainObjEnterRemoteWithDriver and qemuDomainObjExitRemoteWithDriver, as per commit f0c8e1cb3774d6f09e2681ca1988bf235a343007.

This is fixed upstream by v0.8.6-38-g4186f92 and v0.8.6-39-g584c13f:

    commit 4186f92935e6bb5057b2db14f47dfd817ab0ab84
    Author: Jiri Denemark <jdenemar>
    Date:   Fri Dec 3 09:31:48 2010 +0100

        Change return value of VIR_DRV_SUPPORTS_FEATURE to bool

        virDrvSupportsFeature API is allowed to return -1 on error while
        all but one uses of VIR_DRV_SUPPORTS_FEATURE only check for
        (non)zero return value. Let's make this macro return zero on
        error, which is what everyone expects anyway.
    commit 584c13f3560fca894c568db39b81a856db1387cb
    Author: Jiri Denemark <jdenemar>
    Date:   Fri Dec 3 10:48:31 2010 +0100

        qemu: Fix a possible deadlock in p2p migration

        Two more calls to remote libvirtd have to be surrounded by
        qemuDomainObjEnterRemoteWithDriver() and
        qemuDomainObjExitRemoteWithDriver() to prevent possible deadlock
        between two communicating libvirt daemons. See commit
        f0c8e1cb3774d6f09e2681ca1988bf235a343007 for further details.

Verified as PASSED with build:

    libvirt-0.8.1-29.el6.x86_64
    libvirt-client-0.8.1-29.el6.x86_64
    qemu-kvm-0.12.1.2-2.128.el6.x86_64
    qemu-img-0.12.1.2-2.128.el6.x86_64
    kernel-2.6.32-93.el6.x86_64

Steps:

1. Create 10 VMs on each side.

    # iptables -F
    # setsebool -P virt_use_nfs 1

2. Distribute the SSH public key of the source host to the target host:

    # ssh-keygen -t rsa
    # ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostIP

3. Start the VMs on each side:

    # for i in {11..20}; do virsh start vm$i; done

   or

    # for i in {1..10}; do virsh start vm$i; done

4. Run concurrent bidirectional migration.

   On server 1:

    # for i in `seq 11 20`; do time virsh migrate --live vm$i qemu+ssh://10.66.93.59/system; virsh list --all; done

   On server 2:

    # for i in `seq 1 10`; do time virsh migrate --live vm$i qemu+ssh://10.66.93.206/system; virsh list --all; done

5. Check the output of step 4; virsh list works fine.
Verified on:

    kernel-2.6.32-92.el6.x86_64
    libvirt-0.8.6-1.el6.x86_64
    qemu-kvm-0.12.1.2-2.128.el6.x86_64

Migration on both sides with 10 guests each concurrently can finish successfully, but during the first several seconds of migration, running "virsh list" produces errors:

    # virsh list
    error: cannot send data: Broken pipe
    error: failed to connect to the hypervisor

    # virsh list
    error: cannot recv data: : Connection reset by peer
    error: failed to connect to the hypervisor

After several seconds it lists normally, which means libvirtd is still alive:

    # virsh list
     Id  Name                 State
    ----------------------------------
    176  mig1                 running
    177  mig2                 running
    178  mig3                 running
    179  mig4                 running
    180  mig5                 running
    181  mig6                 running
    182  mig7                 running
    183  mig8                 running
    184  mig9                 running
    185  mig18                paused
    186  mig13                paused
    187  mig10                paused
    188  mig11                paused
    189  mig16                paused
    190  mig14                paused
    191  mig15                paused
    192  mig19                paused
    193  mig12                paused

Each guest is set to 256M of memory and both of my machines have 8G.

The command used:

    #!/bin/sh
    for i in {0..9}; do
        ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
        ./migrate-cmd.sh mig$i
    done

migrate-cmd.sh on both sides:

    #!/bin/sh
    virsh migrate --live $1 qemu+ssh://{ip addr}/system &

So I want to ask: is this a bug?

You're likely hitting the connection limit in libvirtd. Each virsh migrate results in one connection to the source and one connection to the target. So you end up with 20 open connections to each libvirtd, which is the default limit on the number of connections. Try increasing max_clients in /etc/libvirt/libvirtd.conf on both hosts.

(In reply to comment #9)
> You're likely hitting connection limit in libvirtd. Each virsh migrate
> results in one connection to source and one connection to target. So you
> end up with 20 open connections to each libvirtd, which is the default
> limit on number of connections. Try increasing max_clients in
> /etc/libvirt/libvirtd.conf on both hosts.
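The suggestion in comment 9 amounts to the following edit on both hosts; 40 is just the value that covers this test's 20 concurrent migration connections plus the extra virsh monitoring connections, not a general recommendation:

```
# /etc/libvirt/libvirtd.conf
# The default limit is 20 clients. With 10 outbound and 10 inbound
# concurrent migrations, each daemon already holds 20 connections,
# so any additional virsh session is refused. Raise the limit:
max_clients = 40
```

libvirtd must be restarted (on RHEL 6, `service libvirtd restart`) for the change to take effect.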
I think you are right; after I changed max_clients to 40, the errors disappeared.

According to comment 8 and comment 9, this bug is verified as PASSED on:

    kernel-2.6.32-94.el6.x86_64
    libvirt-0.8.6-1.el6.x86_64
    qemu-kvm-0.12.1.2-2.128.el6.x86_64

Verified on:

    kernel-2.6.32-113.el6.x86_64
    libvirt-0.8.7-6.el6.x86_64
    qemu-kvm-0.12.1.2-2.144.el6.x86_64

Migration on both sides with 10 guests each concurrently can finish successfully, and virsh list works normally, which means libvirtd is still alive:

    # virsh list --all
     Id  Name                 State
    ----------------------------------
     11  test1                running
     12  test2                running
     13  test3                running
     14  test4                running
     15  test5                running
     16  test6                running
     17  test7                running
     18  test8                running
     19  test9                running
     20  test10               running
     21  test19               paused
     22  test14               paused
     23  test18               paused
     24  test15               paused
     25  test13               paused
     26  test12               paused
     27  test11               paused
     28  test16               paused
     29  test20               paused
     30  test17               paused

The command used:

    #!/bin/sh
    for i in {0..9}; do
        ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
        ./migrate-cmd.sh mig$i
    done

migrate-cmd.sh on both sides:

    #!/bin/sh
    virsh migrate --live $1 qemu+ssh://{ip addr}/system &

So, setting bug status to VERIFIED.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0596.html
Created attachment 464253 [details]
gdb

Description of problem:
The libvirt service is in deadlock when running concurrent bidirectional migration. When I try to run 'virsh list', I don't get any response from the shell. gdb output attached.

Repro steps:
1) Run with 2 hosts.
2) Run the migrate command from both hosts concurrently, so that VMs running on server 1 migrate to server 2 and VMs running on server 2 migrate to server 1.

Versions:
libvirt-0.8.1-28.el6.x86_64
2.6.32-71.7.1.el6.x86_64
Red Hat Enterprise Linux Server release 6.0 (Santiago)