Bug 659310
| Summary: | [libvirt] deadlock on concurrent multiple bidirectional migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | bazulay, berrange, cpelland, dallan, danken, dyuan, eblake, hateya, iheim, jdenemar, juzhang, mgoldboi, mjenner, mkenneth, plyons, riek, vbian, weizhan, xen-maint, yeylon, yimwang, ykaul |
| Target Milestone: | rc | Keywords: | TestBlocker, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.8.6-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 13:24:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 629625, 662043 | | |
| Attachments: | | | |
Thread 2 shows the problem area:
#3 0x000000353688e370 in call (conn=0x7f087809eb40, priv=0x7f0878045240, flags=0, proc_nr=<value optimized out>,
args_filter=0x353689fd50 <xdr_remote_supports_feature_args>, args=<value optimized out>, ret_filter=0x353689fd30 <xdr_remote_supports_feature_ret>,
ret=0x7f0895ff58f0 "") at remote/remote_driver.c:9938
#4 0x0000003536898ca0 in remoteSupportsFeature (conn=0x7f087809eb40, feature=4) at remote/remote_driver.c:1513
#5 0x0000000000451de7 in doPeer2PeerMigrate (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968,
uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system", flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11547
#6 qemudDomainMigratePerform (dom=0x7f087809c990, cookie=0x0, cookielen=2013915968, uri=0x7f08780a1b30 "qemu+tls://nott-vds1.qa.lab.tlv.redhat.com/system",
flags=3, dname=0x0, resource=0) at qemu/qemu_driver.c:11631
It is making a call into the remote libvirtd while still holding the driver lock. The virConnectOpen and VIR_DRV_SUPPORTS_FEATURE API calls need to be surrounded with calls to qemuDomainObjEnterRemoteWithDriver and qemuDomainObjExitRemoteWithDriver, as per commit f0c8e1cb3774d6f09e2681ca1988bf235a343007.
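To make the locking problem concrete, here is a minimal standalone sketch using plain pthreads. It is not libvirt's actual code: the feature number and the function names remote_supports_feature/migrate_perform_* are made up for illustration. It only shows the shape of the bug (blocking remote call issued with the driver lock held) and of the fix (drop the lock around the remote call, loosely mirroring what qemuDomainObjEnterRemoteWithDriver/qemuDomainObjExitRemoteWithDriver do).

/* Standalone sketch, NOT libvirt code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for an RPC to the peer libvirtd; in the real bug this was
 * remoteSupportsFeature()/virConnectOpen(), which can block until the peer
 * answers -- something the peer cannot do if it is itself waiting on our
 * driver lock. */
static int remote_supports_feature(int feature)
{
    sleep(1);               /* pretend the round trip takes a while */
    return feature == 4;    /* hypothetical: peer supports feature 4 */
}

/* Deadlock-prone shape: the RPC is issued while driver_lock is held. */
static int migrate_perform_buggy(void)
{
    int supported;

    pthread_mutex_lock(&driver_lock);
    supported = remote_supports_feature(4);   /* blocks with the lock held */
    pthread_mutex_unlock(&driver_lock);
    return supported;
}

/* Fixed shape: release the lock around the RPC so the two daemons can
 * never wait on each other's driver locks. */
static int migrate_perform_fixed(void)
{
    int supported;

    pthread_mutex_lock(&driver_lock);
    /* ... work that genuinely needs the driver lock ... */

    pthread_mutex_unlock(&driver_lock);       /* "enter remote" */
    supported = remote_supports_feature(4);   /* peer can call back in */
    pthread_mutex_lock(&driver_lock);         /* "exit remote" */

    /* ... continue with the lock held ... */
    pthread_mutex_unlock(&driver_lock);
    return supported;
}

int main(void)
{
    printf("buggy shape: %d, fixed shape: %d\n",
           migrate_perform_buggy(), migrate_perform_fixed());
    return 0;
}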
This is fixed upstream by v0.8.6-38-g4186f92 and v0.8.6-39-g584c13f:
commit 4186f92935e6bb5057b2db14f47dfd817ab0ab84
Author: Jiri Denemark <jdenemar>
Date: Fri Dec 3 09:31:48 2010 +0100
Change return value of VIR_DRV_SUPPORTS_FEATURE to bool
virDrvSupportsFeature API is allowed to return -1 on error while all but
one uses of VIR_DRV_SUPPORTS_FEATURE only check for (non)zero return
value. Let's make this macro return zero on error, which is what
everyone expects anyway.
commit 584c13f3560fca894c568db39b81a856db1387cb
Author: Jiri Denemark <jdenemar>
Date: Fri Dec 3 10:48:31 2010 +0100
qemu: Fix a possible deadlock in p2p migration
Two more calls to remote libvirtd have to be surrounded by
qemuDomainObjEnterRemoteWithDriver() and
qemuDomainObjExitRemoteWithDriver() to prevent possible deadlock between
two communicating libvirt daemons.
See commit f0c8e1cb3774d6f09e2681ca1988bf235a343007 for further details.
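To illustrate the first of the two commits above, here is a small standalone C sketch (not libvirt's actual macro or driver code; drv_supports_feature and the feature number are made up). With a raw int return where -1 means error, a plain truth test treats the error as "feature supported", which is why collapsing the result to a boolean is safer.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical driver callback: 1 = supported, 0 = not supported, -1 = error. */
static int drv_supports_feature(int feature)
{
    if (feature < 0)
        return -1;          /* simulated error path */
    return feature == 4;
}

/* Pre-fix style: passes the raw tri-state value through. */
static int supports_feature_raw(int feature)
{
    return drv_supports_feature(feature);
}

/* Post-fix style: collapse errors to "not supported". */
static bool supports_feature_bool(int feature)
{
    return drv_supports_feature(feature) > 0;
}

int main(void)
{
    /* With the raw variant, the -1 error is truthy and mistaken for support. */
    if (supports_feature_raw(-1))
        printf("raw check: error mistaken for support\n");

    if (!supports_feature_bool(-1))
        printf("bool check: error treated as not supported\n");
    return 0;
}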
Verified as PASSED with the following builds:
libvirt-0.8.1-29.el6.x86_64
libvirt-client-0.8.1-29.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
qemu-img-0.12.1.2-2.128.el6.x86_64
kernel-2.6.32-93.el6.x86_64
Steps:
1. Create 10 VMs on each side.
#iptables -F
# setsebool -P virt_use_nfs 1
2. Distribute the ssh public key of the source host to the target host.
#ssh-keygen -t rsa
#ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostIP
3. Start VMs on each side.
#for i in {11..20};do virsh start vm$i;done
or
#for i in {1..10};do virsh start vm$i;done
4. Run concurrent bidirectional migration.
On server 1:
# for i in `seq 11 20`;do time virsh migrate --live vm$i qemu+ssh://10.66.93.59/system ; virsh list --all; done
On server 2:
# for i in `seq 1 10`;do time virsh migrate --live vm$i qemu+ssh://10.66.93.206/system ; virsh list --all; done
5. Check the output of step 4; virsh list works fine.
Verified on
kernel-2.6.32-92.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
Migration with 10 guests on each side running concurrently can finish successfully, but during the first several seconds of the migration, running "virsh list" produces some errors:
# virsh list
error: cannot send data: Broken pipe
error: failed to connect to the hypervisor
# virsh list
error: cannot recv data: : Connection reset by peer
error: failed to connect to the hypervisor
But after several seconds it lists normally, which means libvirtd is still alive:
# virsh list
Id Name State
----------------------------------
176 mig1 running
177 mig2 running
178 mig3 running
179 mig4 running
180 mig5 running
181 mig6 running
182 mig7 running
183 mig8 running
184 mig9 running
185 mig18 paused
186 mig13 paused
187 mig10 paused
188 mig11 paused
189 mig16 paused
190 mig14 paused
191 mig15 paused
192 mig19 paused
193 mig12 paused
Each guest is set to 256M of memory and both of my machines have 8G of memory.
I use the following command:
#!/bin/sh
for i in {0..9};
do
ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
./migrate-cmd.sh mig$i
done
where migrate-cmd.sh on both sides contains:
#!/bin/sh
virsh migrate --live $1 qemu+ssh://{ip addr}/system &
So I want to ask: is this a bug?
You're likely hitting the connection limit in libvirtd. Each virsh migrate results in one connection to the source and one connection to the target. So you end up with 20 open connections to each libvirtd, which is the default limit on the number of connections. Try increasing max_clients in /etc/libvirt/libvirtd.conf on both hosts.

(In reply to comment #9)
> You're likely hitting the connection limit in libvirtd. Each virsh migrate results
> in one connection to the source and one connection to the target. So you end up with
> 20 open connections to each libvirtd, which is the default limit on the number of
> connections. Try increasing max_clients in /etc/libvirt/libvirtd.conf on both
> hosts.

I think you are right; after I changed max_clients to 40, the error disappeared.

According to comment 8 and comment 9, this bug is verified as passed on:
kernel-2.6.32-94.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
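For reference, a minimal sketch of the change suggested in comment 9 above (40 is simply the value used in this report; pick a limit that covers the expected number of concurrent connections).

In /etc/libvirt/libvirtd.conf on both hosts:
max_clients = 40

Then restart the daemon so the new limit takes effect:
# service libvirtd restart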
Verified on
kernel-2.6.32-113.el6.x86_64
libvirt-0.8.7-6.el6.x86_64
qemu-kvm-0.12.1.2-2.144.el6.x86_64
Migration with 10 guests on each side running concurrently can finish successfully, and virsh list works normally, which means libvirtd is still alive:
# virsh list --all
Id Name State
----------------------------------
11 test1 running
12 test2 running
13 test3 running
14 test4 running
15 test5 running
16 test6 running
17 test7 running
18 test8 running
19 test9 running
20 test10 running
21 test19 paused
22 test14 paused
23 test18 paused
24 test15 paused
25 test13 paused
26 test12 paused
27 test11 paused
28 test16 paused
29 test20 paused
30 test17 paused
Using the command:
#!/bin/sh
for i in {0..9};
do
ssh root@{ip addr} ./migrate-cmd.sh mig1$i &
./migrate-cmd.sh mig$i
done
where migrate-cmd.sh on both sides contains:
#!/bin/sh
virsh migrate --live $1 qemu+ssh://{ip addr}/system &
So, setting the bug status to VERIFIED.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
A deadlock occurred in the libvirt service when running concurrent bidirectional migration because certain calls did not release their local driver lock before issuing an RPC (Remote Procedure Call) call on a remote libvirt daemon. A deadlock no longer occurs between two communicating libvirt daemons.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0596.html
Created attachment 464253 [details]
gdb

Description of problem:
The libvirt service is in a deadlock when running concurrent bidirectional migration. When I try to run 'virsh list', I get no response from the shell. gdb output attached.

Repro steps:
1) Run with 2 hosts.
2) Run the migrate command from both hosts concurrently, so VMs running on server 1 will migrate to server 2 and VMs running on server 2 will migrate to server 1.

libvirt-0.8.1-28.el6.x86_64
2.6.32-71.7.1.el6.x86_64
Red Hat Enterprise Linux Server release 6.0 (Santiago)
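For completeness, a per-thread backtrace like the attached gdb output can typically be captured from the running daemon with a command along these lines (a general example, not necessarily the exact command used for the attachment):

# gdb -batch -ex 'thread apply all bt' -p $(pidof libvirtd) > libvirtd-backtrace.txt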