Bug 856950 - Deadlock on libvirt when playing with hotplug and add/remove vm
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
6.3
x86_64 Linux
Severity: urgent
: rc
: ---
Assigned To: Michal Privoznik
GenadiC
: ZStream
: 875710 (view as bug list)
Depends On:
Blocks: 875710 875788 876102
Reported: 2012-09-13 04:30 EDT by GenadiC
Modified: 2013-02-21 02:23 EST (History)
15 users

See Also:
Fixed In Version: libvirt-0.10.2-0rc1.el6
Doc Type: Bug Fix
Doc Text:
When a qemu process is being destroyed by libvirt, a clean-up operation frees some internal structures and locks. However, since users can destroy qemu processes at the same time, libvirt holds the qemu driver lock to protect the list of domains and their states, among other things. Previously, a function tried to acquire the qemu driver lock when it was already held, creating a deadlock. The code has been modified to always check whether the lock is already held before attempting to acquire it, thus fixing this bug.
Story Points: ---
Clone Of:
: 875710 (view as bug list)
Environment:
Last Closed: 2013-02-21 02:23:43 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
Attachments
gdb log (13.46 KB, text/plain)
2012-09-13 04:30 EDT, GenadiC
no flags Details
libvirt log (663.59 KB, application/x-xz)
2012-09-13 04:33 EDT, GenadiC
no flags Details

Description GenadiC 2012-09-13 04:30:56 EDT
Created attachment 612372 [details]
gdb log

Description of problem:

After playing with hotplug/hotunplug and add/remove VM I got deadlock on libvirt
Log attached
Comment 1 GenadiC 2012-09-13 04:33:15 EDT
Created attachment 612373 [details]
libvirt log
Comment 3 Daniel Berrange 2012-09-13 07:53:08 EDT
This is the problematic thread:

Thread 1 (Thread 0x7f215745a860 (LWP 2594)):
#0  0x0000003e3500e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003e35009388 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003e35009257 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000048b0ec in qemuProcessHandleAgentDestroy (agent=0x7f213000f5d0, 
    vm=0x7f213000c360) at qemu/qemu_process.c:169
#4  0x0000000000465943 in qemuAgentFree (mon=0x7f213000f5d0)
    at qemu/qemu_agent.c:148
#5  qemuAgentUnref (mon=0x7f213000f5d0) at qemu/qemu_agent.c:167
#6  0x00000000004659f6 in qemuAgentClose (mon=0x7f213000f5d0)
    at qemu/qemu_agent.c:828
#7  0x000000000048cf8e in qemuProcessHandleAgentEOF (agent=0x7f213000f5d0, 
    vm=0x7f213000c360) at qemu/qemu_process.c:130
#8  0x0000000000465bf3 in qemuAgentIO (watch=<value optimized out>, 
    fd=<value optimized out>, events=<value optimized out>, 
    opaque=0x7f213000f5d0) at qemu/qemu_agent.c:715
#9  0x0000003a564486df in virEventPollDispatchHandles ()
    at util/event_poll.c:490
#10 virEventPollRunOnce () at util/event_poll.c:637
#11 0x0000003a56447487 in virEventRunDefaultImpl () at util/event.c:247
#12 0x0000003a56515aed in virNetServerRun (srv=0x11fce20)
    at rpc/virnetserver.c:736
#13 0x0000000000422421 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at libvirtd.c:1615

There is a recursive callback invocation here:

 1. On EOF from the agent, the qemuProcessHandleAgentEOF() callback is run which locks virDomainObjPtr.
 2. It then frees the agent which triggers qemuProcessHandleAgentDestroy() which tries to lock virDomainObjPtr again. Hence deadlock.

This could be solved by re-arranging code in HandleAgentEOF() like this


    priv = vm->privateData;
    priv->agent = NULL;

    virDomainObjUnlock(vm);
    qemuDriverUnlock(driver);

    qemuAgentClose(agent);

i.e. only hold the lock while blanking out the 'priv->agent' field.
Comment 4 Dave Allan 2012-09-13 10:58:29 EDT
(In reply to comment #3)
> This could be solved by re-arranging code in HandleAgentEOF() like this

Dan, are you planning to submit a patch?
Comment 5 Daniel Berrange 2012-09-13 11:00:24 EDT
Not right now. It would be better if someone who can actually reproduce the problem tested the idea I suggested above and then submitted the patch if it is confirmed to work.
Comment 6 Dave Allan 2012-09-13 11:15:06 EDT
Can you test a scratch build when we have one?
Comment 7 GenadiC 2012-09-13 11:27:23 EDT
(In reply to comment #6)
> Can you test a scratch build when we have one?

Yes, I can try, although I don't have a specific scenario that can reproduce the problem
Comment 8 Michal Privoznik 2012-09-14 07:47:17 EDT
So I've spun a scratch build. Can you please give it a try?

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4867654
Comment 9 GenadiC 2012-09-16 05:09:03 EDT
(In reply to comment #8)
> So I've spun a scratch build. Can you please give it a try?
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4867654

I installed the attached build and was not able to reproduce the problem.
Comment 10 Michal Privoznik 2012-09-17 08:05:12 EDT
That's great news. Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2012-September/msg01165.html
Comment 11 Michal Privoznik 2012-09-18 03:33:29 EDT
Patch pushed upstream, moving to POST:

commit 1020a5041b0eb575f65b53cb1ca9cee2447a50cd
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Fri Sep 14 10:53:00 2012 +0200
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Tue Sep 18 09:24:06 2012 +0200

    qemu: Avoid deadlock on HandleAgentEOF
    
    On agent EOF the qemuProcessHandleAgentEOF() callback is called
    which locks virDomainObjPtr. Then qemuAgentClose() is called
    (with domain object locked) which eventually calls qemuAgentDispose()
    and qemuProcessHandleAgentDestroy(). This tries to lock the
    domain object again. Hence the deadlock.


v0.10.1-190-g1020a50
Comment 13 Alex Jia 2012-09-19 07:13:09 EDT
I can reproduce the bug on libvirt-0.10.1-2.el6.x86_64:

1. Open the first terminal and run the following cmdline (substitute your domain name and image name):

# for i in `seq 100`;do virsh attach-disk <domain> /var/lib/libvirt/images/<image>.img vda; virsh detach-disk <domain> vda; sleep 2;done

2. Open the second terminal and run the following cmdlines (prepare guest XML files bar-1.xml through bar-10.xml beforehand; the guest names are bar-1 through bar-10):

# for i in `seq 10`;do virsh create bar-$i.xml;done
# for i in `seq 10`;do virsh destroy bar-$i;done

And then you will see the following error:

error: Failed to reconnect to the hypervisor
error: no valid connection
error: Cannot recv data: Connection reset by peer

error: Failed to reconnect to the hypervisor
error: no valid connection
error: internal error client socket is closed


It's okay on libvirt-0.10.2-0rc1.el6; I haven't seen this error, so I'm moving the bug to VERIFIED status.
Comment 14 Michal Privoznik 2012-11-12 08:20:20 EST
*** Bug 875710 has been marked as a duplicate of this bug. ***
Comment 17 errata-xmlrpc 2013-02-21 02:23:43 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
