Bug 1487322 - Deadlock on dmn object since ace45e67
Summary: Deadlock on dmn object since ace45e67
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: All
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Michal Privoznik
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-31 15:43 UTC by Christian Ehrhardt
Modified: 2017-09-01 11:34 UTC
3 users

Fixed In Version: libvirt-3.7.0
Clone Of:
Environment:
Last Closed: 2017-09-01 11:34:07 UTC
Embargoed:



Description Christian Ehrhardt 2017-08-31 15:43:25 UTC
Description of problem:


Version-Release number of selected component (if applicable):
- Current Head (3.6 and 3.7-rc*)

How reproducible:
- takes a while, but reproduces close to 100% of the time

Steps to Reproduce:
- Run concurrent create/start/stop/destroy loops
- OpenStack reboot --hard
- probably others
- Further details are at pad.lv/1714254; a minimal stress client is sketched below
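
For illustration only, a small libvirt client along these lines (hypothetical; "testvm" is a placeholder for an existing defined domain) drives the same virDomainCreateWithFlags path seen in the trace below; running several copies concurrently reproduces the hang:

/* Hypothetical stress client: repeatedly start and destroy a
 * pre-defined domain. Build with: gcc repro.c -lvirt */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect\n");
        return 1;
    }

    for (;;) {
        virDomainPtr dom = virDomainLookupByName(conn, "testvm");
        if (!dom)
            break;
        /* Starting the domain reaches qemuProcessInit() and thus the
         * shutdown-inhibition code in the backtrace; errors (e.g. the
         * domain already running) are ignored for brevity. */
        virDomainCreate(dom);
        virDomainDestroy(dom);
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}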

Actual results:
- libvirtd hangs in a deadlock

Expected results:
- libvirtd continues to work

Additional info:
The change introducing the issue is
https://libvirt.org/git/?p=libvirt.git;a=commit;h=ace45e67abbd9a033be54602db71a9dbc71408d7
Author: Michal Privoznik <mprivozn>
Date:   Tue Aug 1 11:04:22 2017 +0200

    virNetDaemonCallInhibit: Call virNetDaemonGotInhibitReply properly

Since that commit there is a race that can cause the same thread to lock the same object twice. Because pthread mutexes of the default type perform no error checking, the second lock attempt simply deadlocks.
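
This mutex behaviour is easy to demonstrate in a standalone program (not libvirt code); switching the mutex type to PTHREAD_MUTEX_ERRORCHECK turns the silent hang into an EDEADLK error, which is one way to surface this class of bug while debugging:

/* Build with: gcc demo.c -pthread */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    /* With the default type the second lock below would hang forever;
     * the error-checking type reports the self-deadlock instead. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);                     /* first lock: succeeds */
    int rc = pthread_mutex_lock(&m);            /* second lock, same thread */
    printf("second lock: %s\n", strerror(rc));  /* prints the EDEADLK message */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}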

This shows up in a stack trace like the following:
Thread 3 (Thread 0x7f07e941e700 (LWP 30193)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f07f4891dbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55e8a5ae4aa0) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f07f51d9c45 in virMutexLock (m=m@entry=0x55e8a5ae4aa0) at ../../../src/util/virthread.c:89
#3  0x00007f07f51ba0ca in virObjectLock (anyobj=anyobj@entry=0x55e8a5ae4a90) at ../../../src/util/virobject.c:388
#4  0x00007f07f52f7558 in virNetDaemonGotInhibitReply (pending=0x7f07b8018fc0, opaque=opaque@entry=0x55e8a5ae4a90) at ../../../src/rpc/virnetdaemon.c:449
#5  0x00007f07f52f83e7 in virNetDaemonCallInhibit (what=0x7f07f53a6419 "shutdown", mode=0x7f07f53d8ce2 "delay", why=0x7f07f53f99a0 "Virtual machines need to be saved", who=0x7f07f53f9b46 "Libvirt", dmn=0x55e8a5ae4a90) at ../../../src/rpc/virnetdaemon.c:519
#6  virNetDaemonAddShutdownInhibition (dmn=0x55e8a5ae4a90) at ../../../src/rpc/virnetdaemon.c:541
#7  0x00007f07b2bbe4d2 in qemuProcessInit (driver=driver@entry=0x7f07a01e1b70, vm=vm@entry=0x7f07ac003360, updatedCPU=updatedCPU@entry=0x0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_START, migration=migration@entry=false, flags=flags@entry=17) at ../../../src/qemu/qemu_process.c:4726
#8  0x00007f07b2bc1313 in qemuProcessStart (conn=conn@entry=0x7f07bc000c70, driver=driver@entry=0x7f07a01e1b70, vm=vm@entry=0x7f07ac003360, updatedCPU=updatedCPU@entry=0x0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_START, migrateFrom=migrateFrom@entry=0x0, migrateFd=-1, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=17) at ../../../src/qemu/qemu_process.c:5850
#9  0x00007f07b2c1f7ff in qemuDomainObjStart (conn=0x7f07bc000c70, driver=driver@entry=0x7f07a01e1b70, vm=0x7f07ac003360, flags=flags@entry=0, asyncJob=QEMU_ASYNC_JOB_START) at ../../../src/qemu/qemu_driver.c:7121
#10 0x00007f07b2c1fef6 in qemuDomainCreateWithFlags (dom=0x7f07b8000b20, flags=0) at ../../../src/qemu/qemu_driver.c:7175
#11 0x00007f07f5295ef1 in virDomainCreateWithFlags (domain=domain@entry=0x7f07b8000b20, flags=0) at ../../../src/libvirt-domain.c:6595
#12 0x000055e8a4e3227a in remoteDispatchDomainCreateWithFlags (server=0x55e8a5ae1200, msg=0x55e8a5ae4b90, ret=0x7f07b8015fe0, args=0x7f07b8000af0, rerr=0x7f07e941dbc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:4307
#13 remoteDispatchDomainCreateWithFlagsHelper (server=0x55e8a5ae1200, client=<optimized out>, msg=0x55e8a5ae4b90, rerr=0x7f07e941dbc0, args=0x7f07b8000af0, ret=0x7f07b8015fe0) at ../../../daemon/remote_dispatch.h:4282
#14 0x00007f07f52fe599 in virNetServerProgramDispatchCall (msg=0x55e8a5ae4b90, client=0x55e8a5b03840, server=0x55e8a5ae1200, prog=0x55e8a5b27070) at ../../../src/rpc/virnetserverprogram.c:437
#15 virNetServerProgramDispatch (prog=0x55e8a5b27070, server=server@entry=0x55e8a5ae1200, client=0x55e8a5b03840, msg=0x55e8a5ae4b90) at ../../../src/rpc/virnetserverprogram.c:307
#16 0x000055e8a4e53268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x55e8a5ae1200) at ../../../src/rpc/virnetserver.c:148
#17 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55e8a5ae1200) at ../../../src/rpc/virnetserver.c:169
#18 0x00007f07f51da71b in virThreadPoolWorker (opaque=opaque@entry=0x55e8a5ad6f60) at ../../../src/util/virthreadpool.c:167
#19 0x00007f07f51d9ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#20 0x00007f07f488f6ba in start_thread (arg=0x7f07e941e700) at pthread_create.c:333
#21 0x00007f07f45c53dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109


The object was first locked at frame #6 (virnetdaemon.c:534):
534     virObjectLock(dmn);
but under certain conditions (the race) the same object was then locked again at frame #4 (virnetdaemon.c:449):
449     virObjectLock(dmn);
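
Schematically, the racy call chain can be mirrored in a self-contained analogue (hypothetical names modelled on the libvirt functions, with a plain pthread mutex standing in for virObjectLock):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dmn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for virNetDaemonGotInhibitReply(): written as an
 * asynchronous callback, it takes the lock itself. */
static void got_inhibit_reply(void)
{
    pthread_mutex_lock(&dmn_lock);   /* second lock -> deadlock */
    printf("never reached\n");
    pthread_mutex_unlock(&dmn_lock);
}

/* Stands in for virNetDaemonCallInhibit(): if the D-Bus reply is
 * already complete, ace45e67 invokes the callback directly instead
 * of letting the event loop deliver it later. */
static void call_inhibit(void)
{
    int reply_already_complete = 1;  /* models losing the race */
    if (reply_already_complete)
        got_inhibit_reply();
}

/* Stands in for virNetDaemonAddShutdownInhibition(), which holds the
 * lock across the inhibit call. */
int main(void)
{
    pthread_mutex_lock(&dmn_lock);   /* first lock */
    call_inhibit();                  /* re-enters the lock -> hang */
    pthread_mutex_unlock(&dmn_lock);
    return 0;
}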

Comment 1 Christian Ehrhardt 2017-09-01 08:24:06 UTC
There is a suggested patch that we will use as a hotfix while waiting for a proper discussion/solution to be picked up upstream eventually.

See: https://launchpadlibrarian.net/335315545/avoid-double-locking.patch

Comment 2 Christian Ehrhardt 2017-09-01 08:26:55 UTC
Hmm, that is actually just the "partial revert" hotfix; the suggestion I actually meant is at http://paste.ubuntu.com/25439049/

Comment 3 Michal Privoznik 2017-09-01 08:44:57 UTC
I've proposed a patch on the upstream list:

https://www.redhat.com/archives/libvir-list/2017-September/msg00015.html

Comment 4 Christian Ehrhardt 2017-09-01 09:50:10 UTC
The change builds fine and seems to fix the issue.
It survived 20 minutes in my stress loop, which it never did before.

Comment 5 Michal Privoznik 2017-09-01 11:34:07 UTC
And I've pushed the patch:

commit 054c6d27214ef90feab3897b052f450492d5ef3f
Author:     Michal Privoznik <mprivozn>
AuthorDate: Fri Sep 1 10:33:12 2017 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Fri Sep 1 13:21:33 2017 +0200

    virnetdaemon: Don't deadlock when talking to D-Bus
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1487322
    
    In ace45e67abbd I tried to fix a problem that we get the reply to
    a D-Bus call while we were sleeping. In that case the callback
    was never set. So I changed the code that the callback is called
    directly in this case. However, I hadn't realized that since the
    callback is called out of order it locks the virNetDaemon.
    Exactly the very same virNetDaemon object that we are dealing
    with right now and that we have locked already (in
    virNetDaemonAddShutdownInhibition())
    
    Signed-off-by: Michal Privoznik <mprivozn>

v3.7.0-rc2-5-g054c6d272
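
A common way to break this pattern, sketched here with the same hypothetical names as above (the pushed commit may differ in detail), is to split the reply handler into a lock-free worker plus a locking wrapper, and run the worker directly when the caller already holds the lock:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dmn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker: assumes the caller already holds dmn_lock. */
static void got_inhibit_reply_locked(void)
{
    printf("processing inhibit reply under the lock\n");
}

/* Wrapper: registered as the asynchronous D-Bus callback, where the
 * lock is not yet held. */
static void got_inhibit_reply(void)
{
    pthread_mutex_lock(&dmn_lock);
    got_inhibit_reply_locked();
    pthread_mutex_unlock(&dmn_lock);
}

static void call_inhibit(void)
{
    int reply_already_complete = 1;      /* the racy case */
    if (reply_already_complete)
        got_inhibit_reply_locked();      /* lock already held: no re-lock */
    /* otherwise the event loop later invokes got_inhibit_reply() */
}

int main(void)
{
    pthread_mutex_lock(&dmn_lock);
    call_inhibit();                      /* completes without deadlock */
    pthread_mutex_unlock(&dmn_lock);

    (void)got_inhibit_reply;             /* in real code the event loop
                                            would reference this */
    return 0;
}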

