Bug 964358 - virSetUIDGID can deadlock due to unsafe use of getpwuid_r
virSetUIDGID can deadlock due to unsafe use of getpwuid_r
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: libvirt (Show other bugs)
18
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Eric Blake
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-05-17 23:09 EDT by Eric Blake
Modified: 2013-08-14 22:55 EDT (History)
10 users (show)

See Also:
Fixed In Version: libvirt-0.10.2.7-1.fc18
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 964359 (view as bug list)
Environment:
Last Closed: 2013-08-14 22:55:10 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Eric Blake 2013-05-17 23:09:14 EDT
Description of problem:
I hit an (admittedly rare) deadlock today where a multithreaded parent spawned off a child, then the child called getpwuid_r.  POSIX says that getpwuid_r is NOT async-signal-safe, and therefore unsafe to be used between fork and exec, precisely because of the possibility of the deadlock I hit.  Libvirt needs to rewrite virSetUIDGID to take the additional information learned by getpwuid_r as a parameter learned in advance, prior to the fork, rather than learning it inline after the fork.

Version-Release number of selected component (if applicable):
Deadlock hit me on latest libvirt.git, but git blame says that virSetUIDGID has been calling getpwuid_r since at least 2011, so I'm raising it against Fedora 18.

How reproducible:
Extremely rare

Steps to Reproduce:
1. Start a VM, and happen to hit the small window where the parent process holds the getpwuid_r mutex in one thread while another thread tries to fork and exec.  While I happened to get (extremely) (un-)lucky to hit it in real life, it will probably be a lot easier if you use gdb to help.
2.
3.
  
Actual results:
Deadlock.
Parent process hung with the following backtrace:
(gdb) t a a bt

Thread 2 (Thread 0x7fd56bbf2700 (LWP 19367)):
#0  0x00007fd5761ec54d in read () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fd578aace46 in saferead (fd=20, buf=0x7fd56bbf1117, count=1)
    at util/virfile.c:926
#2  0x00007fd578a9e97a in virCommandHandshakeWait (cmd=0x7fd55c231f10)
    at util/vircommand.c:2463
#3  0x00007fd56326ff40 in qemuProcessStart (conn=0x7fd53c000df0, 
    driver=0x7fd55c0dc4f0, vm=0x7fd54800b100, migrateFrom=0x0, stdin_fd=-1, 
    stdin_path=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, 
    flags=1) at qemu/qemu_process.c:3729
#4  0x00007fd5632af536 in qemuDomainObjStart (conn=0x7fd53c000df0, 
    driver=0x7fd55c0dc4f0, vm=0x7fd54800b100, flags=0)
    at qemu/qemu_driver.c:5453
#5  0x00007fd5632af796 in qemuDomainCreateWithFlags (dom=0x7fd55c307d30, 
    flags=0) at qemu/qemu_driver.c:5502
#6  0x00007fd5632af829 in qemuDomainCreate (dom=0x7fd55c307d30)
    at qemu/qemu_driver.c:5520
#7  0x00007fd578b91569 in virDomainCreate (domain=0x7fd55c307d30)
    at libvirt.c:8462
#8  0x00007fd5795ee7aa in remoteDispatchDomainCreate (server=0x7fd579d95280, 
    client=0x7fd579da5010, msg=0x7fd579da87c0, rerr=0x7fd56bbf1aa0, 
    args=0x7fd55c307d70) at remote_dispatch.h:2910
---Type <return> to continue, or q <return> to quit---
#9  0x00007fd5795ee6a0 in remoteDispatchDomainCreateHelper (
    server=0x7fd579d95280, client=0x7fd579da5010, msg=0x7fd579da87c0, 
    rerr=0x7fd56bbf1aa0, args=0x7fd55c307d70, ret=0x7fd55c305430)
    at remote_dispatch.h:2888
#10 0x00007fd578bfdbfa in virNetServerProgramDispatchCall (
    prog=0x7fd579d91cf0, server=0x7fd579d95280, client=0x7fd579da5010, 
    msg=0x7fd579da87c0) at rpc/virnetserverprogram.c:439
#11 0x00007fd578bfd771 in virNetServerProgramDispatch (prog=0x7fd579d91cf0, 
    server=0x7fd579d95280, client=0x7fd579da5010, msg=0x7fd579da87c0)
    at rpc/virnetserverprogram.c:305
#12 0x00007fd578c03ff5 in virNetServerProcessMsg (srv=0x7fd579d95280, 
    client=0x7fd579da5010, prog=0x7fd579d91cf0, msg=0x7fd579da87c0)
    at rpc/virnetserver.c:162
#13 0x00007fd578c040e4 in virNetServerHandleJob (jobOpaque=0x7fd579da91d0, 
    opaque=0x7fd579d95280) at rpc/virnetserver.c:183
#14 0x00007fd578ae3ab6 in virThreadPoolWorker (opaque=0x7fd579d731a0)
    at util/virthreadpool.c:144
#15 0x00007fd578ae347b in virThreadHelper (data=0x7fd579d73110)
    at util/virthreadpthread.c:161
#16 0x00007fd5761e5851 in start_thread (arg=0x7fd56bbf2700)
    at pthread_create.c:301
#17 0x00007fd575b2c90d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 1 (Thread 0x7fd579597800 (LWP 19341)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007fd578ae32a1 in virCondWait (c=0x7fd579d95490, m=0x7fd579d95438)
    at util/virthreadpthread.c:117
#2  0x00007fd578ae3f84 in virThreadPoolFree (pool=0x7fd579d95400)
    at util/virthreadpool.c:264
#3  0x00007fd578c06b93 in virNetServerDispose (obj=0x7fd579d95280)
    at rpc/virnetserver.c:1168
#4  0x00007fd578acecb4 in virObjectUnref (anyobj=0x7fd579d95280)
    at util/virobject.c:264
#5  0x00007fd5795e5031 in main (argc=1, argv=0x7fff3eaec128) at libvirtd.c:1500



Child process hung with the following backtrace:
(gdb) t a a bt

Thread 1 (Thread 0x7fd56bbf2700 (LWP 3212)):
#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007fd5761e7388 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fd5761e7257 in __pthread_mutex_lock (mutex=0x7fd56be00360)
    at pthread_mutex_lock.c:61
#3  0x00007fd56bbf9fc5 in _nss_files_getpwuid_r (uid=0, result=0x7fd56bbf0c70, 
    buffer=0x7fd55c2a65f0 "", buflen=1024, errnop=0x7fd56bbf25b8)
    at nss_files/files-pwd.c:40
#4  0x00007fd575aeff1d in __getpwuid_r (uid=0, resbuf=0x7fd56bbf0c70, 
    buffer=0x7fd55c2a65f0 "", buflen=1024, result=0x7fd56bbf0cb0)
    at ../nss/getXXbyYY_r.c:253
#5  0x00007fd578aebafc in virSetUIDGID (uid=0, gid=0) at util/virutil.c:1031
#6  0x00007fd578aebf43 in virSetUIDGIDWithCaps (uid=0, gid=0, capBits=0, 
    clearExistingCaps=true) at util/virutil.c:1388
#7  0x00007fd578a9a20b in virExec (cmd=0x7fd55c231f10) at util/vircommand.c:654
#8  0x00007fd578a9dfa2 in virCommandRunAsync (cmd=0x7fd55c231f10, pid=0x0)
    at util/vircommand.c:2247
#9  0x00007fd578a9d74e in virCommandRun (cmd=0x7fd55c231f10, exitstatus=0x0)
    at util/vircommand.c:2100
#10 0x00007fd56326fde5 in qemuProcessStart (conn=0x7fd53c000df0, 
    driver=0x7fd55c0dc4f0, vm=0x7fd54800b100, migrateFrom=0x0, stdin_fd=-1, 
---Type <return> to continue, or q <return> to quit---
    stdin_path=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, 
    flags=1) at qemu/qemu_process.c:3694
#11 0x00007fd5632af536 in qemuDomainObjStart (conn=0x7fd53c000df0, 
    driver=0x7fd55c0dc4f0, vm=0x7fd54800b100, flags=0)
    at qemu/qemu_driver.c:5453
#12 0x00007fd5632af796 in qemuDomainCreateWithFlags (dom=0x7fd55c307d30, 
    flags=0) at qemu/qemu_driver.c:5502
#13 0x00007fd5632af829 in qemuDomainCreate (dom=0x7fd55c307d30)
    at qemu/qemu_driver.c:5520
#14 0x00007fd578b91569 in virDomainCreate (domain=0x7fd55c307d30)
    at libvirt.c:8462
#15 0x00007fd5795ee7aa in remoteDispatchDomainCreate (server=0x7fd579d95280, 
    client=0x7fd579da5010, msg=0x7fd579da87c0, rerr=0x7fd56bbf1aa0, 
    args=0x7fd55c307d70) at remote_dispatch.h:2910
#16 0x00007fd5795ee6a0 in remoteDispatchDomainCreateHelper (
    server=0x7fd579d95280, client=0x7fd579da5010, msg=0x7fd579da87c0, 
    rerr=0x7fd56bbf1aa0, args=0x7fd55c307d70, ret=0x7fd55c305430)
    at remote_dispatch.h:2888
#17 0x00007fd578bfdbfa in virNetServerProgramDispatchCall (
    prog=0x7fd579d91cf0, server=0x7fd579d95280, client=0x7fd579da5010, 
    msg=0x7fd579da87c0) at rpc/virnetserverprogram.c:439
#18 0x00007fd578bfd771 in virNetServerProgramDispatch (prog=0x7fd579d91cf0, 
    server=0x7fd579d95280, client=0x7fd579da5010, msg=0x7fd579da87c0)
---Type <return> to continue, or q <return> to quit---
    at rpc/virnetserverprogram.c:305
#19 0x00007fd578c03ff5 in virNetServerProcessMsg (srv=0x7fd579d95280, 
    client=0x7fd579da5010, prog=0x7fd579d91cf0, msg=0x7fd579da87c0)
    at rpc/virnetserver.c:162
#20 0x00007fd578c040e4 in virNetServerHandleJob (jobOpaque=0x7fd579da91d0, 
    opaque=0x7fd579d95280) at rpc/virnetserver.c:183
#21 0x00007fd578ae3ab6 in virThreadPoolWorker (opaque=0x7fd579d731a0)
    at util/virthreadpool.c:144
#22 0x00007fd578ae347b in virThreadHelper (data=0x7fd579d73110)
    at util/virthreadpthread.c:161
#23 0x00007fd5761e5851 in start_thread (arg=0x7fd56bbf2700)
    at pthread_create.c:301
#24 0x00007fd575b2c90d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115



Expected results:
No deadlock - getpwuid_r must be called BEFORE fork, and the results cached and passed in as a parameter to virSetUIDGID after fork, so that we aren't using non-async-signal-safe functions that deadlock the child, which in turn prevents us from locking up the parent.

Additional info:
In searching for duplicates, I found that bug 452629 described similar symptoms, but without a root cause analysis to the POSIX async-signal safety aspect.
Comment 1 Laine Stump 2013-05-18 01:23:19 EDT
This BZ was cloned to Bug 964359 for RHEL.

(I removed the automatically added "Depends on" from the clone, which removed the automatically added "Blocks from this BZ. It is correct that the two bugs don't have a depends/block relationship, but eliminating it removes all reference of the other BZ from this one, which also isn't good. This comment restores the connection.)
Comment 2 Eric Blake 2013-05-21 23:26:35 EDT
Initial upstream attempt at patching this:
https://www.redhat.com/archives/libvir-list/2013-May/msg01600.html
Comment 3 Cole Robinson 2013-06-11 18:14:32 EDT
Eric, did this ever get a repost? I can't find it but maybe the patch topic changed.
Comment 4 Laine Stump 2013-06-12 07:01:16 EDT
I raised a potential issue in the review that ended up being valid, and I believe he was trying to get the license relaxed for a gnulib function he wanted to use as a solution:

  https://www.redhat.com/archives/libvir-list/2013-May/msg01646.html

I couldn't find a patch either, and he's on vacation now.
Comment 5 Eric Blake 2013-06-12 16:18:19 EDT
Repost is still pending once the gnulib license issues are sorted.  I had hoped to clean it up before my vacation began, but was missing confirmation from one original author at the time; now it will more likely be the last week in June before I repost.
Comment 6 Eric Blake 2013-07-09 21:18:24 EDT
Another attempt at the upstream patch:
https://www.redhat.com/archives/libvir-list/2013-July/msg00566.html
Comment 7 Eric Blake 2013-07-12 17:24:36 EDT
Fixed for qemu upstream as of:

commit ee777e994927ed5f2d427fbc5a53cbe8b5969bda
Author: Eric Blake <eblake@redhat.com>
Date:   Tue May 21 20:59:10 2013 -0600

    util: make virSetUIDGID async-signal-safe
    
    https://bugzilla.redhat.com/show_bug.cgi?id=964358
    
    POSIX states that multi-threaded apps should not use functions
    that are not async-signal-safe between fork and exec, yet we
    were using getpwuid_r and initgroups.  Although rare, it is
    possible to hit deadlock in the child, when it tries to grab
    a mutex that was already held by another thread in the parent.
    I actually hit this deadlock when testing multiple domains
    being started in parallel with a command hook, with the following
    backtrace in the child:

...

Still needs additional fixes for lxc:
https://www.redhat.com/archives/libvir-list/2013-July/msg00849.html
https://www.redhat.com/archives/libvir-list/2013-July/msg00853.html
Comment 8 Eric Blake 2013-08-01 18:05:15 EDT
I've backported all the patches required to the v1.0.5-maint branch (F19) and the v0.10.2-maint branch (F18); the next maint release that Cole cuts from each branch will include the fix.
Comment 9 Fedora Update System 2013-08-01 19:50:56 EDT
libvirt-0.10.2.7-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/libvirt-0.10.2.7-1.fc18
Comment 10 Fedora Update System 2013-08-02 17:48:35 EDT
Package libvirt-0.10.2.7-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-0.10.2.7-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-14066/libvirt-0.10.2.7-1.fc18
then log in and leave karma (feedback).
Comment 11 Fedora Update System 2013-08-14 22:55:10 EDT
libvirt-0.10.2.7-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.