Bug 1509088

Summary:

autofs hangs due to race condition in do_spawn

Product:

Red Hat Enterprise Linux 7

Reporter:

Michael Fenn <Michael.Fenn>

Component:

autofs

Assignee:

Ian Kent <ikent>

Status:

CLOSED ERRATA

QA Contact:

xiaoli feng <xifeng>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

7.3

CC:

ikent, xifeng, xzhou

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

autofs-5.0.7-80.el7

Doc Type:

Bug Fix

Doc Text:

Cause: There can be a finite time between performing an open and setting the descriptor close on exec and during this window the opened file descriptor can leak into execv(3) executed commands. Consequence: When this happens it leads to two problems, one is that file descriptors can leak into forked processes which selinux prohibits. The second is a leaked file descriptor can cause poll(2) to not return as it should leading to a hang when running such things as a mount. Fix: Mutual exclusion has been added to ensure that file handle opens are either completed before or not started during the fork(2) when these processes are created. Result: No file handle leaks into sub-processes and executed programs no longer occasionally hang.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-04-10 18:18:20 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1527815

Bug Blocks:

Attachments:

Description	Flags
Core of stuck automount process	none
Proposed fix using pipe2	none
Patch - update configure to check for pipe2(2)	none
Patch - fix open calls not using open_xxxx() calls	none
Patch - move open_xxxx() functions to spawn.c	none
Patch - serialize calls to open_xxxx() functions	none

Description Michael Fenn 2017-11-02 23:50:48 UTC

Created attachment 1347127 [details]
Core of stuck automount process

Description of problem:
Automount is multi-threaded and can mount two filesystems at the same time in different threads, and there is a race condition between these threads:

Consider two threads in do_spawn, and one of them (T1) is in open_pipe, *between* the call to pipe and fcntl(..., FD_CLOEXEC) and the other one (T2) calls exec("mount", ..."

At the moment that exec is called, the socketpair in T1 contains two open file descriptors, neither of which is CLOEXEC.  So those file descriptors stay open in the mount corresponding to T2.   Nobody in the T2 stack (mount, which execs a FUSE mount helper, which daemonizes itself) ever closes those file descriptors.

Now return to T1.  Because do_spawn gets fancy with capturing and logging stderr, the parent branch of the fork() in T1 jumps through hoops to ignore SIGCHLD, and to do 'timed_reads' on T1's mount's stdout/stderr.  Those timed reads keep reading until poll returns a POLLHUP.  That's supposed to happen when T1's mount process terminates (or when it closes stdout and stderr).  BUT now, T2 also has the other end of the pipe open.  So T1's timed_read never gets the POLLHUP that would tell it to quit.  It's stuck forever.

Normally this race is fairly harmless because the mount helper will exit, taking all its file descriptors with it.  However, in the case where the mount helper forks off a daemon (e.g. a FUSE filesystem), the file descriptors "inherited" by the daemon may never get closed, leading to a defunct mount process.

root     18819     1  0 Oct16 ?        00:00:00 /bin/bash --norc -c ...
root     18827  1279  0 Oct16 ?        00:00:00 [mount] <defunct>
root     18835     1  0 Oct16 ?        00:00:05 /sbin/mount.fusedaemon

When automount is wedged in this way, it's even worse than just a zombie mount process hanging around. The whole autofs subsystem is wedged.  Here's a stack trace for the stuck thread (mount.fs123p7 is the fuse daemon):

Thread 16 (Thread 0x7f9ab8cf2700 (LWP 18821)):
#0  0x00007f9addee6dfd in ?? () from /lib64/libc.so.6
#1  0x0000000000000001 in ?? ()
#2  0x00007f9adf4d40fe in poll (__timeout=<optimized out>, __nfds=1,
    __fds=0x7f9ab8cec690) at /usr/include/bits/poll2.h:46
#3  timed_read (time=-1, len=2047,
    buf=0x7f9ab8cec820 "mountpoint=/gdn/centos7/0001, multithreaded=1, foreground=0", pipe=85) at spawn.c:107
#4  do_spawn (logopt=logopt@entry=0, wait=wait@entry=4294967295,
    options=options@entry=2, prog=prog@entry=0x7f9ab8ced170 "/usr/bin/mount",
    argv=argv@entry=0x7f9ab8ced070) at spawn.c:272
---Type <return> to continue, or q <return> to quit---
#5  0x00007f9adf4d4a66 in spawn_mount (logopt=0) at spawn.c:420
#6  0x00007f9a63dfd3d4 in mount_mount (ap=0x7f9adfaef930,
    root=0x7f9adfaefa20 "/gdn/centos7", name=<optimized out>, name_len=4,
    what=0x7f9ab8cef390 "http://gardenweb.en.desres.deshaw.com/prod/gdn/centos7/0001", fstype=0x7f9ab8cef400 "fs123p7",
    options=0x7f9ab8cef420 "ro,noatime,allow_other,Fs123CacheDir=/state/partition1/fs-cache/fs123_gdn_centos7_0001,Fs123CacheMaxMBytes=7500,Fs123CacheMaxFiles=500000,Fs123LocalLocks=1,Fs123StaleIfError=864000,Fs123TruncateTo32Bi"...,
    context=0x462c303035373d73) at mount_generic.c:87
#7  0x00007f9adf4d609f in do_mount (ap=ap@entry=0x7f9adfaef930,
    root=0x7f9adfaefa20 "/gdn/centos7", name=name@entry=0x7f9ab8cef3e0 "0001",
    name_len=4,
    what=0x7f9ab8cef390 "http://gardenweb.en.desres.deshaw.com/prod/gdn/centos7/0001", fstype=fstype@entry=0x7f9ab8cef400 "fs123p7",
    options=options@entry=0x7f9ab8cef420 "ro,noatime,allow_other,Fs123CacheDir=/state/partition1/fs-cache/fs123_gdn_centos7_0001,Fs123CacheMaxMBytes=7500,Fs123CacheMaxFiles=500000,Fs123LocalLocks=1,Fs123StaleIfError=864000,Fs123TruncateTo32Bi"...) at mount.c:78
#8  0x00007f9adb942672 in sun_mount (ap=ap@entry=0x7f9adfaef930,
    root=<optimized out>, name=<optimized out>, namelen=<optimized out>,
    loc=<optimized out>, loclen=<optimized out>, loclen@entry=60,
    options=0x7f9ab8cef420 "ro,noatime,allow_other,Fs123CacheDir=/state/partition1/fs-cache/fs123_gdn_centos7_0001,Fs123CacheMaxMBytes=7500,Fs123CacheMaxFiles=5---Type <return> to continue, or q <return> to quit---
00000,Fs123LocalLocks=1,Fs123StaleIfError=864000,Fs123TruncateTo32Bi"...,
    ctxt=0x7f9a98000d90) at parse_sun.c:711
#9  0x00007f9adb944c01 in parse_mount (ap=0x7f9adfaef930,
    name=<optimized out>, name_len=<optimized out>, mapent=<optimized out>,
    context=<optimized out>) at parse_sun.c:1672
#10 0x00007f9adaad86aa in lookup_mount (ap=<optimized out>,
    name=<optimized out>, name_len=4, context=0x7f9a980008c0)
    at lookup_program.c:691
#11 0x00007f9adf4d775d in do_lookup_mount (ap=ap@entry=0x7f9adfaef930,
    map=0x7f9aa8000910, name=name@entry=0x7f9ab8cefdc0 "0001",
    name_len=name_len@entry=4) at lookup.c:766
#12 0x00007f9adf4d7b89 in lookup_name_file_source_instance (
    ap=ap@entry=0x7f9adfaef930, map=map@entry=0x7f9adfaefa40,
    name=name@entry=0x7f9ab8cefdc0 "0001", name_len=name_len@entry=4)
    at lookup.c:877
#13 0x00007f9adf4d842d in lookup_nss_mount (ap=ap@entry=0x7f9adfaef930,
    source=source@entry=0x0, name=name@entry=0x7f9ab8cefdc0 "0001", name_len=4)
    at lookup.c:1128
#14 0x00007f9adf4cef55 in do_mount_indirect (arg=<optimized out>)
    at indirect.c:772
#15 0x00007f9adf089dc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f9addef173d in clone () from /lib64/libc.so.6

Note in frame 13, the wedged thread is in lookup_nss_mount.  Unfortunately, in frame 14, the kernel is waiting for us to send the ready or fail ioctls.

    info(ap->logopt, "attempting to mount entry %s", buf);

    set_tsd_user_vars(ap->logopt, mt.uid, mt.gid);

    status = lookup_nss_mount(ap, NULL, mt.name, mt.len);
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &state);
    if (status) {
        ops->send_ready(ap->logopt,
                ap->ioctlfd, mt.wait_queue_token);
        info(ap->logopt, "mounted %s", buf);
    } else {
        /* TODO: get mount return status from lookup_nss_mount */
        ops->send_fail(ap->logopt,
                   ap->ioctlfd, mt.wait_queue_token, -ENOENT);
        info(ap->logopt, "failed to mount %s", buf);
    }

If blocked thread never calls send_ready, the kernel side never releases the lock that blocks *all* subsequent attempts to use autofs.  So the race condition between T1 and T2 locks up everything else on the system. 

We have a local workaround (we think) by modifying our FUSE daemon to close all open file descriptors.  However, automount should be resilient to this.

Version-Release number of selected component (if applicable):

# rpm -q autofs
autofs-5.0.7-56.el7.x86_64

How reproducible:

Ask autofs to mount two filesystems at nearly the same time.

Steps to Reproduce:
1.
2.
3.

Actual results:

Both filesystems mount.

Expected results:

One filesystem doesn't mount, mount process remains running, all access to autofs hangs.

Additional info:

Comment 2 Michael Fenn 2017-11-03 14:42:10 UTC

Created attachment 1347413 [details]
Proposed fix using pipe2

Attached a proposed patch to use pipe2().  We noticed that the preprocessor macros surrounding the pipe2() logic don't have the expected results on systems with pipe2().  This patch fixes that logic and puts in a workaround for systems that don't have pipe2().

Comment 3 Ian Kent 2017-11-13 01:17:52 UTC

(In reply to Michael Fenn from comment #0)
> Created attachment 1347127 [details]
> Core of stuck automount process
> 
> Description of problem:
> Automount is multi-threaded and can mount two filesystems at the same time
> in different threads, and there is a race condition between these threads:
> 
> Consider two threads in do_spawn, and one of them (T1) is in open_pipe,
> *between* the call to pipe and fcntl(..., FD_CLOEXEC) and the other one (T2)
> calls exec("mount", ..."
> 
> At the moment that exec is called, the socketpair in T1 contains two open
> file descriptors, neither of which is CLOEXEC.  So those file descriptors
> stay open in the mount corresponding to T2.   Nobody in the T2 stack (mount,
> which execs a FUSE mount helper, which daemonizes itself) ever closes those
> file descriptors.
> 
> Now return to T1.  Because do_spawn gets fancy with capturing and logging
> stderr, the parent branch of the fork() in T1 jumps through hoops to ignore
> SIGCHLD, and to do 'timed_reads' on T1's mount's stdout/stderr.  Those timed
> reads keep reading until poll returns a POLLHUP.  That's supposed to happen
> when T1's mount process terminates (or when it closes stdout and stderr). 
> BUT now, T2 also has the other end of the pipe open.  So T1's timed_read
> never gets the POLLHUP that would tell it to quit.  It's stuck forever.
> 
> Normally this race is fairly harmless because the mount helper will exit,
> taking all its file descriptors with it.  However, in the case where the
> mount helper forks off a daemon (e.g. a FUSE filesystem), the file
> descriptors "inherited" by the daemon may never get closed, leading to a
> defunct mount process.
> 
> root     18819     1  0 Oct16 ?        00:00:00 /bin/bash --norc -c ...
> root     18827  1279  0 Oct16 ?        00:00:00 [mount] <defunct>
> root     18835     1  0 Oct16 ?        00:00:05 /sbin/mount.fusedaemon
> 
> When automount is wedged in this way, it's even worse than just a zombie
> mount process hanging around. The whole autofs subsystem is wedged.  Here's
> a stack trace for the stuck thread (mount.fs123p7 is the fuse daemon):
>

snip ...
 
> 
> Note in frame 13, the wedged thread is in lookup_nss_mount.  Unfortunately,
> in frame 14, the kernel is waiting for us to send the ready or fail ioctls.
> 
>     info(ap->logopt, "attempting to mount entry %s", buf);
> 
>     set_tsd_user_vars(ap->logopt, mt.uid, mt.gid);
> 
>     status = lookup_nss_mount(ap, NULL, mt.name, mt.len);
>     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &state);
>     if (status) {
>         ops->send_ready(ap->logopt,
>                 ap->ioctlfd, mt.wait_queue_token);
>         info(ap->logopt, "mounted %s", buf);
>     } else {
>         /* TODO: get mount return status from lookup_nss_mount */
>         ops->send_fail(ap->logopt,
>                    ap->ioctlfd, mt.wait_queue_token, -ENOENT);
>         info(ap->logopt, "failed to mount %s", buf);
>     }
> 
> If blocked thread never calls send_ready, the kernel side never releases the
> lock that blocks *all* subsequent attempts to use autofs.  So the race
> condition between T1 and T2 locks up everything else on the system. 

That might be a separate problem.
 
> 
> We have a local workaround (we think) by modifying our FUSE daemon to close
> all open file descriptors.  However, automount should be resilient to this.
> 

Yes it should.

These functions were added (contributed) to prevent these file
descriptors leaking into exec'ed processes but as you describe
there is a race in them.

But the problem exists in all of the open_setmntent_r(),
open_fopen_r(), open_fd_mode() and open_fd() as well as in
open_pipe().

And I'm not sure that calling close(2) on several thousand
file descriptors on every exec is a good approach to to
resolving the problem either.

I'm inclined to alter these functions to ensure the file
handles are all properly set to close on exec instead. That
probably means moving them out of the include file and adding
a mutex to serialize accesses.

Your description also appears to imply that __have_pipe2 no
longer does what it's expected to do.

I'm not sure about that either so perhaps there's a need to
update configure.in in the way you describe to replace the
usage of __have_pipe2.

Ian

Comment 4 Michael Fenn 2017-11-13 03:27:12 UTC

> And I'm not sure that calling close(2) on several thousand
> file descriptors on every exec is a good approach to to
> resolving the problem either.
> 
> I'm inclined to alter these functions to ensure the file
> handles are all properly set to close on exec instead. That
> probably means moving them out of the include file and adding
> a mutex to serialize accesses.

For RHEL specifically, both el6 and el7 have pipe2(2) support, so the close(2) loop could be dropped entirely.  Wrapping them in a mutex would also work.

> 
> Your description also appears to imply that __have_pipe2 no
> longer does what it's expected to do.
> 
> I'm not sure about that either so perhaps there's a need to
> update configure.in in the way you describe to replace the
> usage of __have_pipe2.
> 
> Ian


As far as I can tell, __have_pipe2 was introduced in glibc in 2008 but was really meant for glibc's internal use, testing for pipe2(2) should be done in the usual way with autoconf.  The __have_pipe2 macro was changed in 2012, which is probably he source of the regression in el7, and was just recently removed altogether.  The attached patch includes a change to configure.in which should test for pipe2(2).

Glibc changelogs below:

2008-07-27  Ulrich Drepper  <drepper>

...
    * sysdeps/unix/sysv/linux/syscalls.list: Add __pipe2 alias.
...


2012-08-18  Mike Frysinger  <vapier>

    * include/sys/socket.h (__have_sock_cloexec): Add attribute_hidden.
    * include/unistd.h (__have_sock_cloexec): Likewise.
    (__have_pipe2): Likewise.
    (__have_dup3): Likewise.

...

2012-08-18  Mike Frysinger  <vapier>

    [BZ #9685]
    * include/unistd.h (__have_pipe2): Change define into an extern int.
 
...
 
2017-04-18  Florian Weimer  <fweimer>

    * include/unistd.h (__have_pipe2): Remove declaration.
    * socket/have_sock_cloexec.c (__have_pipe2): Remove definition.
    * libio/iopopen.c (_IO_new_proc_open): Assume that pipe2 is
    available.
    * posix/wordexp.c (exec_comm_child, exec_comm): Likewise.
    * sysdeps/unix/sysv/linux/kernel-features.h (__ASSUME_PIPE2):
    Remove definition.

Comment 5 Ian Kent 2017-11-13 05:07:03 UTC

(In reply to Michael Fenn from comment #4)
> > And I'm not sure that calling close(2) on several thousand
> > file descriptors on every exec is a good approach to to
> > resolving the problem either.
> > 
> > I'm inclined to alter these functions to ensure the file
> > handles are all properly set to close on exec instead. That
> > probably means moving them out of the include file and adding
> > a mutex to serialize accesses.
> 
> For RHEL specifically, both el6 and el7 have pipe2(2) support, so the
> close(2) loop could be dropped entirely.  Wrapping them in a mutex would
> also work.

I think so and it should also satisfy the no leaked file
descriptors on exec selinux requirement for other descriptors.

I'm wrapping the entire functions with a mutex as fork(2)
probably needs to either block if a descriptor in being
opened or the open_*() functions need to be blocked until
fork(2) is complete.

> 
> > 
> > Your description also appears to imply that __have_pipe2 no
> > longer does what it's expected to do.
> > 
> > I'm not sure about that either so perhaps there's a need to
> > update configure.in in the way you describe to replace the
> > usage of __have_pipe2.
> > 
> > Ian
> 
> 
> As far as I can tell, __have_pipe2 was introduced in glibc in 2008 but was
> really meant for glibc's internal use, testing for pipe2(2) should be done
> in the usual way with autoconf.  The __have_pipe2 macro was changed in 2012,
> which is probably he source of the regression in el7, and was just recently
> removed altogether.  The attached patch includes a change to configure.in
> which should test for pipe2(2).

Yes, the then glibc maintainer that wrote these probably
didn't expect these to go away, I'm adding the configure
fragment.

I'm working on this now and the only question I have (to myself)
is whether I will need to use fork handlers but I think probably
not as the child process should have already been cloned when
fork(2) returns.

Ian

Comment 6 Michael Fenn 2017-11-13 16:56:04 UTC

(In reply to Ian Kent from comment #5)
> (In reply to Michael Fenn from comment #4)

> I'm working on this now and the only question I have (to myself)
> is whether I will need to use fork handlers but I think probably
> not as the child process should have already been cloned when
> fork(2) returns.
> 
> Ian

Great, thanks for taking a look!

Comment 7 Ian Kent 2017-12-07 06:32:07 UTC

Created attachment 1364053 [details]
Patch - update configure to check for pipe2(2)

Comment 8 Ian Kent 2017-12-07 06:33:12 UTC

Created attachment 1364054 [details]
Patch - fix open calls not using open_xxxx() calls

Comment 9 Ian Kent 2017-12-07 06:34:16 UTC

Created attachment 1364055 [details]
Patch - move open_xxxx() functions to spawn.c

Comment 10 Ian Kent 2017-12-07 06:35:18 UTC

Created attachment 1364056 [details]
Patch - serialize calls to open_xxxx() functions

Comment 11 Ian Kent 2017-12-07 06:54:33 UTC

An autofs package with the above changes has been built
and is available at:
http://people.redhat.com/~ikent/autofs-5.0.7-77.bz1509088.1.el7/

I believe this package will resolve the problem reported
in this bug.

Could you test it and let me know how it goes please.
Ian

Comment 12 Michael Fenn 2017-12-08 15:01:07 UTC

Thanks Ian.  I've been working on a synthetic test to reproduce the race condition outside of our production environment, but so far I haven't been successful. I'll keep working on it.

Comment 16 Michael Fenn 2017-12-15 17:55:28 UTC

I deployed the patched version into our production environment on 2017-12-13 and reverted the workaround in our fuse filesystem.  So far we haven't seen any more hangs.

Comment 21 errata-xmlrpc 2018-04-10 18:18:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0977