Bug 628608 - robust mutex deadlocks instead of returning EOWNERDEAD
Summary: robust mutex deadlocks instead of returning EOWNERDEAD
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-08-30 14:04 UTC by Maxim Egorushkin
Modified: 2011-07-16 13:37 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-12 06:19:58 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
test source code (same as in the post) (7.07 KB, text/plain)
2010-08-31 22:09 UTC, Maxim Egorushkin
no flags Details
updated test source code (7.08 KB, text/x-c++src)
2010-09-01 12:03 UTC, Maxim Egorushkin
no flags Details

Description Maxim Egorushkin 2010-08-30 14:04:59 UTC
Description of problem:
pthread_mutex_lock() deadlocks on an abandoned (locked by a departed process) priority inversion protected process shared robust mutex.

Version-Release number of selected component (if applicable):
glibc-2.12-2.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Fork a child process that creates a priority inversion protected process shared robust mutex in a mapped file. It then locks the mutex and terminates.
2. After child process died, the parent process process maps the file and locks the mutex.
  
Actual results:
The parent process blocks forever in pthread_mutex_lock().

Expected results:
pthread_mutex_lock() should return EOWNERDEAD.

Additional info:

pthread_mutex_lock() deadlocks on an abandoned (locked by a departed process) priority inversion protected process shared robust mutex.

Stracing shows that pthread_mutex_lock() on the abandoned mutex does this:

futex(0x7f32ce325000, FUTEX_LOCK_PI, 1) = -1 ESRCH (No such process)

And stepping through glibc code in /usr/src/debug/glibc-2.12-22-g6f8d0c6/nptl/pthread_mutex_lock.c leads to the following block of code:

if (oldval != 0)
  {
    /* The mutex is locked.  The kernel will now take care of
       everything.  */
    int private = (robust
           ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
           : PTHREAD_MUTEX_PSHARED (mutex));
    INTERNAL_SYSCALL_DECL (__err);
    int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
                  __lll_private_flag (FUTEX_LOCK_PI,
                          private), 1, 0);

    if (INTERNAL_SYSCALL_ERROR_P (e, __err)
    && (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
        || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
      {
    assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
        || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
            && kind != PTHREAD_MUTEX_RECURSIVE_NP));
    /* ESRCH can happen only for non-robust PI mutexes where
       the owner of the lock died.  */
    assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);

    /* Delay the thread indefinitely.  */
    while (1)
      pause_not_cancel ();
      }

    oldval = mutex->__data.__lock;

    assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
  }

The end result is that pthread_mutex_lock() ends up in the loop commented "delay the thread indefinitely" blocking the thread forever.

It looks like either the kernel is not returning the correct return code for FUTEX_LOCK_PI (which should probably be EOWNERDEAD) or glibc is not handling the robust mutex correctly as the comment says ESRCH can not happen for robust mutexes, however it does happen.

It could also be relevant that set_robust_list() is only called in the parent process before the fork. Could be that fork() does not clone per-thread robust mutex lists. If the processes are unrelated, the first process doing step 1 and the second doing step 2, the second process successfully locks the mutex. However, pthread_mutex_lock() does not return EOWNERDEAD as it should.

It could also be related to this bug https://bugzilla.redhat.com/show_bug.cgi?id=400541

Comment 1 Maxim Egorushkin 2010-08-31 20:28:09 UTC
Some more details.

I attached a complete test that reproduces the problem.

It executes two processes.

Process 1 creates a file, maps it and initializes a robust shared mutex in it. Then process 1 locks the mutex and terminates. This in theory should make the kernel mark the mutex as abandoned.

Process 2 opens the file, maps it and locks the mutex. It also checks whether the mutex was reported as abandoned (EOWNERDEAD).

The test runs in three modes. In the mode 0 (default mode), it forks process 1, waits for it to terminate and then runs process 2. If a command line argument is provided, it runs only the corresponding process (1 or 2).

What is interesting is that when forking (mode 0) process 2 blocks forever on pthread_mutex_lock() forever. When not forking by running process 1 followed by process 2, process 2 successfully locks the abandoned mutex, but pthread_mutex_lock() does not report EOWNERDEAD. This could be related to the fact that set_robust_list() is called only in the parent process before the fork() (as I said).

[max@truth test]$ g++ -Wall -pthread -o test test.cc

[max@truth test]$ ./test
12216: process 1
12216: opened new file
12216: terminated
12215: process 2
12215: opened existing file
12215: locking shared data
12215: locking mtx_1...
  C-c C-c

[max@truth test]$ ./test 1
12217: process 1
12217: opened new file
12217: terminated

[max@truth test]$ ./test 2
12218: process 2
12218: opened existing file
12218: locking shared data
12218: locking mtx_1...
check failed: mtx_1 is not reported as abandoned12218: mtx_1 locked
12218: locking mtx_2...
check failed: mtx_1 is not reported as abandoned12218: mtx_2 locked

[max@truth test]$ cat test.cc
#include <boost/lexical_cast.hpp>

#include <exception>
#include <new>

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>

namespace {

struct Error : std::exception
{
    char buf_[1024];

    Error(char const* fmt, ...) __attribute__ ((format(printf,2,3)))
    {
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(buf_, sizeof buf_, fmt, ap);
        va_end(ap);
    }

    char const* what() const throw() { return buf_; }
};

void initialize(pthread_mutex_t& mtx)
{
    pthread_mutexattr_t mtxa;
    if(int err = pthread_mutexattr_init(&mtxa))
        throw Error("pthread_mutexattr_init: (%d)%s", err, strerror(err));
    if(int err = pthread_mutexattr_setpshared(&mtxa, PTHREAD_PROCESS_SHARED))
        throw Error("pthread_mutexattr_setpshared: (%d)%s", err, strerror(err));
    // often one process is a real-time and another is not. protect from priority inversion.
    if(int err = pthread_mutexattr_setprotocol(&mtxa, PTHREAD_PRIO_INHERIT))
        throw Error("pthread_mutexattr_setprotocol: (%d)%s", err, strerror(err));
    // if another process is killed while holding the mutex, the other process must be unaffected
    if(int err = pthread_mutexattr_setrobust_np(&mtxa, PTHREAD_MUTEX_ROBUST_NP))
        throw Error("pthread_mutexattr_setrobust: (%d)%s", err, strerror(err));
    if(int err = pthread_mutex_init(&mtx, &mtxa))
        throw Error("pthread_mutex_init: (%d)%s", err, strerror(err));
    pthread_mutexattr_destroy(&mtxa);
}

void initialize(pthread_cond_t& cnd)
{
    pthread_condattr_t cnda;
    if(int err = pthread_condattr_init(&cnda))
        throw Error("pthread_condattr_init: (%d)%s", err, strerror(err));
    if(int err = pthread_condattr_setpshared(&cnda, PTHREAD_PROCESS_SHARED))
        throw Error("pthread_condattr_setpshared: (%d)%s", err, strerror(err));
    if(int err = pthread_cond_init(&cnd, &cnda))
        throw Error("pthread_cond_init: (%d)%s", err, strerror(err));
    pthread_condattr_destroy(&cnda);
}


bool set_consistent(pthread_mutex_t& mtx)
{
    if(int err = pthread_mutex_consistent_np(&mtx))
        throw Error("pthread_mutex_consistent: (%d)%s", err, strerror(err));
    // the mutex is locked, this thread owns it now
    return true;
}

void lock(pthread_mutex_t& mtx, bool* abandoned)
{
    if(int err = pthread_mutex_lock(&mtx)) {
        if(EOWNERDEAD == err) // handle abandoned mutex
            *abandoned = set_consistent(mtx);
        else
            throw Error("pthread_mutex_lock: (%d)%s", err, strerror(err));
    }
    *abandoned = false;
}

void unlock(pthread_mutex_t& mtx)
{
    if(int err = pthread_mutex_unlock(&mtx))
        throw Error("pthread_mutex_unlock: (%d)%s", err, strerror(err));
}

void signal(pthread_cond_t& cnd)
{
    if(int err = pthread_cond_signal(&cnd))
        throw Error("pthread_cond_signal: (%d)%s", err, strerror(err));
}

void wait(pthread_cond_t& cnd, pthread_mutex_t& mtx, bool* abandoned)
{
    if(int err = pthread_cond_wait(&cnd, &mtx)) {
        if(EOWNERDEAD == err) // handle abandoned mutex
            *abandoned = set_consistent(mtx);
        else
            throw Error("pthread_cond_wait: (%d)%s", err, strerror(err));
    }
    else {
        *abandoned = false;
    }
}

struct SharedData
{
    pthread_mutex_t mtx_1;
    pthread_mutex_t mtx_2;
    pthread_cond_t cnd;
    unsigned long event;

    SharedData()
        : event()
    {
        initialize(mtx_1);
        initialize(mtx_2);
        initialize(cnd);
    }

    static SharedData* map(char const* filename, unsigned pid)
    {
        int fd = open(filename, O_CREAT | O_RDWR, mode_t(0666));
        if(fd < 0)
            throw Error("open('%s'): (%d)%s", filename, errno, strerror(errno));
        struct stat st;
        if(fstat(fd, &st))
            throw Error("fstat('%s'): (%d)%s", filename, errno, strerror(errno));
        bool new_file = !st.st_size;
        if(new_file) {
            if(ftruncate(fd, sizeof(SharedData)))
                throw Error("ftruncate('%s'): (%d)%s", filename, errno, strerror(errno));
        }
        void* mem = mmap(NULL, sizeof(SharedData), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if(MAP_FAILED == mem)
            throw Error("mmap('%s'): (%d)%s", filename, errno, strerror(errno));
        if(new_file)
            new (mem) SharedData;
        printf("%u: opened %s file\n", pid, new_file ? "new" : "existing");
        return static_cast<SharedData*>(mem);
    }
};

char const shared_file[] = "shared_file_test~";

pid_t spawn(int(*fn)())
{
    // fork a child process
    pid_t pid = fork();
    switch(pid) {
    case 0:
        exit(fn());
    case -1:
        abort();
    default:
        return pid;
    }
}

int process_1()
{
    unsigned pid = getpid();
    printf("%u: process 1\n", pid);

    SharedData* shared_data = SharedData::map(shared_file, pid);
    bool abandoned_not_used;
    lock(shared_data->mtx_1, &abandoned_not_used); // abandon this mutex, EOWNERDEAD in pthread_cond_wait
    lock(shared_data->mtx_2, &abandoned_not_used); // abandon this mutex, EOWNERDEAD in pthread_mutex_lock
    shared_data->event = 1;
    signal(shared_data->cnd);
    printf("%u: terminated\n", pid);

    return 0;
}

int process_2()
{
    unsigned pid = getpid();
    printf("%u: process 2\n", pid);

    SharedData* shared_data = SharedData::map(shared_file, pid);
    printf("%u: locking shared data\n", pid);

    bool abandoned;
    int failures = 0;

    printf("%u: locking mtx_1...\n", pid);
    lock(shared_data->mtx_1, &abandoned);
    failures += !abandoned;
    if(!abandoned)
        fprintf(stderr, "check failed: mtx_1 is not reported as abandoned");
    printf("%u: mtx_1 locked\n", pid);

    while(1 != shared_data->event)
        wait(shared_data->cnd, shared_data->mtx_1, &abandoned);
    // if(!abandoned)
    //     fprintf("check failed: mtx_1 is not reported as abandoned");

    printf("%u: locking mtx_2...\n", pid);
    lock(shared_data->mtx_2, &abandoned);
    failures += !abandoned;
    if(!abandoned)
        fprintf(stderr, "check failed: mtx_1 is not reported as abandoned");
    printf("%u: mtx_2 locked\n", pid);

    unlock(shared_data->mtx_1);
    unlock(shared_data->mtx_2);

    return failures ? EXIT_FAILURE : EXIT_SUCCESS;
}

}

int main(int ac, char** av)
{
    unsigned mode = ac > 1 ? boost::lexical_cast<unsigned>(1[av]) : 0;
    switch(mode) {
    case 0: {
        // start with a new file
        unlink(shared_file);
        // fork process_1 and wait till it terminates
        pid_t child = spawn(process_1);
        int child_status;
        if(-1 == waitpid(child, &child_status, 0))
            throw Error("waitpid: (%d)%s", errno, strerror(errno));

        // now do process_2
        return process_2();
    }

    case 1:
        // start with a new file
        unlink(shared_file);
        // run only process 1
        return process_1();

    case 2:
        // start with an existing file with abandoned mutexes
        // run only process 2
        return process_2();
    }
}

Max

Comment 2 Maxim Egorushkin 2010-08-31 22:09:05 UTC
Created attachment 442293 [details]
test source code (same as in the post)

Comment 3 Chuck Ebbert 2010-09-01 09:30:07 UTC
How can we tell if this is really a kernel bug and not a bug in the c++ runtime and/or pthreads?

Comment 4 Maxim Egorushkin 2010-09-01 12:02:17 UTC
Update: the case with unrelated processes, contrary to my original report, works correctly, i.e. pthread_mutex_lock() does return EOWNERDEAD.

In the case with forking, when the child process terminates with mutex locked this is what the shared file contains.

$ od -Ax -tx4 shared_file_test~
000000 80004370 00000001 00004370 00000001
000010 000000b0 00000000 aa45a048 00007fbb
...

Note 80004370 value - it's got FUTEX_WAITERS bit set along with TID 0x4370, i.e. it is still owned by the departed thread. So, the kernel does not mark the futex with FUTEX_OWNER_DIED bit as it should.

When forking, set_robust_list() is only ever called in the parent process before forking. The man page for set_robust_list() does not say how its state is handled by fork(). Judging from the state of the mutex above, it looks like the child process loses the robust mutex list, so that the kernel can't walk it and mark the mutex with FUTEX_OWNER_DIED.

There are basically two choices how to handle the robust mutex list head on fork():
a) kernel clones the robust list head on fork()
b) glibc calls set_robust_list() after fork()

Comment 5 Maxim Egorushkin 2010-09-01 12:03:21 UTC
Created attachment 442394 [details]
updated test source code

Comment 6 Maxim Egorushkin 2010-09-01 12:13:53 UTC
Just double checked the test by running it on Solaris 10. It handles the forking case correctly.

Comment 7 Chuck Ebbert 2010-09-02 09:57:57 UTC
If you're trying to show that there is a kernel bug somewhere you're going to have to write a simple example in C that calls kernel APIs directly.

Comment 8 Maxim Egorushkin 2010-09-02 10:57:27 UTC
This test shows a bug in robust mutex behaviour. It was originally assigned to glibc because this is the API the test invokes. 

Is it not a solid starting point for glibc mutex implementation maintainers to find the root cause of the issue?

Comment 9 Jakub Jelinek 2010-09-06 09:29:40 UTC
Doing a set_robust_list syscall after every fork if -lpthread is linked in would be IMHO way too costly, and even doing it after every fork when there is only one thread present (as, in multithreaded process after fork in the child it is not valid to call any of the pthread_mutex_* functions).
I think best would be to do it lazily (perhaps even do it lazily at -lpthread initialization and pthread_create), by testing some flag in the robust mutex handling stuff and doing set_robust_list if the flag isn't set (and setting it).
This flag would be just cleared for the fork child (perhaps by remembering it in the parent, doing fork and restoring it in the parent).

Comment 10 Andreas Schwab 2010-09-06 09:51:19 UTC
Why is the robust list cleared on fork?  That does not look right.

Comment 11 Maxim Egorushkin 2010-09-07 21:42:55 UTC
(In reply to comment #9)
> Doing a set_robust_list syscall after every fork if -lpthread is linked in
> would be IMHO way too costly, and even doing it after every fork when there is
> only one thread present (as, in multithreaded process after fork in the child
> it is not valid to call any of the pthread_mutex_* functions).
> I think best would be to do it lazily (perhaps even do it lazily at -lpthread
> initialization and pthread_create), by testing some flag in the robust mutex
> handling stuff and doing set_robust_list if the flag isn't set (and setting
> it).
> This flag would be just cleared for the fork child (perhaps by remembering it
> in the parent, doing fork and restoring it in the parent).

It would have to check that flag on every mutex lock, I would guess, because if the mutex is in the shared memory it has most likely been initialized by another process. In this case, the first call a child process does on that mutex is pthread_mutex_lock(). Checking that flag in pthread_mutex_lock() may be an unnecessary overhead for the mutex fast path. Again, one could patch the PLT after the first call pthread_mutex_lock() to point to another version of pthread_mutex_lock() which does not do the check, but it may not work with static linking, and looks too dodgy.

The kernel should probably clone the head of the robust mutex list. In this case, no extra system calls are required from the child process.

Just speculating.

Comment 12 Andreas Schwab 2010-09-21 13:32:18 UTC
The kernel should just leave the robust list alone.

Comment 13 Jonathan Wakely 2011-05-10 13:04:02 UTC
Any chance of an update?

The attached testcase is trivially reproducable on Fedora or RHEL5, there's nothing in the C++ runtime that would affect this, the test could be rewritten in C (it would be tedious to do so, but if that's really what's stalling anyone looking at it I can do so)

Comment 14 Jonathan Wakely 2011-05-10 20:44:37 UTC
(In reply to comment #7)
> If you're trying to show that there is a kernel bug somewhere you're going to
> have to write a simple example in C that calls kernel APIs directly.


#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <syscall.h>

int main()
{
    void *list;
    size_t sz;
    if (syscall(__NR_get_robust_list, 0, &list, &sz))
        abort();
    if (!list)
        abort();
    
    if (fork() == 0)
    {
        if (syscall(__NR_get_robust_list, 0, &list, &sz))
            return 1;
        if (!list)
            return 2;
        return 0;
    }

    int status;
    waitpid(-1, &status, 0);
    if (!WIFEXITED(status))
    {
        printf("child exited abnormally\n");
        return 1;
    }

    switch(WEXITSTATUS(status))
    {
    case 1:
        printf("child failed to call get_robust_list\n");
        return 1;
    case 2:
        printf("child has no robust list\n");
        return 1;
    default:
        printf("child exited normally\n");
    }

    return 0;
}

No C++ or <pthread.h>, just link with -lpthread

This shows that get_robust_list returns NULL in the child process.

Comment 15 Chuck Ebbert 2011-05-10 23:06:39 UTC
(In reply to comment #12)
> The kernel should just leave the robust list alone.

The kernel clears the robust list for the child upon fork, because at that point the newly-created thread does not own any robust futexes (the parent owns all of them, and the list cannot be shared.) If the child inherited the list and it died immediately after forking, the list would be walked and all the parent's futexes would be unlocked. The child needs its own list. This is not a kernel bug.

Comment 16 Andreas Schwab 2011-05-11 07:25:26 UTC
A fork creates a new process, not a thread, so nothing is shared.

Comment 17 Chuck Ebbert 2011-05-11 11:51:55 UTC
(In reply to comment #16)
> A fork creates a new process, not a thread, so nothing is shared.

The kernel makes no distinction between processes and threads. (A thread is just a process that shares its VMAs with the parent process.) In any case, the kernel clears the robust list when creating a new thread/process. This is the defined behavior and Fedora can't change this. If you reassign this to kernel I will close it as NOTABUG.

Comment 18 Andreas Schwab 2011-05-11 12:00:23 UTC
The child does not need its own list.

Comment 19 Chuck Ebbert 2011-05-12 06:19:58 UTC
Not a kernel bug, and glibc maintainers refuse to fix glibc.

Comment 20 Jakub Jelinek 2011-05-12 06:56:30 UTC
(In reply to comment #19)
> Not a kernel bug, and glibc maintainers refuse to fix glibc.

That's wrong.  It is bug, that could be fixed either in the kernel, or in glibc, so it would be good to discuss this upstream where it is more beneficial to fix it.  Changing it in glibc might be more expensive IMNSHO.

Comment 21 Jakub Jelinek 2011-05-12 06:58:33 UTC
I'd add that the fix can be also kernel + glibc, like a new CLONE_* flag that glibc would need to use.

Comment 22 Jonathan Wakely 2011-06-28 00:14:00 UTC
NOTABUG is obviously wrong, because the robust mutex API doesn't work as it's supposed to.  "I don't want this in my bug list" is not a valid reason to say NOTABUG.

Comment 23 Chuck Ebbert 2011-06-29 16:38:10 UTC
(In reply to comment #22)
> NOTABUG is obviously wrong, because the robust mutex API doesn't work as it's
> supposed to.  "I don't want this in my bug list" is not a valid reason to say
> NOTABUG.

It is not a kernel bug, it's the defined behavior for the kernel when a thread is cloned that the robust_list gets cleared. It is a glibc bug that they fail to deal with that, but they simply reassign the bug to kernel instead of fixing it.

Comment 24 Jonathan Wakely 2011-06-30 15:29:30 UTC
I don't care what component it is, there's a bug somewhere, that needs fixing somewhere. Closing it as NOTABUG isn't going to get it fixed anywhere.

Please reopen.

Comment 25 Dave Jones 2011-07-01 00:36:06 UTC
regardless of where the bug is, this is not the place to discuss changes of API behaviour.  The Fedora kernel won't be deviating from upstream in this regard, so if it needs changing, it needs to be discussed there.

Take it to linux-kernel.org, and/or libc-alpha

Comment 26 Jonathan Wakely 2011-07-16 13:37:35 UTC
submitted as http://sourceware.org/bugzilla/show_bug.cgi?id=13002


Note You need to log in before you can comment on or make changes to this bug.