Bug 241665

Summary:	mtrace() deadlock in a multi-threaded environment
Product:	[Fedora] Fedora	Reporter:	Olivier Baudron <olivier.baudron>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	F8	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-06-22 17:12:15 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Olivier Baudron 2007-05-29 13:41:19 UTC

Version: glibc-2.5-10.fc6

When mtrace() is used and several threads fork() and exec() a deadlock can
happen. A testcase follows. It hangs with more than 50% probability on a
bi-proc. strace shows that some futex waits forever.

/* -------- testcase.c -------- */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>
#include <string.h>
#include <sys/wait.h>
#include <mcheck.h>

void check_error (int err) {
    if (err) {
        printf ("Error : %s\n", strerror (errno));
        exit (1);
    }
}

void *new_thread (void *param) {
    int err;
    pid_t child, pid;

    child = fork ();
    check_error (child == -1);

    if (child == 0) {
        err = execlp ("echo", "echo", "-n", (char *) NULL);
        check_error (err == -1);
    }

    pid = waitpid (child, NULL, 0);
    check_error (pid == -1);

    return NULL;
}

int main (int argc, char **argv) {
    int i, err;
    pthread_t thread[50];
    pthread_attr_t attr;

    mtrace ();
    err = pthread_attr_init (&attr);
    check_error (err);

    for (i=0; i<50; i++) {
        err = pthread_create (thread+i, &attr, new_thread, NULL);
        check_error (err);
    }

    for (i=0; i<50; i++) {
        err = pthread_join (thread[i], NULL);
        check_error (err);
    }

    return 0;
}
/* -------- end -------- */

$ gcc -Wall -o testcase testcase.c -lpthread
$ ./testcase

Comment 1 Jakub Jelinek 2007-05-30 14:25:13 UTC

Can't reproduce this, with the same glibc, on quad core x86_64.  Tried both -m32
and -m64 builds, and running the proglet both on all CPUs and with
taskset -c 0,2 ./testcase.  In each case several thousands of invocations without
a hang.
Can you please install glibc-debuginfo for you arch, attach with gdb to it
and get all backtraces?

Comment 2 Olivier Baudron 2007-06-04 15:57:47 UTC

I forgot to mention that you need to set MALLOC_TRACE:
$ export MALLOC_TRACE=trace.txt
$ ./testcase

Two process hangs:
$ ps x
23752 pts/4    Sl+    0:00 ./testcase
23849 pts/4    S+     0:00 ./testcase

For process 23752:
------------------

(gdb) info threads
  2 Thread -1669952624 (LWP 23841)  0x00672402 in __kernel_vsyscall ()
  1 Thread -1208396096 (LWP 23752)  0x00672402 in __kernel_vsyscall ()

(gdb) thread 1
(gdb) bt
#0  0x00672402 in __kernel_vsyscall ()
#1  0x00da84b7 in pthread_join (threadid=2625014672, thread_return=0x0) at
pthread_join.c:89
#2  0x0804874c in main () at testcase.c:52

(gdb) thread 2
(gdb) bt
#0  0x00672402 in __kernel_vsyscall ()
#1  0x00daecbb in ?? () from /lib/libpthread.so.0
#2  0x08048687 in new_thread (param=0x0) at testcase.c:29
#3  0x00da73db in start_thread (arg=0x9c768b90) at pthread_create.c:296
#4  0x001ce26e in ?? () from /lib/libc.so.6

For process 23849:
------------------

(gdb) bt
#0  0x00672402 in __kernel_vsyscall ()
#1  0x001dab6e in ?? () from /lib/libc.so.6
#2  0x0016c717 in ?? () at mtrace.c:188 from /lib/libc.so.6
Previous frame identical to this frame (corrupt stack?)

Comment 3 Olivier Baudron 2007-06-11 19:21:01 UTC

Is someone able to reproduce the bug?

Comment 4 Olivier Baudron 2007-06-15 20:29:07 UTC

I think I have found the problem.
mtrace() uses a mutex "lock" to write in the output text file. When "lock" is
being acquired while another thread fork() the child inherits a locked (or
inconsistent mutex) that cannot be unlocked. Therefore the child waits forever
when execve() calls malloc() because the lock cannot be acquired to write in the
output text file. One solution is to use pthread_atfork().

Also, I retried the testcase on a single CPU and this time the stack is not
corrupted. In each child it locks with the following stack:

(gdb) bt
#0  0xb7fcc410 in __kernel_vsyscall ()
#1  0x41686b6e in __lll_mutex_lock_wait () from /lib/libc.so.6
#2  0x41618717 in _L_lock_858 () at mtrace.c:188
#3  0x416185fc in tr_mallochook (size=137, caller=0x4163b5b5) at mtrace.c:182
#4  0x41614e85 in *__GI___libc_malloc (bytes=137) at malloc.c:3463
#5  0x4163b5b5 in *__GI_execvp (file=0x804885f "echo", argv=0x7f745388) at
execvp.c:108
#6  0x4163b8be in *__GI_execlp (file=0x804885f "echo", arg=0x804885f "echo") at
execlp.c:67
#7  0x08048657 in new_thread (param=0x0) at foo.c:25
#8  0x417203db in start_thread (arg=0x7f746b90) at pthread_create.c:296
#9  0x4167a26e in clone () from /lib/libc.so.6

I hope this helps. In any case, any advice is welcome.

Comment 5 Jakub Jelinek 2007-06-22 17:12:15 UTC

mtrace is inherently thread unsafe due to the use of malloc hooks.
Plus many of the __libc_freeres hooks aren't thread safe either.  While
valgrind stops all threads before calling __libc_freeres, mtrace doesn't.
In summary, for multithreaded programs use valgrind instead of mtrace.

Comment 6 Olivier Baudron 2008-06-21 13:53:22 UTC

Fixed.