Bug 115429 - pthreads fork and memory allocation (calloc etc) deadlock
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: glibc
Version: 2.1
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Jakub Jelinek
QA Contact: Brian Brock
 
Reported: 2004-02-12 08:17 EST by David Simms
Modified: 2016-11-24 10:20 EST
Doc Type: Bug Fix
Last Closed: 2005-08-18 05:28:05 EDT


Attachments:
Reproducer (2.00 KB, text/plain)
2004-02-12 08:19 EST, David Simms
gdb backtraces - see thread 3 & 4 (4.12 KB, text/plain)
2004-02-12 08:22 EST, David Simms
Update so child only exits (2.00 KB, text/plain)
2004-02-12 08:37 EST, David Simms

Description David Simms 2004-02-12 08:17:22 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

Description of problem:
Due to a problem in another application, I wrote a simple test program
that creates a number of threads which repeatedly call calloc, malloc,
and fork. Allocation via sbrk in (m/c/re)alloc has been disabled by
mmapping at the end of brk, in order to simulate the real application.

Will attach "threadforks.c", compile with "gcc -g -lpthread -o
threadforks threadforks.c".

Running the program without any options will dead-lock.

Will attach appropriate gdb backtraces.

Thread 4: attempting to fork, it calls "ptmalloc_lock_all"; it is
holding "list_lock" and trying to obtain the "main_arena.mutex" lock.

Thread 3: attempting to calloc. Using sbrk fails; the thread has locked
"main_arena.mutex" and calls arena_get2, which in turn attempts to lock
"list_lock".

So each thread holds the lock the other one wants. Obviously a race
condition, but quite reproducible.

Avoiding the "brk" data segment seems to be a workaround.
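
For illustration only, the lock-order inversion described above can be shown with two ordinary pthread mutexes. This is a sketch, not glibc code: the names list_lock and arena_mutex are stand-ins for the internal locks, and lock_order_inversion_detected is a hypothetical helper. It uses pthread_mutex_trylock so that the inversion is reported instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical stand-ins for glibc's internal list_lock and
   main_arena.mutex; these are NOT the real glibc objects. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t arena_mutex = PTHREAD_MUTEX_INITIALIZER;

static sem_t allocator_ready;   /* posted once the allocator holds arena_mutex */
static sem_t forker_done;       /* posted once the forker has tried its lock */
static int allocator_result;    /* trylock result seen by the allocator */

/* Plays the role of thread 3: holds arena_mutex, then wants list_lock. */
static void *allocator_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&arena_mutex);
    allocator_result = pthread_mutex_trylock(&list_lock);
    if (allocator_result == 0)
        pthread_mutex_unlock(&list_lock);
    sem_post(&allocator_ready);
    sem_wait(&forker_done);
    pthread_mutex_unlock(&arena_mutex);
    return NULL;
}

/* Plays the role of thread 4: holds list_lock, then wants arena_mutex.
   Returns 1 if each thread found the other's lock busy, 0 if not,
   and -1 if the helper thread could not be started. */
int lock_order_inversion_detected(void)
{
    pthread_t tid;
    int forker_result;

    sem_init(&allocator_ready, 0, 0);
    sem_init(&forker_done, 0, 0);

    pthread_mutex_lock(&list_lock);
    if (pthread_create(&tid, NULL, allocator_thread, NULL) != 0) {
        pthread_mutex_unlock(&list_lock);
        return -1;
    }
    sem_wait(&allocator_ready);

    /* With blocking pthread_mutex_lock calls, both threads would
       now wait on each other forever. */
    forker_result = pthread_mutex_trylock(&arena_mutex);
    if (forker_result == 0)
        pthread_mutex_unlock(&arena_mutex);

    sem_post(&forker_done);
    pthread_join(tid, NULL);
    pthread_mutex_unlock(&list_lock);

    sem_destroy(&allocator_ready);
    sem_destroy(&forker_done);
    return forker_result == EBUSY && allocator_result == EBUSY;
}
```

Call lock_order_inversion_detected() from main and link with -lpthread. Replacing the two trylock calls with pthread_mutex_lock would give the hang seen in the backtraces.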

Version-Release number of selected component (if applicable):
glibc-2.2.4-32.11

How reproducible:
Always

Steps to Reproduce:
1. Create a number of threads.
2. Let each thread call calloc and fork.

Will attach "threadforks.c", compile with "gcc -g -lpthread -o
threadforks threadforks.c".

Run: while [ 1 ]; do ./threadforks; done
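
For readers without access to the attachment, here is a rough sketch of the kind of test described in the steps above. It is not the attached threadforks.c: the thread count, iteration count, and the name run_fork_alloc_stress are assumptions, the mmap trick that disables sbrk is left out, and the loop is bounded so that it terminates on a glibc that does not show the problem.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_THREADS 4    /* assumed value, not taken from the attachment */
#define ITERATIONS  50   /* bounded so the sketch always terminates */

/* Each worker allocates, frees, then forks a child that exits at once. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        /* volatile keeps the compiler from eliding the allocations */
        volatile char *p = calloc(1, 4096);
        volatile char *q = malloc(128);
        if (p) p[0] = 1;
        if (q) q[0] = 1;
        free((void *)q);
        free((void *)p);

        pid_t pid = fork();
        if (pid == 0)
            _exit(0);              /* child does nothing but exit */
        if (pid > 0)
            waitpid(pid, NULL, 0); /* reap the child */
    }
    return NULL;
}

/* Returns 0 when all workers were started and ran to completion,
   -1 if a thread could not be created. */
int run_fork_alloc_stress(void)
{
    pthread_t tids[NUM_THREADS];
    int started = 0;

    for (; started < NUM_THREADS; started++)
        if (pthread_create(&tids[started], NULL, worker, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return started == NUM_THREADS ? 0 : -1;
}
```

On an affected system the call would be expected to hang rather than return; on an unaffected one it returns 0.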
    

Actual Results:  Dead-locks

Expected Results:  Program should run to completion

Additional info:

Kernel: 2.4.9-e.37smp
Machine: Dual HT Xeon 2 GHz
Comment 1 David Simms 2004-02-12 08:19:12 EST
Created attachment 97604
Reproducer
Comment 2 David Simms 2004-02-12 08:22:14 EST
Created attachment 97605
gdb backtraces - see thread 3 (calloc) & 4 (fork)
Comment 3 Jakub Jelinek 2004-02-12 08:27:29 EST
You cannot do that in a threaded program.
See http://www.opengroup.org/onlinepubs/007904975/functions/fork.html
If a multi-threaded process calls fork(), the new process shall
contain a replica of the calling thread and its entire address space,
other resources. Consequently, to avoid errors, the child process may
until such time as one of the exec functions is called.

calloc is certainly not an async-signal-safe function, see
http://www.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_04.
Comment 4 Jakub Jelinek 2004-02-12 08:31:03 EST
Argh, end lines stripped off during mid-air collision, sorry.
If a multi-threaded process calls fork(), the new process shall
contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and
other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations
until such time as one of the exec functions is called.

calloc is certainly not an async-signal-safe function, see
http://www.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_04.html#tag_02_04_03
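
A minimal sketch of what the quoted rule permits in the child: no allocation and no stdio, only an async-signal-safe call (_exit here) before the child goes away. The function name fork_child_exits_immediately is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that calls only _exit(code), which is
   async-signal-safe, and returns the child's exit status to the
   caller (or -1 on error). */
int fork_child_exits_immediately(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);   /* no malloc/calloc, no printf in the child */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same shape applies when the child calls one of the exec functions instead of _exit.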
Comment 5 David Simms 2004-02-12 08:37:38 EST
Created attachment 97606
Update so child only exits
Comment 6 David Simms 2004-02-12 08:40:34 EST
Correct me if I'm wrong but it seems like the internal malloc_atfork
hook is installed too late ?

Attached updated version of reproducer. Still dead-locks without child
process behaving badly.
Comment 7 David Simms 2004-02-12 08:44:11 EST
So the alternative is that we suspend all other threads before
entering fork ?
Comment 8 David Simms 2004-02-13 05:03:16 EST
So I'm just making sure this problem is clear:

The updated reproducer "Update so child only exits" contains perfectly
"legal" code. The test still hangs, due to a dead-lock between
ptmalloc_lock_all and calloc.

The forking thread has yet to even enter the fork system call (only
one thread can: pthread_atfork_lock), whilst a thread already inside
the calloc implementation dead-locks with ptmalloc_lock_all.

I would suggest that the order of operations in ptmalloc_lock_all
(lock acquisition, and switching the atfork hook to malloc_atfork) is
the cause?

I assume the ptmalloc_lock_all/malloc_atfork mechanism is intended as a
locking protocol to ensure that the malloc system can be placed into
an "initialized state" for the new child process, locking out all
other threads to protect the child's view of its copy-on-write memory.

If you run threadforks in gdb you will find that it is possible for
this mechanism to deadlock with threads already in the calloc
implementation.
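
For comparison, the same kind of protocol can be written at application level with pthread_atfork: take the lock before fork, release it in parent and child afterwards. This is only a sketch of the general pattern, not of glibc's internals; state_lock and all function names are hypothetical.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical application lock guarding some shared state. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int prepare_calls, parent_calls;

/* Runs in the forking thread before fork: lock out other threads. */
static void before_fork(void)
{
    pthread_mutex_lock(&state_lock);
    prepare_calls++;
}

/* Runs in the parent after fork: release the lock again. */
static void after_fork_parent(void)
{
    parent_calls++;
    pthread_mutex_unlock(&state_lock);
}

/* Runs in the child after fork: the child's copy of the lock is
   held by its only thread, so release it there as well. */
static void after_fork_child(void)
{
    pthread_mutex_unlock(&state_lock);
}

/* Registers the handlers once, forks a child that exits at once,
   and returns 0 if the prepare and parent handlers each ran
   exactly once around this fork; -1 otherwise. */
int fork_with_atfork_handlers(void)
{
    static int registered;
    int status;
    int prepare_before = prepare_calls;
    int parent_before = parent_calls;
    pid_t pid;

    if (!registered) {
        if (pthread_atfork(before_fork, after_fork_parent,
                           after_fork_child) != 0)
            return -1;
        registered = 1;
    }

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (prepare_calls == prepare_before + 1 &&
            parent_calls == parent_before + 1) ? 0 : -1;
}
```

Such a protocol is only deadlock-free if every other code path takes its locks in an order compatible with the prepare handler, which is exactly the point in question here.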
Comment 9 Ulrich Drepper 2005-07-25 18:56:56 EDT
Does all this happen with newer releases (RHEL3/4), too?  The AS2.1 code is
modified only for critical problems and security issues.  The fork() handling in
LinuxThreads has always been buggy because of the broken foundation.  This is
why there is NPTL now.

I'm going to close this as WONTFIX unless I hear a good argument why not.
Comment 10 Danilo Sartori 2005-07-26 03:26:16 EDT
Here is the status of a multithreaded process as seen by means of gdb:

> Attaching to program: /sbin/scanner, process 2267
> Reading symbols from /lib/tls/libpthread.so.0...(no debugging symbols found)...done.
> [Thread debugging using libthread_db enabled]
> [New Thread -161695488 (LWP 2267)]
> [New Thread -194565200 (LWP 2272)]
> [New Thread -184075344 (LWP 2271)]
> [New Thread -173585488 (LWP 2270)]
> [New Thread -163095632 (LWP 2269)]
> Loaded symbols for /lib/tls/libpthread.so.0
> Reading symbols from /lib/tls/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib/ld-linux.so.2
> 0xf6505334 in malloc_consolidate () from /lib/tls/libc.so.6
> (gdb) info threads
>   5 Thread -163095632 (LWP 2269)  0xf657cd19 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
>   4 Thread -173585488 (LWP 2270)  0xf653c83c in __nanosleep_nocancel () from /lib/tls/libc.so.6
>   3 Thread -184075344 (LWP 2271)  0xf6570d0a in msgrcv () from /lib/tls/libc.so.6
>   2 Thread -194565200 (LWP 2272)  0xf65d5c5e in accept () from /lib/tls/libpthread.so.0
>   1 Thread -161695488 (LWP 2267)  0xf6505334 in malloc_consolidate () from /lib/tls/libc.so.6
> (gdb) thread 5
> [Switching to thread 5 (Thread -163095632 (LWP 2269))]#0  0xf657cd19 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> (gdb) bt
> #0  0xf657cd19 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #1  0xf65083f4 in _L_mutex_lock_10361 () from /lib/tls/libc.so.6
> #2  0x0000005d in ?? ()
> #3  0xf65c7898 in __DTOR_END__ () from /lib/tls/libc.so.6
> #4  0xf65c5ec0 in stderr () from /lib/tls/libc.so.6
> #5  0xf6507bab in ptmalloc_lock_all () from /lib/tls/libc.so.6
> #6  0xf653cac6 in fork () from /lib/tls/libc.so.6
> #7  0xf65d7364 in fork () from /lib/tls/libpthread.so.0
> #8  0x0804e1dd in connect_to_server ()
> #9  0x0804e395 in conditional_connect ()
> #10 0x08053a9b in th_conn2unica_analyze_message ()
> #11 0x08054240 in th_conn2unica_main ()
> #12 0xf65d0de8 in start_thread () from /lib/tls/libpthread.so.0
> #13 0xf656f93a in clone () from /lib/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread -161695488 (LWP 2267))]#0  0xf6505334 in malloc_consolidate () from /lib/tls/libc.so.6
> (gdb) bt
> #0  0xf6505334 in malloc_consolidate () from /lib/tls/libc.so.6
> #1  0xf6504b29 in _int_malloc () from /lib/tls/libc.so.6
> #2  0xf6503ecd in malloc () from /lib/tls/libc.so.6
> #3  0xf6537b95 in opendir () from /lib/tls/libc.so.6
> #4  0xf653806f in scandir () from /lib/tls/libc.so.6
> #5  0x080575eb in scan_home ()
> #6  0x08058b81 in main ()

Thread 5 is inside fork(), thread 1 is inside malloc(), and the two are
deadlocked. It happened (and still happens) on RHEL3 update 4. Do I have to
use a library other than pthread?
Comment 11 Danilo Sartori 2005-08-01 06:43:32 EDT
It seems that the deadlock seen on RHEL 3 (comment #10) was caused by a block
of dynamically allocated memory that was freed twice in my own code. Since
fixing that bug I have had no more deadlocks, so please disregard my previous
comment.
