Created attachment 1111139
Program to reproduce the problem
Description of problem:
Yes, I am marking this as urgent because, imo, it has the potential to break any notably loaded web service where Apache forks subprocesses, which I guess is quite a number of them.
A multithreaded process hangs during fork after having made some malloc and free calls. The more threads the process has, the more likely the problem is.
Version-Release number of selected component (if applicable): glibc-2.22.6
Compile and run the attached program. Preprocessor flags specify the number of threads (macro NTHREADS, default=40) and the likelihood that a thread does a fork in its endless loop (default 0.1). Example:
gcc tmf.c -o tmf -lpthread -g -DNTHREADS=32
The program creates a number of threads. Each thread runs an endless loop doing malloc or free on a list of pointers. With a certain probability (default 10%) it does a fork and executes usleep 1 in a subprocess.
The problem is that the more threads are started, the more likely it is that all threads of the process hang in a futex system call with a stack like this:
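For readers without access to the attachment, the behaviour described above can be sketched roughly as follows. This is not the attached tmf.c: the names `worker`, `worker_cfg`, `run_stress` and `NPTRS` are hypothetical, the loop is bounded instead of endless so that it terminates on a fixed glibc, and the child calls usleep(1) directly instead of executing a separate usleep program.

```c
#define _DEFAULT_SOURCE          /* for usleep() and rand_r() declarations */
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPTRS 64                 /* pointers per thread; arbitrary choice */

struct worker_cfg {              /* hypothetical helper, not from tmf.c */
    int iterations;              /* bounded, unlike the endless original */
    int fork_percent;            /* chance of a fork per iteration, 0..100 */
    unsigned int seed;           /* per-thread state for rand_r() */
};

static void *worker(void *arg)
{
    struct worker_cfg *cfg = arg;
    void *ptrs[NPTRS] = { 0 };

    for (int i = 0; i < cfg->iterations; i++) {
        /* Pick a slot: free it if occupied, otherwise allocate into it. */
        int slot = rand_r(&cfg->seed) % NPTRS;
        if (ptrs[slot] != NULL) {
            free(ptrs[slot]);
            ptrs[slot] = NULL;
        } else {
            ptrs[slot] = malloc(1 + rand_r(&cfg->seed) % 4096);
        }

        /* Occasionally fork a short-lived child and reap it. */
        if (rand_r(&cfg->seed) % 100 < cfg->fork_percent) {
            pid_t pid = fork();
            if (pid == 0) {
                usleep(1);       /* assumption: stands in for "usleep 1" */
                _exit(0);
            } else if (pid > 0) {
                waitpid(pid, NULL, 0);
            }
        }
    }

    for (int i = 0; i < NPTRS; i++)
        free(ptrs[i]);           /* free(NULL) is a no-op */
    return NULL;
}

/* Starts nthreads workers and joins them. Returns 0 if every thread was
 * created and finished, -1 on allocation or thread creation failure.
 * On an affected glibc this call may simply never return. */
int run_stress(int nthreads, int iterations, int fork_percent)
{
    pthread_t *tids = calloc(nthreads, sizeof *tids);
    struct worker_cfg *cfgs = calloc(nthreads, sizeof *cfgs);
    int started = 0, rc = 0;

    if (tids == NULL || cfgs == NULL) {
        free(tids);
        free(cfgs);
        return -1;
    }
    for (int i = 0; i < nthreads; i++) {
        cfgs[i] = (struct worker_cfg){ iterations, fork_percent, 1u + i };
        if (pthread_create(&tids[i], NULL, worker, &cfgs[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    free(cfgs);
    return rc;
}
```

Calling something like `run_stress(40, 100000, 10)` from a small `main` and linking with `-lpthread` would approximate the default settings described here; whether it hangs depends on the glibc version and the arena limit.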
#0 0xb7757bc8 in __kernel_vsyscall ()
#1 0xb764e392 in __lll_lock_wait_private () from /lib/libc.so.6
#2 0xb75bf454 in malloc () from /lib/libc.so.6
#3 0x08048885 in th (nth=0xe) at tmf.c:57
#4 0xb771b452 in start_thread () from /lib/libpthread.so.0
#5 0xb76402fe in clone () from /lib/libc.so.6
On my single-processor Pentium machine, with 8 threads the process almost never hangs and uses all the CPU it can get, as it should. With 40 threads the process always hangs, and in between the hang becomes more likely the more threads it has.
Steps to Reproduce:
1. gcc tmf.c -o tmf -lpthread -DNTHREADS=40
3. if you like, attach strace or gdb to the process and see more details
All threads are continuously running, doing malloc and free and fork ...
The phenomenon is never seen with older glibc versions, e.g. on Fedora 21 or 22.
So imo this is clearly a new problem with the potential to break many services.
There is a similar bug open: https://bugzilla.redhat.com/show_bug.cgi?id=906468 . Nonetheless, as that one refers to RHEL 7 and dates back to 2013, I am opening this new one, because I suspect it is a different issue. It could be that this problem was introduced while trying to fix 906468.
This is likely due to this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19182
Bug 906468 is indeed a different problem.
Note that the test case needs to run with an environment setting like MALLOC_ARENA_MAX=20 if you have sufficiently many cores.
Sorry, I don't understand which test case you mean. Should this environment setting change the behaviour of my program (I currently don't have access to my Fedora 23 machine, but will later today)? In my understanding MALLOC_ARENA_MAX is a parameter affecting performance, not whether something works or not.
In general I agree; this seems to me, too, to be the issue reported at sourceware.org with id 19182.
So will the fix be available / backported to Fedora 23?
I'd gladly volunteer to test / verify whatever is necessary.
(In reply to Albert Flügel from comment #3)
> Sorry, i don't understand, what test case you mean. Should this environment
> setting change the behaviour of my program (currently don't have access to
> my Fedora 23 machine - later today) ? In my understanding MALLOC_ARENA_MAX
> is a parameter effecting performance, not whether sth. works or not.
On my test machine, the arena limit is 64. This means that with 40 threads, all receive distinct arenas, and reused_arena is never called. The deadlock happens due to an interaction between the malloc fork handler and the reused_arena function. Lowering the arena limit to 20 means that reused_arena is actually called.
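As a rough illustration of the arithmetic above, the following sketch estimates the default arena limit and applies a lower one from inside a program. It relies on two things not stated in this report: that glibc derives its default limit from the core count (2 arenas per core on 32-bit, 8 per core on 64-bit), and that `mallopt(M_ARENA_MAX, n)` is the in-process counterpart of the MALLOC_ARENA_MAX environment variable. The helper names are hypothetical, and the "threads exceed limit" check is only a heuristic for when arenas have to be shared.

```c
#include <malloc.h>   /* mallopt(), M_ARENA_MAX (glibc-specific) */

/* Estimated default arena limit for a given core count, assuming the
 * glibc rule of 2 arenas per core on 32-bit and 8 per core on 64-bit. */
long default_arena_limit(long ncores)
{
    return ncores * (sizeof(long) == 4 ? 2 : 8);
}

/* Heuristic: once there are more threads than arenas, some threads have
 * to share an arena, which is when arena reuse comes into play. */
int threads_exceed_arena_limit(int nthreads, long limit)
{
    return nthreads > limit;
}

/* Cap the number of arenas for this process. Must run before the worker
 * threads start allocating. mallopt() returns 1 on success, 0 on error. */
int cap_arenas(int limit)
{
    return mallopt(M_ARENA_MAX, limit);
}
```

On a 64-bit machine with 8 cores the estimate is 64, so 40 threads stay below the limit; with a cap of 20 the same 40 threads exceed it, which matches the situation described above.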
> So will it be available / backported to Fedora 23 ?
I'm preparing updated packages right now.
glibc-2.22-7.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-61b86643d4
(In reply to Albert Flügel from comment #3)
> I'd gladly volunteer to test / verify whatever necessary.
I would appreciate if you could test the glibc-2.22-7.fc23 update I just submitted.
Confirmed. I cannot reproduce the problem anymore. The reproducer no longer triggers the hang, and openvenus (framework and monitoring) works again. Super cool!
glibc-2.22-7.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-61b86643d4
glibc-2.22-7.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.