Bug 1295189 - fork hangs after mallocs in multithreaded processes
Summary: fork hangs after mallocs in multithreaded processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 23
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Florian Weimer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-03 12:15 UTC by Albert Flügel
Modified: 2016-01-05 21:55 UTC
CC List: 8 users

Fixed In Version: glibc-2.22-7.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-05 21:55:14 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Program to reproduce the problem (2.38 KB, text/x-csrc)
2016-01-03 12:15 UTC, Albert Flügel

Description Albert Flügel 2016-01-03 12:15:27 UTC
Created attachment 1111139 [details]
Program to reproduce the problem

Description of problem:
Yes, I am marking this urgent, as in my opinion it has the potential to break any notably loaded web service where Apache forks subprocesses, which I guess will be quite a number.

A multithreaded process hangs during fork after having done some malloc and free calls. The more threads the process has, the more likely the hang becomes.

Version-Release number of selected component (if applicable): glibc-2.22.6


How reproducible:
Compile and run the attached program. Preprocessor flags let you specify the number of threads (macro NTHREADS, default 40) and the likelihood that a thread forks in its endless loop (default 0.1). Example:
gcc tmf.c -o tmf -lpthread -g -DNTHREADS=32

The program creates a number of threads. Each thread runs an endless loop doing malloc or free on a list of pointers. With a certain probability (default 10%) it forks and executes usleep 1 in the subprocess.

The problem is that the more threads are started, the more likely it is that all threads of the process hang in a futex system call with a stack like this:
#0  0xb7757bc8 in __kernel_vsyscall ()
#1  0xb764e392 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0xb75bf454 in malloc () from /lib/libc.so.6
#3  0x08048885 in th (nth=0xe) at tmf.c:57
#4  0xb771b452 in start_thread () from /lib/libpthread.so.0
#5  0xb76402fe in clone () from /lib/libc.so.6

On my single-processor Pentium machine with 8 threads the process almost never hangs and uses all the CPU it can get, as it should. With 40 threads the process always hangs; in between, the hang becomes more likely the more threads there are.

Steps to Reproduce:
1. gcc tmf.c -o tmf -lpthread -DNTHREADS=40
2. ./tmf
3. If you like, attach strace or gdb to the process to see more details

Actual results:
Process hangs

Expected results:
All threads are continuously running, doing malloc and free and fork ...

Additional info:
The phenomenon is never seen with lower glibc versions, e.g. on Fedora 21 or 22.
So in my opinion this is clearly a new problem with the potential to break many services.
There is a similar bug open: https://bugzilla.redhat.com/show_bug.cgi?id=906468 . Nonetheless, as that one refers to RHEL 7 and dates back to 2013, I am opening this new one, because my suspicion is that it is a different issue. It could be that this problem was introduced while trying to fix 906468.

Comment 1 Florian Weimer 2016-01-03 16:42:21 UTC
This is likely due to this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19182

Bug 906468 is indeed a different problem.

Comment 2 Florian Weimer 2016-01-04 09:35:28 UTC
Note that the test case needs to run with an environment setting like MALLOC_ARENA_MAX=20 if you have sufficiently many cores.

Comment 3 Albert Flügel 2016-01-04 09:44:05 UTC
Sorry, I don't understand which test case you mean. Should this environment setting change the behaviour of my program? (I currently don't have access to my Fedora 23 machine; later today.) In my understanding, MALLOC_ARENA_MAX is a parameter affecting performance, not whether something works or not.
In general I agree; it looks to me, too, like the issue reported at sourceware.org with id 19182.
So will the fix be available / backported to Fedora 23?
I'd gladly volunteer to test / verify whatever necessary.

Comment 4 Florian Weimer 2016-01-04 09:50:06 UTC
(In reply to Albert Flügel from comment #3)
> Sorry, I don't understand which test case you mean. Should this
> environment setting change the behaviour of my program? (I currently
> don't have access to my Fedora 23 machine; later today.) In my
> understanding, MALLOC_ARENA_MAX is a parameter affecting performance,
> not whether something works or not.

On my test machine, the arena limit is 64.  This means that with 40 threads, all threads receive distinct arenas and reused_arena is never called.  The deadlock happens due to an interaction between the malloc fork handler and the reused_arena function.  Lowering the arena limit to 20 means that reused_arena is actually called.

> So will it be available / backported to Fedora 23 ?

I'm preparing updated packages right now.

Comment 5 Fedora Update System 2016-01-04 12:10:02 UTC
glibc-2.22-7.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-61b86643d4

Comment 6 Florian Weimer 2016-01-04 12:17:12 UTC
(In reply to Albert Flügel from comment #3)
> I'd gladly volunteer to test / verify whatever necessary.

I would appreciate it if you could test the glibc-2.22-7.fc23 update I just submitted.

Comment 8 Albert Flügel 2016-01-04 18:26:52 UTC
Confirmed. I cannot reproduce the problem anymore. The reproducer no longer hangs, and openvenus (framework and monitoring) works again. Super cool!

Comment 9 Fedora Update System 2016-01-04 20:53:19 UTC
glibc-2.22-7.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-61b86643d4

Comment 10 Fedora Update System 2016-01-05 21:55:10 UTC
glibc-2.22-7.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

