Bug 1332917 - glibc: Deadlock between fflush, getdelim, and fork
Summary: glibc: Deadlock between fflush, getdelim, and fork
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Florian Weimer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 906468
 
Reported: 2016-05-04 11:12 UTC by Florian Weimer
Modified: 2016-05-15 04:53 UTC (History)
9 users

Fixed In Version: glibc-2.22-15.fc23, glibc-2.23.1-7.fc24
Doc Type: Bug Fix
Doc Text:
Clone Of: 906468
Environment:
Last Closed: 2016-05-15 04:53:13 UTC
Type: Bug
Embargoed:


Attachments


Links:
Sourceware 19431 (P2, RESOLVED): Deadlock between fflush, getdelim, and fork (last updated 2020-08-06 13:07:18 UTC)

Description Florian Weimer 2016-05-04 11:12:16 UTC
+++ This bug was initially created as a clone of Bug #906468 +++

I ran into a deadlock in glibc where fork deadlocks with malloc.  The problem appears after a few hours of running the program under heavy load, where heavy load means fork is called roughly 100 times per second.

I am running on a 16-core x86-64 (32 hardware threads), with plenty of RAM and disk space.  Linux 2.6.32-220.el6.x86_64.

The root of the problem appears to be in glibc's iogetdelim.c.  The function _IO_getdelim calls _IO_acquire_lock (fp) and then calls malloc.  malloc in turn attempts to acquire the arena (heap) lock, which deadlocks because the forking mechanism holds the heap lock until forking is complete.

An easy fix appears to be in _IO_getdelim itself: perform the malloc call before _IO_acquire_lock (fp).  With the nested locking eliminated, the deadlock cannot occur.



I am attaching below my detailed analysis of the problem from the core dump of the deadlocked process.  I edited out the symbols which are not relevant to the problem at hand.



Thread 58

Thread 58 (Thread 0x7f0f3fdfd700 (LWP 13994)):
Reading in symbols for ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c...done.
Reading in symbols for iopopen.c...done.
Reading in symbols for src/LuaPluginSessionStart.cpp...done.
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00007f0ff1395354 in _L_lock_2167 () at genops.c:598
#2  0x00007f0ff13945ff in *__GI__IO_list_lock () at genops.c:1299
#3  0x00007f0ff13cbc47 in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:117
#4  0x00007f0ff1388977 in _IO_new_proc_open (fp=0x7f0eac0394b0, command=0x7f0eac055a38 "ls /etc/opt/*.conf | grep -v appconf.conf", mode=<optimized out>) at iopopen.c:191
#5  0x00007f0ff1388c69 in _IO_new_popen (command=0x7f0eac055a38 "ls /etc/opt/*.conf | grep -v appconf.conf", mode=0x7f0ff35e54d1 "r") at iopopen.c:312
#6  0x00007f0ff35de160 in io_popen () from /opt/...liblua.so
#7  0x00007f0ff35ce581 in luaD_precall () from /opt/.../liblua.so
#8  0x00007f0ff35d9239 in luaV_execute () from /opt/.../liblua.so
#9  0x00007f0ff35cea4d in luaD_call () from /opt/.../liblua.so
#10 0x00007f0ff35ce0f7 in luaD_rawrunprotected () from /opt/.../liblua.so
#11 0x00007f0ff35ce172 in luaD_pcall () from /opt/.../liblua.so
#12 0x00007f0ff35c9b91 in lua_pcall () from /opt/.../liblua.so
#13 0x00007f0ffd2551c0 in ...
#14 0x00007f0ffd252bb3 in ...
#15 0x00007f0ffd25c3aa in ...
#16 0x00007f0ffd2585df in ...
#17 0x00007f0ffcfa2d0c in ...
#18 0x00007f0ffcfa34c1 in ...
#19 0x00007f0ffcf6e4e5 in ...
#20 0x00007f0ffcf6f74e in ...
#21 0x00007f0ffcf5760e in ...
#22 0x00007f0ffcf5ab57 in ...
#23 0x00007f0ffcf5ae65 in ...
#24 0x0000000000408c4d in ...
#25 0x00007f0ff4ee0184 in g_thread_create_proxy (data=0x7f0fd4002cd0) at gthread.c:635
#26 0x00007f0ff319d7f1 in start_thread (arg=0x7f0f3fdfd700) at pthread_create.c:301
#27 0x00007f0ff140670d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Stuck in fork waiting for list_all_lock in genops.c:1299

(gdb) frame 2
#2  0x00007f0ff13945ff in *__GI__IO_list_lock () at genops.c:1299
1299      _IO_lock_lock (list_all_lock);
(gdb) print list_all_lock
$2 = {lock = 2, cnt = 1, owner = 0x7f0f3fafa700}

Thread 0x7f0f3fafa700 == gdb thread 65

Thread 65 (Thread 0x7f0f3fafa700 (LWP 14019)):
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00007f0ff1395287 in _L_lock_1239 () at genops.c:598
#2  0x00007f0ff1393f0b in _IO_flush_all_lockp (do_lock=1) at genops.c:840
#3  0x00007f0ff35de155 in io_popen () from /opt/.../liblua.so
#4  0x00007f0ff35ce581 in luaD_precall () from /opt/.../liblua.so
#5  0x00007f0ff35d9239 in luaV_execute () from /opt/.../liblua.so
#6  0x00007f0ff35cea4d in luaD_call () from /opt/.../liblua.so
#7  0x00007f0ff35ce0f7 in luaD_rawrunprotected () from /opt/.../liblua.so
#8  0x00007f0ff35ce172 in luaD_pcall () from /opt/.../liblua.so
#9  0x00007f0ff35c9b91 in lua_pcall () from /opt/.../liblua.so
#10 0x00007f0ffd2551c0 in ...
#11 0x00007f0ffd252bb3 in ...
#12 0x00007f0ffd25de88 in ...
#13 0x00007f0ffd25843f in ...
#14 0x00007f0ffcfc4d7d in ...
#15 0x00007f0ffcfc57ed in ...
#16 0x00007f0ffcf6d394 in ...
#17 0x00007f0ffcf6f74e in ...
#18 0x00007f0ffcf5760e in ...
#19 0x00007f0ffcf5ab57 in ...
#20 0x00007f0ffcf5ae65 in ...
#21 0x0000000000408c4d in ...
#22 0x00007f0ff4ee0184 in g_thread_create_proxy (data=0x7f0fd4003330) at gthread.c:635
#23 0x00007f0ff319d7f1 in start_thread (arg=0x7f0f3fafa700) at pthread_create.c:301
#24 0x00007f0ff140670d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Inspecting the code in frame 2, we see the following:


#ifdef _IO_MTSAFE_IO
  _IO_cleanup_region_start_noarg (flush_cleanup);
  if (do_lock)
    _IO_lock_lock (list_all_lock);
#endif

  last_stamp = _IO_list_all_stamp;
  fp = (_IO_FILE *) INTUSE(_IO_list_all);
  while (fp != NULL)
    {
      run_fp = fp;
      if (do_lock)
	_IO_flockfile (fp);

Before getting to _IO_flockfile (fp), the thread calls _IO_lock_lock (list_all_lock).  

(gdb) print list_all_lock
$4 = {lock = 2, cnt = 1, owner = 0x7f0f3fafa700}


This confirms that list_all_lock is taken by thread 65.

Now determine which thread owns fp's lock:

(gdb) print *fp
$6 = {_flags = -72539000, _IO_read_ptr = 0x0, _IO_read_end = 0x0, _IO_read_base = 0x0, _IO_write_base = 0x0, _IO_write_ptr = 0x0, _IO_write_end = 0x0, _IO_buf_base = 0x0, _IO_buf_end = 0x0, _IO_save_base = 0x0, 
  _IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0, _chain = 0x7f0eb0063a70, _fileno = 42, _flags2 = 0, _old_offset = 577, _cur_column = 0, _vtable_offset = 2 '\002', 
  _shortbuf =  <incomplete sequence \334>, _lock = 0x7f0edc05bf70, _offset = -1, _codecvt = 0x666e496e6f002d25, _wide_data = 0x7f0edc05bf80, _freeres_list = 0x0, _freeres_buf = 0x7f0edc026ca0, 
  _freeres_size = 139701797257336, _mode = 0, _unused2 = "\"byt%-%O:%M:%S%F\000\060\067\060"}

(gdb) print *(fp._lock)
$8 = {lock = 2, cnt = 1, owner = 0x7f0fac24f700}

Thread 0x7f0fac24f700 == gdb thread 71

Comment 1 Fedora Update System 2016-05-07 17:19:37 UTC
glibc-2.22-15.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-68abc0be35

Comment 2 Fedora Update System 2016-05-08 16:25:30 UTC
glibc-2.22-15.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-68abc0be35

Comment 3 Florian Weimer 2016-05-09 09:24:18 UTC
Fixed with these new 2.23 backports:

commit 888d9a0146b4b8364e065ab359eae5b3db5badb9
Author: Florian Weimer <fweimer>
Date:   Thu Apr 14 12:53:03 2016 +0200

    malloc: Add missing internal_function attributes on function definitions
    
    Fixes build on i386 after commit 29d794863cd6e03115d3670707cc873a9965ba92.
    
    (cherry picked from commit 186fe877f3df0b84d57dfbf0386f6332c6aa69bc)

commit 927170dd59787d9443e07eeb0b22329c4eff1530
Author: Florian Weimer <fweimer>
Date:   Thu Apr 14 09:18:30 2016 +0200

    malloc: Remove malloc hooks from fork handler
    
    The fork handler now runs so late that there is no risk anymore that
    other fork handlers in the same thread use malloc, so it is no
    longer necessary to install malloc hooks which made a subset
    of malloc functionality available to the thread that called fork.
    
    (cherry picked from commit 8a727af925be63aa6ea0f5f90e16751fd541626b)

commit 2a71cf409681b89ffb8892b35cac64de79b7adb8
Author: Florian Weimer <fweimer>
Date:   Thu Apr 14 09:17:02 2016 +0200

    malloc: Run fork handler as late as possible [BZ #19431]
    
    Previously, a thread M invoking fork would acquire locks in this order:
    
      (M1) malloc arena locks (in the registered fork handler)
      (M2) libio list lock
    
    A thread F invoking flush (NULL) would acquire locks in this order:
    
      (F1) libio list lock
      (F2) individual _IO_FILE locks
    
    A thread G running getdelim would use this order:
    
      (G1) _IO_FILE lock
      (G2) malloc arena lock
    
    After executing (M1), (F1), (G1), none of the threads can make progress.
    
    This commit changes the fork lock order to:
    
      (M'1) libio list lock
      (M'2) malloc arena locks
    
    It explicitly encodes the lock order in the implementations of fork,
    and does not rely on the registration order, thus avoiding the deadlock.
    
    (cherry picked from commit 29d794863cd6e03115d3670707cc873a9965ba92)

commit a5c2f42566460fc73755c768e8e1c59dbd5a4bb2
Author: Samuel Thibault <samuel.thibault>
Date:   Tue Mar 22 09:58:48 2016 +0100

    Fix malloc threaded tests link on non-Linux
    
        * malloc/Makefile ($(objpfx)tst-malloc-backtrace,
        $(objpfx)tst-malloc-thread-exit, $(objpfx)tst-malloc-thread-fail): Use
        $(shared-thread-library) instead of hardcoding the path to libpthread.
    
    (cherry picked from commit b87e41378beca3c98ec3464d64835e66cc788497)

commit f69ae17e843b00d3495b736f4381c1fa64dc02bc
Author: Florian Weimer <fweimer>
Date:   Fri Feb 19 17:07:45 2016 +0100

    malloc: Remove NO_THREADS
    
    No functional change.  It was not possible to build without
    threading support before.
    
    (cherry picked from commit 59eda029a8a35e5f4e5cd7be0f84c6629e48ec6e)

Comment 4 Fedora Update System 2016-05-09 14:56:51 UTC
glibc-2.23.1-6.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-b321728d74

Comment 5 Fedora Update System 2016-05-10 17:57:36 UTC
glibc-2.22-15.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 6 Fedora Update System 2016-05-10 20:30:11 UTC
glibc-2.23.1-6.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-b321728d74

Comment 7 Fedora Update System 2016-05-11 13:55:14 UTC
glibc-2.23.1-7.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-b321728d74

Comment 8 Fedora Update System 2016-05-12 09:44:06 UTC
glibc-2.23.1-7.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-b321728d74

Comment 9 Fedora Update System 2016-05-14 23:30:02 UTC
glibc-2.23.1-7.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

