Bug 1909920 - glibc: Linking the main program with jemalloc causes sysconf to deadlock in audit mode
Summary: glibc: Linking the main program with jemalloc causes sysconf to deadlock in audit mode
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Siddhesh Poyarekar
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-22 03:22 UTC by Siddhesh Poyarekar
Modified: 2021-06-16 05:37 UTC
CC List: 13 users

Fixed In Version: glibc-2.33.9000-9.fc35.x86_64
Clone Of: 1878932
Environment:
Last Closed: 2021-06-16 05:37:40 UTC
Type: Bug
Embargoed:


Description Siddhesh Poyarekar 2020-12-22 03:22:35 UTC
On Fedora 33 with the system jemalloc, a simple C program just deadlocks:

(gdb) bt
#0  __lll_lock_wait (futex=0x7ffff76032a8, private=0) at lowlevellock.c:52
#1  0x00007ffff7967763 in __GI___pthread_mutex_lock (mutex=0x7ffff76032a8)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007ffff7ba0475 in je_malloc_mutex_lock_slow ()
   from /lib64/libjemalloc.so.2
#3  0x00007ffff7bbeef8 in extent_recycle.isra ()
   from /lib64/libjemalloc.so.2
#4  0x00007ffff7b6c4d7 in arena_bin_malloc_hard.lto_priv ()
   from /lib64/libjemalloc.so.2
#5  0x00007ffff7bc65ed in je_arena_tcache_fill_small.constprop ()
   from /lib64/libjemalloc.so.2
#6  0x00007ffff7b5d558 in je_malloc_default ()
   from /lib64/libjemalloc.so.2
#7  0x00007ffff79fa8e4 in __GI__IO_file_doallocate (
    fp=0x7ffff7b4b520 <_IO_2_1_stdout_>) at filedoalloc.c:101
#8  0x00007ffff7a092a0 in __GI__IO_doallocbuf (
    fp=0x7ffff7b4b520 <_IO_2_1_stdout_>) at libioP.h:948
#9  __GI__IO_doallocbuf (fp=0x7ffff7b4b520 <_IO_2_1_stdout_>)
    at genops.c:342
#10 0x00007ffff7a08438 in _IO_new_file_overflow (
    f=0x7ffff7b4b520 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#11 0x00007ffff7a074e6 in _IO_new_file_xsputn (n=4, data=<optimized out>, 
    f=<optimized out>) at libioP.h:948
#12 _IO_new_file_xsputn (f=0x7ffff7b4b520 <_IO_2_1_stdout_>, 
    data=<optimized out>, n=4) at fileops.c:1197
#13 0x00007ffff79f2219 in outstring_func (done=0, length=<optimized out>, 
    string=<optimized out>, s=0x7ffff7b4b520 <_IO_2_1_stdout_>)
    at ../libio/libioP.h:948
#14 __vfprintf_internal (s=0x7ffff7b4b520 <_IO_2_1_stdout_>, 
    format=0x402010 "%ld\n", ap=0x7fffffffdcc0, mode_flags=0)
    at vfprintf-internal.c:1646
#15 0x00007ffff79de4af in __printf (format=<optimized out>) at printf.c:33
#16 0x0000000000401156 in main ()

C sources:

#include <unistd.h>
#include <stdio.h>

int
main (void)
{
  printf ("%ld\n", sysconf (_SC_PAGESIZE));
}

Debugging this is difficult because of the lack of audit namespace support in GDB.

--- Additional comment from Florian Weimer on 2020-09-15 13:34:02 UTC ---

Sorry, forgot to mention that jemalloc is linked with -ljemalloc (no LD_PRELOAD).

--- Additional comment from Siddhesh Poyarekar on 2020-10-28 02:12:43 UTC ---

The full command with upstream glibc to reproduce the deadlock in comment 3:

env LD_AUDIT=./libaudit.so \
    GLIBC_TUNABLES=glibc.rtld.optional_static_tls=5120 \
    $builddir/elf/ld.so \
    --library-path $builddir:$builddir/elf:$builddir/nptl \
    ./jemalloc

--- Additional comment from Siddhesh Poyarekar on 2020-12-11 14:19:39 UTC ---

I spent some time debugging this today, and the root cause is that jemalloc, when linked in directly, comes into use before pthreads is initialized.  libc.so has forwarder symbols for pthread_mutex_lock and pthread_mutex_unlock to handle that case: those operations are no-ops until the pthreads subsystem is initialized.

The twist in the plot is that jemalloc uses *pthread_mutex_trylock*, which has no forwarder in libc.so and therefore really acquires the lock.  The paired pthread_mutex_unlock is still a no-op because it is called too early, so the mutex is left locked, setting the stage for the deadlock we see.
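The mismatch can be modeled in a few lines of C. This is a hypothetical sketch, not glibc or jemalloc code: model_mutex, pthreads_ready, and the real_*/fwd_* helpers are invented for illustration, and a plain flag stands in for the real lock word. trylock has no forwarder and really takes the lock, while the early unlock goes through a no-op forwarder, so the lock is still held once pthreads is up.

```c
/* Hypothetical model of the early-startup mutex mismatch; not glibc or
   jemalloc code.  A plain flag stands in for the real lock word.  */
#include <errno.h>
#include <stdbool.h>

struct model_mutex
{
  bool locked;
};

/* Stands in for "the pthreads subsystem has been initialized".  */
static bool pthreads_ready = false;

/* Real implementation: always acts on the lock.  */
static int
real_trylock (struct model_mutex *m)
{
  if (m->locked)
    return EBUSY;
  m->locked = true;
  return 0;
}

static void
real_unlock (struct model_mutex *m)
{
  m->locked = false;
}

/* libc.so-style forwarder: a no-op until pthreads is initialized.  */
static void
fwd_unlock (struct model_mutex *m)
{
  if (pthreads_ready)
    real_unlock (m);
}

/* Replays the early-startup sequence and returns what a trylock sees
   once pthreads is up.  EBUSY means a blocking lock would never return.  */
int
replay_early_trylock (void)
{
  struct model_mutex m = { false };
  pthreads_ready = false;

  real_trylock (&m);   /* No forwarder: the lock is really taken.  */
  fwd_unlock (&m);     /* Forwarder: silently does nothing.  */

  pthreads_ready = true;      /* pthreads initialization completes.  */
  return real_trylock (&m);   /* Still held, so this reports EBUSY.  */
}
```

replay_early_trylock () returns EBUSY: the "unlocked" mutex is still marked as held, so a blocking lock on it at that point would wait forever, which is the shape of the stuck pthread_mutex_lock call in the backtrace.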

It's straightforward to fix this by adding a forwarder for pthread_mutex_trylock in libc.so, but after discussing it with Florian, we agreed to move all of the mutex functions into libc.so instead, since that's something we want to do anyway.
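In the same hypothetical model (invented names, not glibc code), adding a trylock forwarder, the straightforward fix mentioned in the comment, makes the early trylock and unlock agree: both are no-ops, so no stale lock survives pthreads initialization.

```c
/* Hypothetical model of the trylock-forwarder fix; not glibc code.  */
#include <errno.h>
#include <stdbool.h>

struct model_mutex
{
  bool locked;
};

static bool pthreads_ready = false;

static int
real_trylock (struct model_mutex *m)
{
  if (m->locked)
    return EBUSY;
  m->locked = true;
  return 0;
}

static void
real_unlock (struct model_mutex *m)
{
  m->locked = false;
}

/* Forwarders: report success without touching the lock until pthreads
   is initialized, then call the real implementation.  */
static int
fwd_trylock (struct model_mutex *m)
{
  return pthreads_ready ? real_trylock (m) : 0;
}

static void
fwd_unlock (struct model_mutex *m)
{
  if (pthreads_ready)
    real_unlock (m);
}

int
replay_with_trylock_forwarder (void)
{
  struct model_mutex m = { false };
  pthreads_ready = false;

  fwd_trylock (&m);   /* Now a no-op as well.  */
  fwd_unlock (&m);    /* Still a no-op: the two calls agree.  */

  pthreads_ready = true;
  return real_trylock (&m);   /* Lock is free, so this returns 0.  */
}
```

Here the final trylock succeeds. Moving the mutex functions into libc.so, the approach actually chosen, goes further: there is then a single implementation, so there is no forwarder/real split for the lock and unlock paths to disagree about.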

Comment 1 Fedora Program Management 2021-04-29 16:45:02 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 32 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 2 Siddhesh Poyarekar 2021-06-01 13:45:44 UTC
This should be fixed in rawhide.  I need to verify it and close this as done.

Comment 3 Siddhesh Poyarekar 2021-06-16 05:37:40 UTC
Fix has been pushed to rawhide.

