1878932 – glibc: Linking the main program with jemalloc causes sysconf to crash in audit mode

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	34
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Siddhesh Poyarekar
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-15 00:06 UTC by Josh Stone
Modified:	2021-06-16 05:41 UTC
CC List:	10 users
Fixed In Version:	glibc-2.33.9000-9.fc35.x86_64
Clone Of:
Clones:	1909920
Environment:
Last Closed:	2021-06-16 05:41:17 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Josh Stone 2020-09-15 00:06:44 UTC

Description of problem:
A simple LD_AUDIT library, just la_version, causes a SIGSEGV on upstream rust binaries as of 1.46.0.

Version-Release number of selected component (if applicable):
glibc-2.31-4.fc32.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Download and extract rustc 1.46.0:
$ wget https://static.rust-lang.org/dist/2020-08-27/rustc-1.46.0-unknown-linux-gnu.tar.xz
$ tar xf rustc-1.46.0-unknown-linux-gnu.tar.xz

2. Build a trivial audit library:
$ echo 'unsigned la_version(unsigned version) { return version; }' >audit.c
$ gcc -fPIC -shared audit.c -o libaudit.so

3. Try to audit rustc:
$ LD_AUDIT=./libaudit.so ./rustc-1.46.0-x86_64-unknown-linux-gnu/rustc/bin/rustc -V
Segmentation fault (core dumped)


Additional info:
GDB shows all "??" in its backtrace, but "coredumpctl info" managed:

    Stack trace of thread 8652:
    #0  0x00007ffbdb675fcd get_nprocs (libc.so.6 + 0xfefcd)
    #1  0x00007ffbdb647291 __sysconf (libc.so.6 + 0xd0291)
    #2  0x0000559f8209550d n/a (/tmp/rustc-1.46.0-x86_64-unknown-linux-gnu/rustc/bin/rustc + 0x850d)

Several past versions of upstream rustc that I sampled were similarly affected.
However, 1.47.0-beta.3 and 1.48.0-nightly (7402a3944 2020-09-13) run fine!

The Fedora build of rustc 1.46.0 is not affected.

Comment 1 Florian Weimer 2020-09-15 06:12:39 UTC

(In reply to Josh Stone from comment #0)
> 1. Download and extract rustc 1.46.0:
> $ wget https://static.rust-lang.org/dist/2020-08-27/rustc-1.46.0-unknown-linux-gnu.tar.xz

This URL is 404 for me. Do you have a replacement?

Comment 2 Florian Weimer 2020-09-15 06:31:21 UTC

https://static.rust-lang.org/dist/rust-1.46.0-x86_64-unknown-linux-gnu.tar.gz worked and reproduces the issue.

This appears to be related to jemalloc.

Comment 3 Florian Weimer 2020-09-15 13:31:39 UTC

On Fedora 33 with the system jemalloc, a simple C program just deadlocks:

(gdb) bt
#0  __lll_lock_wait (futex=0x7ffff76032a8, private=0) at lowlevellock.c:52
#1  0x00007ffff7967763 in __GI___pthread_mutex_lock (mutex=0x7ffff76032a8)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007ffff7ba0475 in je_malloc_mutex_lock_slow ()
   from /lib64/libjemalloc.so.2
#3  0x00007ffff7bbeef8 in extent_recycle.isra ()
   from /lib64/libjemalloc.so.2
#4  0x00007ffff7b6c4d7 in arena_bin_malloc_hard.lto_priv ()
   from /lib64/libjemalloc.so.2
#5  0x00007ffff7bc65ed in je_arena_tcache_fill_small.constprop ()
   from /lib64/libjemalloc.so.2
#6  0x00007ffff7b5d558 in je_malloc_default ()
   from /lib64/libjemalloc.so.2
#7  0x00007ffff79fa8e4 in __GI__IO_file_doallocate (
    fp=0x7ffff7b4b520 <_IO_2_1_stdout_>) at filedoalloc.c:101
#8  0x00007ffff7a092a0 in __GI__IO_doallocbuf (
    fp=0x7ffff7b4b520 <_IO_2_1_stdout_>) at libioP.h:948
#9  __GI__IO_doallocbuf (fp=0x7ffff7b4b520 <_IO_2_1_stdout_>)
    at genops.c:342
#10 0x00007ffff7a08438 in _IO_new_file_overflow (
    f=0x7ffff7b4b520 <_IO_2_1_stdout_>, ch=-1) at fileops.c:745
#11 0x00007ffff7a074e6 in _IO_new_file_xsputn (n=4, data=<optimized out>, 
    f=<optimized out>) at libioP.h:948
#12 _IO_new_file_xsputn (f=0x7ffff7b4b520 <_IO_2_1_stdout_>, 
    data=<optimized out>, n=4) at fileops.c:1197
#13 0x00007ffff79f2219 in outstring_func (done=0, length=<optimized out>, 
    string=<optimized out>, s=0x7ffff7b4b520 <_IO_2_1_stdout_>)
    at ../libio/libioP.h:948
#14 __vfprintf_internal (s=0x7ffff7b4b520 <_IO_2_1_stdout_>, 
    format=0x402010 "%ld\n", ap=0x7fffffffdcc0, mode_flags=0)
    at vfprintf-internal.c:1646
#15 0x00007ffff79de4af in __printf (format=<optimized out>) at printf.c:33
#16 0x0000000000401156 in main ()

C sources:

#include <unistd.h>
#include <stdio.h>

int
main (void)
{
  printf ("%ld\n", sysconf (_SC_PAGESIZE));
}

Debugging this is difficult because of the lack of audit namespace support in GDB.

Comment 4 Florian Weimer 2020-09-15 13:34:02 UTC

Sorry, forgot to mention that jemalloc is linked with -ljemalloc (no LD_PRELOAD).

Comment 5 Josh Stone 2020-09-15 17:16:38 UTC

(In reply to Florian Weimer from comment #1)
> This URL is 404 for me. Do you have a replacement?

Sorry about that -- somehow I omitted "x86_64" from the triple in the filename.
https://static.rust-lang.org/dist/2020-08-27/rustc-1.46.0-x86_64-unknown-linux-gnu.tar.xz

The URL you picked from the dist root is also fine, should be the same thing.

Comment 7 Siddhesh Poyarekar 2020-10-27 10:08:03 UTC

This seems to be a build issue with rust; the only builds that I could get to fail like this were the pre-built 1.46.0 and older.  I tried the following and all of them worked just fine:

1. rustc built off the 1.46.0, 1.47.0 and 1.48.0 tags as well as the master branch without jemalloc
2. rustc built off the 1.46.0, 1.47.0 and 1.48.0 tags as well as the master branch with jemalloc
3. rustc 1.47.0 prebuilt

Something seems to have changed in the upstream build environment to cause that change.

Comment 8 Florian Weimer 2020-10-27 10:11:41 UTC

(In reply to Siddhesh Poyarekar from comment #7)
> This seems to be a build issue with rust; the only builds that I could get
> to fail like this were the pre-built 1.46.0 and older.

Yes, it's related to linking with jemalloc, see comment 3 and comment 4.

Comment 10 Siddhesh Poyarekar 2020-10-28 02:12:43 UTC

The full command with upstream glibc to reproduce the deadlock in comment 3:

env LD_AUDIT=./libaudit.so \
    GLIBC_TUNABLES=glibc.rtld.optional_static_tls=5120 \
    $builddir/elf/ld.so \
    --library-path $builddir:$builddir/elf:$builddir/nptl \
    ./jemalloc

Comment 11 Siddhesh Poyarekar 2020-12-11 14:19:39 UTC

I spent some time debugging this today and the root cause is that jemalloc, when linked in directly, comes into use before pthreads are initialized.  libc.so has symbols for pthread_mutex_lock and pthread_mutex_unlock to take care of that, wherein those operations become nops until the pthreads subsystem is initialized.

The twist in the plot is that jemalloc uses *pthread_mutex_trylock*, which does not have a forwarder in libc.so and actually sets the lock primitive.  Its paired pthread_mutex_unlock is still a nop because it's still too early, thus setting the stage for the deadlock we see.

It's straightforward to fix this by adding a forwarder for pthread_mutex_trylock in libc.so, but on discussion with Florian, we agreed to move all of the mutex functions into libc.so instead, since that's something we want to do anyway.
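
The lock/unlock mismatch described above can be imitated in a small, self-contained sketch. This is only an illustration under stated assumptions, not glibc's code: `threads_ready`, `forwarded_unlock` and `early_trylock_pair` are hypothetical names, and the no-op forwarder is simulated by hand. Only the `pthread_mutex_*` calls are real.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flag standing in for "the pthreads subsystem is initialized". */
bool threads_ready = false;

/* Hypothetical stand-in for the early forwarder: does nothing until
   threads_ready is set, then calls the real pthread_mutex_unlock. */
int forwarded_unlock(pthread_mutex_t *m)
{
    if (!threads_ready)
        return 0;                       /* too early: silently ignored */
    return pthread_mutex_unlock(m);
}

/* Takes the lock with the real trylock, "releases" it through the no-op
   forwarder, then tries again.  Returns the result of the second trylock. */
int early_trylock_pair(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int first, second;

    threads_ready = false;
    first = pthread_mutex_trylock(&m);  /* really sets the lock word */
    assert(first == 0);
    forwarded_unlock(&m);               /* no-op, so the mutex stays held */
    second = pthread_mutex_trylock(&m); /* finds the mutex still locked */

    /* Clean up: release for real so the mutex can be destroyed. */
    threads_ready = true;
    forwarded_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

Calling `early_trylock_pair()` returns `EBUSY` for the second attempt. A blocking `pthread_mutex_lock` at that point would wait forever, which matches the shape of the backtrace in comment 3. Depending on the glibc version, building this may need `-pthread`.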

Comment 12 Josh Stone 2020-12-11 17:26:56 UTC

The original issue had SIGSEGV, not a deadlock -- are you confident that it's the same root cause?

Comment 13 Siddhesh Poyarekar 2020-12-13 15:19:54 UTC

(In reply to Josh Stone from comment #12)
> The original issue had SIGSEGV, not a deadlock -- are you confident that
> it's the same root cause?

Sorry I forgot about the original report :/  I'm pretty sure that it's not the same root cause - your report looks build environment induced - but I'll take a closer look to be 100% sure after which I'll file a separate bug for the deadlock.

Comment 14 Siddhesh Poyarekar 2020-12-22 03:28:45 UTC

Confirmed it is a distinct issue from the deadlock, although the underlying cause is the same - jemalloc getting initialized before libc.so is.  This time it is isspace() that causes problems, trying to read into TLS (__libc_tsd_CTYPE_B) when it has not yet been set up.  I've taken this bug back on me and have cloned a separate bug for the deadlock that Florian is working on fixing by moving pthread functions to libc.so.

Now to figure out a way to make a minimal reproducer out of this because I still can't get a built rust to behave the same way even with jemalloc enabled.
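
For illustration only: the dependency on locale tables is what makes isspace() unsafe this early. A table-free ASCII check like the sketch below touches no TLS and no locale data. The helper names `ascii_isspace` and `skip_ascii_space` are hypothetical, and this is not what glibc or jemalloc actually do; it only shows what "not depending on the ctype tables" looks like.

```c
#include <stdbool.h>

/* Hypothetical helper: true for the six whitespace characters of the
   "C" locale (space, \t, \n, \v, \f, \r) without consulting any table. */
bool ascii_isspace(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical helper: returns a pointer to the first non-whitespace
   character of s (or to its terminating NUL). */
const char *skip_ascii_space(const char *s)
{
    while (ascii_isspace((unsigned char)*s))
        s++;
    return s;
}
```

Unlike isspace(), this ignores the current locale, which is acceptable when parsing fixed-format ASCII input but not a general replacement.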

Comment 15 Siddhesh Poyarekar 2020-12-22 07:02:41 UTC

So this is what the jemalloc code says:

---
static unsigned
malloc_ncpus(void) {
        long result;

#ifdef _WIN32
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        result = si.dwNumberOfProcessors;
#elif defined(JEMALLOC_GLIBC_MALLOC_HOOK) && defined(CPU_COUNT)
        /*
         * glibc >= 2.6 has the CPU_COUNT macro.
         *
         * glibc's sysconf() uses isspace().  glibc allocates for the first time
         * *before* setting up the isspace tables.  Therefore we need a
         * different method to get the number of CPUs.
         */
        {
                cpu_set_t set;

                pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
                result = CPU_COUNT(&set);
        }
#else
        result = sysconf(_SC_NPROCESSORS_ONLN);
#endif
        return ((result == -1) ? 1 : (unsigned)result);
}
---

so the call to sysconf technically should not get used when building for Linux with glibc.  However it's evident from the binary that sysconf is called, so the build environment somehow makes this happen.  Josh, would you be able to track down how 1.46.0 got configured and built?  Given that 1.47 and later don't have this problem, it seems like whoever did the 1.46.0 build, did so in an environment that either did not have CPU_COUNT or ended up with JEMALLOC_GLIBC_MALLOC_HOOK undefined.

Comment 17 Josh Stone 2021-01-05 18:02:32 UTC

(In reply to Siddhesh Poyarekar from comment #15)
> So this is what the jemalloc code says:
...
>          * glibc >= 2.6 has the CPU_COUNT macro.                            
...
> so the call to sysconf technically should not get used when building for
> Linux with glibc.  However it's evident from the binary that sysconf is
> called, so the build environment somehow makes this happen.  Josh, would you
> be able to track down how 1.46.0 got configured and built?  Given that 1.47
> and later don't have this problem, it seems like whoever did the 1.46.0
> build, did so in an environment that either did not have CPU_COUNT or ended
> up with JEMALLOC_GLIBC_MALLOC_HOOK undefined.

Aha! Up to 1.46, Rust upstream was using a CentOS 5 environment for its minimum compatibility, which had glibc 2.5.
Rust 1.47 updated that baseline to glibc 2.11: https://github.com/rust-lang/rust/pull/74163/
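
A minimal sketch of how the headers decide which branch gets compiled, assuming a Linux/glibc target. It is a simplification of the jemalloc snippet quoted in comment 15: it keys only on `CPU_COUNT` (the real code also requires `JEMALLOC_GLIBC_MALLOC_HOOK`), it uses `sched_getaffinity` in place of `pthread_getaffinity_np`, and the names `count_cpus` and `count_cpus_method` are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for cpu_set_t and CPU_COUNT */
#endif
#include <sched.h>
#include <unistd.h>

/* Reports which branch the preprocessor selected at build time. */
const char *count_cpus_method(void)
{
#if defined(CPU_COUNT)
    return "affinity";         /* headers provide CPU_COUNT (glibc >= 2.6) */
#else
    return "sysconf";          /* older headers, e.g. glibc 2.5 */
#endif
}

/* Returns the CPU count, falling back to 1 on any failure. */
unsigned count_cpus(void)
{
    long result;

#if defined(CPU_COUNT)
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        result = CPU_COUNT(&set);
    else
        result = -1;
#else
    /* The path that an old build environment silently selects. */
    result = sysconf(_SC_NPROCESSORS_ONLN);
#endif
    return (result < 1) ? 1u : (unsigned)result;
}
```

The choice is made when the binary is compiled, not when it runs, so a binary built against glibc 2.5 headers keeps calling sysconf() even on a system with a much newer glibc.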

Comment 18 Siddhesh Poyarekar 2021-01-06 03:57:46 UTC

Perfect, that mystery is solved then :)

We'd still like to explore options to make the auditor more robust, so I'll keep this open until we file an upstream bug.

Comment 19 Fedora Program Management 2021-04-29 16:37:47 UTC

This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 20 Siddhesh Poyarekar 2021-06-16 05:41:17 UTC

This appears to have been fixed in rawhide too, probably with the same fix as bug 1909920.
