Bug 1400347 - glibc: crash in mpfr due to incorrect result of dynamic TLS resolution (aarch64)
Summary: glibc: crash in mpfr due to incorrect result of dynamic TLS resolution (aarch64)
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: aarch64
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Florian Weimer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
: 1400348 (view as bug list)
Depends On:
Blocks: ARMTracker
TreeView+ depends on / blocked
 
Reported: 2016-11-30 23:37 UTC by Rex Dieter
Modified: 2016-12-02 17:17 UTC (History)
14 users (show)

(edit)
Clone Of:
(edit)
Last Closed: 2016-12-02 17:17:50 UTC


Attachments (Terms of Use)

Description Rex Dieter 2016-11-30 23:37:29 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=16685941

make[2]: Entering directory '/builddir/build/BUILD/automoc4-0.9.88/aarch64-redhat-linux-gnu'
[ 50%] Building CXX object CMakeFiles/automoc4.dir/kde4automoc.cpp.o
/usr/bin/c++   -DQT_CORE_LIB -DQT_NO_DEBUG -I/builddir/build/BUILD/automoc4-0.9.88/aarch64-redhat-linux-gnu -I/builddir/build/BUILD/automoc4-0.9.88 -isystem /usr/include/QtCore  -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-check-new -fno-common -O2 -DNDEBUG   -o CMakeFiles/automoc4.dir/kde4automoc.cpp.o -c /builddir/build/BUILD/automoc4-0.9.88/kde4automoc.cpp
In file included from /usr/include/QtCore/qnamespace.h:45:0,
                 from /usr/include/QtCore/qobjectdefs.h:45,
                 from /usr/include/QtCore/qobject.h:47,
                 from /usr/include/QtCore/qcoreapplication.h:45,
                 from /usr/include/QtCore/QCoreApplication:1,
                 from /builddir/build/BUILD/automoc4-0.9.88/kde4automoc.cpp:26:
/usr/include/QtCore/qglobal.h:1315:1: internal compiler error: Bus error
 { return d >= qreal(0.0) ? int(d + qreal(0.5)) : int(d - int(d-1) + qreal(0.5)) + int(d-1); }
 ^
/usr/include/QtCore/qglobal.h:1315:1: internal compiler error: Segmentation fault
c++: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.

Comment 2 Florian Weimer 2016-12-01 09:00:22 UTC
*** Bug 1400348 has been marked as a duplicate of this bug. ***

Comment 3 Florian Weimer 2016-12-01 12:24:39 UTC
Triggered by glibc-aarch64-tls-fixes.patch:

diff --git a/sysdeps/aarch64/dl-machine.h b/sysdeps/aarch64/dl-machine.h
index 997c860..50b37b0 100644
--- a/sysdeps/aarch64/dl-machine.h
+++ b/sysdeps/aarch64/dl-machine.h
@@ -295,7 +295,7 @@ elf_machine_rela (struct link_map *map, const ElfW(Rela) *reloc,
 # ifndef SHARED
                CHECK_STATIC_TLS (map, sym_map);
 # else
-               if (!TRY_STATIC_TLS (map, sym_map))
+               if (1)
                  {
                    td->arg = _dl_make_tlsdesc_dynamic
                      (sym_map, sym->st_value + reloc->r_addend);

I'm not sure yet if this change makes it easier to trigger the bug, or if it is the root cause.

Comment 4 Florian Weimer 2016-12-01 13:54:15 UTC
Introduced upstream by:

commit 389d1f1b232b3d6b9d73ee2c50e543ace6675621
Author: Steve Ellcey <sellcey@caviumnetworks.com>
Date:   Mon Nov 28 09:01:23 2016 -0800

    Partial ILP32 support for aarch64.

elf/dl-tlsdesc.os changes this way:

 0000000000000030 <_dl_tlsdesc_dynamic>:
   30:  a9bc7bfd        stp     x29, x30, [sp,#-64]!
   34:  910003fd        mov     x29, sp
   38:  a9020be1        stp     x1, x2, [sp,#32]
   3c:  a90313e3        stp     x3, x4, [sp,#48]
   40:  d53bd044        mrs     x4, tpidr_el0
   44:  c8dffc1f        ldar    xzr, [x0]
   48:  f9400401        ldr     x1, [x0,#8]
   4c:  f9400080        ldr     x0, [x4]
   50:  f9400823        ldr     x3, [x1,#16]
   54:  f9400002        ldr     x2, [x0]
   58:  eb02007f        cmp     x3, x2
   5c:  540001a8        b.hi    90 <_dl_tlsdesc_dynamic+0x60>
   60:  f9400022        ldr     x2, [x1]
   64:  8b021000        add     x0, x0, x2, lsl #4
   68:  f9400000        ldr     x0, [x0]
   6c:  b100041f        cmn     x0, #0x1
   70:  54000100        b.eq    90 <_dl_tlsdesc_dynamic+0x60>
-  74:  f9400421        ldr     x1, [x1,#8]
+  74:  f9400821        ldr     x1, [x1,#16]
   78:  8b010000        add     x0, x0, x1
   7c:  cb040000        sub     x0, x0, x4
   80:  a9420be1        ldp     x1, x2, [sp,#32]
   84:  a94313e3        ldp     x3, x4, [sp,#48]
   88:  a8c47bfd        ldp     x29, x30, [sp],#64
   8c:  d65f03c0        ret
   90:  a9b91be5        stp     x5, x6, [sp,#-112]!
   94:  a90123e7        stp     x7, x8, [sp,#16]
   98:  a9022be9        stp     x9, x10, [sp,#32]
   9c:  a90333eb        stp     x11, x12, [sp,#48]
   a0:  a9043bed        stp     x13, x14, [sp,#64]
   a4:  a90543ef        stp     x15, x16, [sp,#80]
   a8:  a9064bf1        stp     x17, x18, [sp,#96]
   ac:  adb007e0        stp     q0, q1, [sp,#-512]!
   b0:  ad010fe2        stp     q2, q3, [sp,#32]
   b4:  ad0217e4        stp     q4, q5, [sp,#64]
   b8:  ad031fe6        stp     q6, q7, [sp,#96]
   bc:  ad0427e8        stp     q8, q9, [sp,#128]
   c0:  ad052fea        stp     q10, q11, [sp,#160]
   c4:  ad0637ec        stp     q12, q13, [sp,#192]


The cause is a faulty hunk in the change to sysdeps/aarch64/dl-tlsdesc.S:

-	ldr	x1, [x1,#8]
-	add	x0, x0, x1
-	sub	x0, x0, x4
+	ldr	PTR_REG (1), [x1,#(PTR_SIZE * 2)]
+	add	PTR_REG (0), PTR_REG (0), PTR_REG (1)
+	sub	PTR_REG (0), PTR_REG (0), PTR_REG (4)

“#(PTR_SIZE * 2)” should be “#PTR_SIZE”.

Comment 5 Carlos O'Donell 2016-12-01 16:49:55 UTC
(In reply to Florian Weimer from comment #3)
> Triggered by glibc-aarch64-tls-fixes.patch:
> 
> diff --git a/sysdeps/aarch64/dl-machine.h b/sysdeps/aarch64/dl-machine.h
> index 997c860..50b37b0 100644
> --- a/sysdeps/aarch64/dl-machine.h
> +++ b/sysdeps/aarch64/dl-machine.h
> @@ -295,7 +295,7 @@ elf_machine_rela (struct link_map *map, const ElfW(Rela)
> *reloc,
>  # ifndef SHARED
>                 CHECK_STATIC_TLS (map, sym_map);
>  # else
> -               if (!TRY_STATIC_TLS (map, sym_map))
> +               if (1)
>                   {
>                     td->arg = _dl_make_tlsdesc_dynamic
>                       (sym_map, sym->st_value + reloc->r_addend);
> 
> I'm not sure yet if this change makes it easier to trigger the bug, or if it
> is the root cause.

This patch should be removed. I'm surprised it wasn't removed by me when we merged in Alex's changes to the DTV resize handling.

The use of TLS descriptors by AArch64 means that during relocation the dynamic loader has option to use any remaining static TLS image space _or_ use the dynamic space allocated by malloc (attached to the link map via tlsdesc_table).

Avoiding the use of the TRY_STATIC_TLS macro is intentional because it forces any dynamically loaded module to _try_ to be loaded into static TLS. Despite the actual source-level comments _dl_try_allocate_static_tls returns -1 when it fails to move the module into static TLS.

One of the failures we were commonly seeing with AArch64, because of this behaviour, is that we would run into conservative heuristics for expanding the DTV to handle new modules that contained static TLS. Instead the code would simply fail the load, and that was bad, because we should try harder to fit any such modules into all remaining static TLS space.

However, Alex's changes changed the heuristics to accurately try to fit as many dlopen'd modules as possible into the static TLS, expanding the DTV as required, but we failed to remove this patch which makes TLS descriptors use residual static TLS space again.

We should remove this patch in Rawhide.

Comment 6 Carlos O'Donell 2016-12-01 16:56:03 UTC
(In reply to Carlos O'Donell from comment #5)
> (In reply to Florian Weimer from comment #3)
> > Triggered by glibc-aarch64-tls-fixes.patch:

It might be that in the general case there is _always_ enough residual space in the static TLS block to account for all TLS descriptors and therefore we have never had to use _dl_tlsdesc_dynamic.

We need a test case that exercises TLS by creating thousands of such variables such that we trigger the fallback path to dynamic handling of the descriptor.

Comment 7 Florian Weimer 2016-12-02 09:00:52 UTC
(In reply to Carlos O'Donell from comment #6)
> (In reply to Carlos O'Donell from comment #5)
> > (In reply to Florian Weimer from comment #3)
> > > Triggered by glibc-aarch64-tls-fixes.patch:
> 
> It might be that in the general case there is _always_ enough residual space
> in the static TLS block to account for all TLS descriptors and therefore we
> have never had to use _dl_tlsdesc_dynamic.
> 
> We need a test case that exercises TLS by creating thousands of such
> variables such that we trigger the fallback path to dynamic handling of the
> descriptor.

A test case appears to be difficult.  Many global-dynamic __thread variables (10,000, to be precise) do not trigger the bug.  The test case even verifies that the variables are separately allocated.  This could be do the way the dl-minimal.c allocator works.

Comment 8 Florian Weimer 2016-12-02 11:26:31 UTC
I have found a reproducer which works with upstream glibc.

Comment 9 Florian Weimer 2016-12-02 17:17:50 UTC
The recent regression is fixed in rawhide.

aarch64 TLS is still broken upstream:

https://sourceware.org/bugzilla/show_bug.cgi?id=20915

But the faulty commit has been in rawhide for a while, so an immediate fix is not needed.  Marked as a release blocker upstream, so it will hopefully be fixed before the next Fedora release.


Note You need to log in before you can comment on or make changes to this bug.