Bug 1398716

Summary: glibc: ld.so relocation processing in forward dependency order breaks IFUNC resolvers
Product: Red Hat Enterprise Linux 6 Reporter: Patrik Kis <pkis>
Component: glibcAssignee: glibc team <glibc-bugzilla>
Status: CLOSED WONTFIX QA Contact: qe-baseos-tools-bugs
Severity: medium Docs Contact:
Priority: high    
Version: 6.9CC: aheverle, ashankar, codonell, dlavu, fweimer, grajaiya, jhrozek, jistone, kcleveng, lslebodn, mkosek, mnewsome, mzidek, pbrezina, pfrankli, pkis, sbose
Target Milestone: rcKeywords: Patch
Target Release: ---   
Hardware: ppc64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-18 15:37:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
ld debug of sssd start none

Description Patrik Kis 2016-11-25 16:34:20 UTC
Description of problem:
Could you please help investigate the following case?

The issue is that sssd does not start when ipa-client-install is executed just after ppc64 machine install. The reason why it does not start is that sssd_pac crashes. Abrt caught the core dump. I also managed to start sssd with debug bug there is nothing for sssd_pac.

The issue happen only on ppc64 machines, however I'm not entirely sure it's architecture specific, the core dump suggests it's again time sync related. It does not happen when the ipa-client-install is executed after some time (after an hour or two it is always passing), that also supports the theory about time diff.


Version-Release number of selected component (if applicable):
sssd-1.13.3-45.el6.ppc64

How reproducible:
always

Steps to Reproduce:
1. install a fresh ppc64 machine 
2. run the test  /CoreOS/authconfig/Regression/bz975203-Authconfig-does-not-restart-SSSD-or-Winbind-when

Actual results:
sssd_pac crash

Expected results:
sssd start

Additional info:

The backtrace of the core file:

(gdb) bt
#0  _dl_vdso_vsym (name=0x191e80 <Address 0x191e80 out of bounds>, vers=0xfffd1c6eb80) at ../sysdeps/unix/sysv/linux/dl-vdso.c:28
#1  0x00000fffa17237cc in gettimeofday_ifunc () at ../sysdeps/unix/sysv/linux/powerpc/gettimeofday.c:41
#2  0x00000fffa2404ef4 in resolve_ifunc (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:551
#3  elf_machine_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:587
#4  elf_dynamic_do_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at do-rel.h:120
#5  _dl_relocate_object (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at dl-reloc.c:268
#6  0x00000fffa23f8a34 in dl_main (phdr=<value optimized out>, phnum=<value optimized out>, user_entry=<value optimized out>, auxv=<value optimized out>) at rtld.c:2262
#7  0x00000fffa2411418 in _dl_sysdep_start (start_argptr=<value optimized out>, dl_main=@0xfffa2430148: 0xfffa23f76e0 <dl_main>) at ../elf/dl-sysdep.c:244
#8  0x00000fffa23f5e64 in _dl_start_final (arg=0xfffd1c6f630, info=<value optimized out>) at rtld.c:335
#9  0x00000fffa23f614c in _dl_start (arg=0xfffd1c6f630) at rtld.c:563
#10 0x00000fffa23f5510 in ._start () from /lib64/ld64.so.1
(gdb) bt f
#0  _dl_vdso_vsym (name=0x191e80 <Address 0x191e80 out of bounds>, vers=0xfffd1c6eb80) at ../sysdeps/unix/sysv/linux/dl-vdso.c:28
        map = <value optimized out>
        value = <value optimized out>
#1  0x00000fffa17237cc in gettimeofday_ifunc () at ../sysdeps/unix/sysv/linux/powerpc/gettimeofday.c:41
        linux2615 = {name = 0x191e70 <Address 0x191e70 out of bounds>, hash = 123718565, hidden = 1, filename = 0x0}
        vdso_gettimeofday = <value optimized out>
#2  0x00000fffa2404ef4 in resolve_ifunc (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:551
        opd = {fd_func = 17590599694224, fd_toc = 17590600809344, fd_aux = 866336}
#3  elf_machine_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:587
        refsym = 0xfffa1010610
        value = <value optimized out>
        reloc_addr = 0xfffa1030890
        r_type = 21
        sym_map = <value optimized out>
#4  elf_dynamic_do_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at do-rel.h:120
        ndx = <value optimized out>
        version = 0xfffa1010e76
        symtab = 0xfffa10102e0
        relative = <value optimized out>
        r = 0xfffa1012130
        end = 0xfffa1012298
        l_addr = 17590592274432
#5  _dl_relocate_object (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at dl-reloc.c:268
        ranges_index = <value optimized out>
        ranges = {{start = 17590592278536, size = 4752, lazy = 0}, {start = 17590591492304, size = 0, lazy = 1}}
        edr_lazy = <value optimized out>
        textrels = 0x0
        errstring = 0x0
        lazy = <value optimized out>
#6  0x00000fffa23f8a34 in dl_main (phdr=<value optimized out>, phnum=<value optimized out>, user_entry=<value optimized out>, auxv=<value optimized out>) at rtld.c:2262
        lnp = <value optimized out>
        consider_profiling = 0
        start = 21545157465003374
        stop = <value optimized out>
        l = 0xfffa243f928
        ph = <value optimized out>
        mode = normal
        main_map = 0xfffa2432a70
        file_size = 0
        file = <value optimized out>
        has_interp = <value optimized out>
        i = <value optimized out>
        prelinked = <value optimized out>
        rtld_is_main = <value optimized out>
        start = <value optimized out>
        stop = <value optimized out>
        diff = <value optimized out>
        tcbp = 0xfffa10081b0
        first_preload = 0xfffa2433018
        r = 0x4c8b34e154d56e
        rtld_ehdr = <value optimized out>
        rtld_phdr = <value optimized out>
        cnt = <value optimized out>
        count_modids = 0
        preloads = <value optimized out>
        npreloads = <value optimized out>
        preload_file = "/etc/ld.so.preload"
        rtld_multiple_ref = true
        was_tls_init_tp_called = false
#7  0x00000fffa2411418 in _dl_sysdep_start (start_argptr=<value optimized out>, dl_main=@0xfffa2430148: 0xfffa23f76e0 <dl_main>) at ../elf/dl-sysdep.c:244
        phdr = <value optimized out>
        phnum = <value optimized out>
        user_entry = 268573936
        av = <value optimized out>
#8  0x00000fffa23f5e64 in _dl_start_final (arg=0xfffd1c6f630, info=<value optimized out>) at rtld.c:335
---Type <return> to continue, or q <return> to quit---
        start_addr = <value optimized out>
        rtld_total_time = <value optimized out>
#9  0x00000fffa23f614c in _dl_start (arg=0xfffd1c6f630) at rtld.c:563
        info = {l = {l_addr = 17590613114880, l_name = 0x0, l_ld = 0xfffa242fe60, l_next = 0x0, l_prev = 0x0, l_real = 0x0, l_ns = 0, l_libname = 0x0, l_info = {0x0, 0x0, 0xfffa242fee0, 0xfffa242fed0, 0xfffa242fe70, 0xfffa242fe90, 
    0xfffa242fea0, 0xfffa242ff20, 0xfffa242ff30, 0xfffa242ff40, 0xfffa242feb0, 0xfffa242fec0, 0x0, 0x0, 0xfffa242fe60, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffa242fef0, 0x0, 0x0, 0xfffa242ff00, 0xfffa242ff70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
    0x0, 0x0, 0xfffa242ff10, 0x0, 0x0, 0x0, 0x0, 0xfffa242ff60, 0xfffa242ff50, 0xfffa242ff80, 0x0, 0xfffa242ffa0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffa242ff90, 0x0 <repeats 25 times>, 0xfffa242fe80}, l_phdr = 0x0, l_entry = 0, 
            l_phnum = 0, l_ldnum = 0, l_searchlist = {r_list = 0x0, r_nlist = 0}, l_symbolic_searchlist = {r_list = 0x0, r_nlist = 0}, l_loader = 0x0, l_versions = 0x0, l_nversions = 0, l_nbuckets = 0, l_gnu_bitmask_idxbits = 0, 
            l_gnu_shift = 0, l_gnu_bitmask = 0x0, {l_gnu_buckets = 0x0, l_chain = 0x0}, {l_gnu_chain_zero = 0x0, l_buckets = 0x0}, l_direct_opencount = 0, l_type = lt_executable, l_relocated = 1, l_init_called = 0, l_global = 0, 
            l_reserved = 0, l_phdr_allocated = 0, l_soname_added = 0, l_faked = 0, l_need_tls_init = 0, l_auditing = 0, l_audit_any_plt = 0, l_removed = 0, l_contiguous = 0, l_symbolic_in_local_scope = 0, l_free_initfini = 0, 
            l_rpath_dirs = {dirs = 0x0, malloced = 0}, l_reloc_result = 0x0, l_versyms = 0x0, l_origin = 0x0, l_map_start = 0, l_map_end = 0, l_text_end = 0, l_scope_mem = {0x0, 0x0, 0x0, 0x0}, l_scope_max = 0, l_scope = 0x0, 
            l_local_scope = {0x0, 0x0}, l_dev = 0, l_ino = 0, l_runpath_dirs = {dirs = 0x0, malloced = 0}, l_initfini = 0x0, l_reldeps = 0x0, l_reldepsmax = 0, l_used = 0, l_feature_1 = 0, l_flags_1 = 0, l_flags = 0, l_idx = 0, l_mach = 
    {<No data fields>}, l_lookup_cache = {sym = 0x0, type_class = 0, value = 0x0, ret = 0x0}, l_tls_initimage = 0x0, l_tls_initimage_size = 0, l_tls_blocksize = 0, l_tls_align = 0, l_tls_firstbyte_offset = 0, l_tls_offset = -1, 
            l_tls_modid = 0, l_relro_addr = 0, l_relro_size = 0, l_serial = 0, l_audit = 0xfffd1c6f0f0}, start_time = 21545157452809918}
#10 0x00000fffa23f5510 in ._start () from /lib64/ld64.so.1
No symbol table info available.
(gdb)

Comment 5 Sumit Bose 2016-11-25 17:28:21 UTC
If I read the backtrace correctly the crash happened while ld64.so is preparing the runtime environment. Reassigning to glibc.

Comment 6 Florian Weimer 2016-11-25 17:51:04 UTC
Is there a simple command I can run to reproduce this issue?

I would like to see the output with the LD_DEBUG=all environment variable set.

Comment 8 Patrik Kis 2016-11-28 10:31:55 UTC
Created attachment 1225227 [details]
ld debug of sssd start

(In reply to Florian Weimer from comment #6)
> Is there a simple command I can run to reproduce this issue?
> 
I can reproduce it only with my test and only on a freshly installed machine.

> I would like to see the output with the LD_DEBUG=all environment variable
> set.

Attached are the debug files from sssd start. The PID of crashed process was 6546.
Let me know if you need other traces.

Comment 9 Carlos O'Donell 2016-12-05 19:44:49 UTC
The crash looks like it happens when libsamba-debug-samba4.so is being relocated 

      6546:	relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so
...
      6546:	binding file /usr/lib64/samba/libsamba-debug-samba4.so [0] to /lib64/libc.so.6 [0]: normal symbol `gettimeofday' [GLIBC_2.3]

This should continue on via the IFUNC resolver to lookup __kernel_gettimeofday and bind that to the kernel vDSO version of the function, but it doesn't seem like we get that far.

In fact from the line nubmers we crash referencing GLRO(dl_sysinfo_map), which should be the link map given to us by the kernel.

If I had to guess you're running an old kernel with broken vDSO support, but you did say "install fresh" so I assume this is a rhel-6.9 system running a rhel-6.9 kernel.

We will need the machine setup again that reproduces the issue.

Comment 13 Carlos O'Donell 2016-12-15 03:05:57 UTC
(In reply to Patrik Kis from comment #12)
> I might found another test where this issue appears.
> 
> It's a test where slapd from openldap is executed with valgrind. I see
> illegal opcode errors in libpthread. The reason why I think it may be
> related is that it also happens only on ppc64 and the problem disappears
> after about a hour.
> 
> I hope it can help to find the root cause. Please let me know if you need
> more info.

> ==9135== valgrind: Unrecognised instruction at address 0x507a858.
> ==9135==    at 0x507A858: ??? (in /lib64/libpthread-2.12.so)
> ==9135==    by 0x507A86F: nptl_freeres (in /lib64/libpthread-2.12.so)
> ==9135==    by 0x5251633: freeres_libptread (in /lib64/libc-2.12.so)
> ==9135==    by 0x52512EF: __libc_freeres (in /lib64/libc-2.12.so)
> ==9135==    by 0x48509C3: _vgnU_freeres (vg_preloaded.c:62)
> ==9135==    by 0x5138F03: exit (exit.c:93)
> ==9135==    by 0x511C0EF: (below main) (libc-start.c:258)

Everything in this sequence is normal except for the fact that nptl_freeres appears to jump into the middle of nowhere. This is exceptionally odd, even odder is that it goes away, indicative of a concurrency issue. I would initially suspect a faulty test case that ends up calling exit() concurrently from two threads, but that's just a guess.

Can you get me access to the box and the exact commands that reproduce the issue and the sources for those commands?

Comment 23 Florian Weimer 2017-06-06 16:42:56 UTC
(In reply to Patrik Kis from comment #20)
> I'm providing here the reproducer for booth cases I have found so far. All
> cases are reproducible for 1-2 hours after the machine was provisioned.
> Sorry I have not posted it here earlier.

Patrik, could you check if disabling prelinking makes a difference to this issue?  If it's somehow prelink-related, that would explain the timing aspect.

Comment 24 Patrik Kis 2017-06-07 15:55:51 UTC
(In reply to Florian Weimer from comment #23)
> (In reply to Patrik Kis from comment #20)
> > I'm providing here the reproducer for booth cases I have found so far. All
> > cases are reproducible for 1-2 hours after the machine was provisioned.
> > Sorry I have not posted it here earlier.
> 
> Patrik, could you check if disabling prelinking makes a difference to this
> issue?  If it's somehow prelink-related, that would explain the timing
> aspect.

Can't see any difference, provided that I correctly disabled prelink, what I'm not sure. What I did is:

# vim /etc/sysconfig/prelink
# grep PRELINKING /etc/sysconfig/prelink
PRELINKING=no
# prelink -ua

even rebooted the machine and my test is still failing the same way as before.

Comment 25 Sumit Bose 2017-06-28 14:49:00 UTC
There is a report on the freeipa-users list where this issue was seen while upgrading from RHEL-6.6 to RHEL-6.7, please see https://lists.fedoraproject.org/archives/list/freeipa-users@lists.fedorahosted.org/message/PJ2EDIPJ4EW2XNVRFORZBZPHVXINMYR4/ for details.

Comment 26 aheverle 2017-06-30 19:28:56 UTC
*** Bug 1466897 has been marked as a duplicate of this bug. ***

Comment 29 Florian Weimer 2017-06-30 20:54:34 UTC
(In reply to Patrik Kis from comment #24)
> Can't see any difference, provided that I correctly disabled prelink, what
> I'm not sure. What I did is:
> 
> # vim /etc/sysconfig/prelink
> # grep PRELINKING /etc/sysconfig/prelink
> PRELINKING=no
> # prelink -ua
> 
> even rebooted the machine and my test is still failing the same way as
> before.

I was wondering if disabling prelink makes the test fail consistently.  You previously said that it miraculously started working after a while.  Thanks.

Comment 30 Florian Weimer 2017-06-30 21:33:13 UTC
(In reply to Patrik Kis from comment #8)
> Attached are the debug files from sssd start. The PID of crashed process was
> 6546.

Here are all the “relocation processing” lines from that file.

    855:      6546:	relocation processing: /lib64/libfreebl3.so (lazy)
    995:      6546:	relocation processing: /lib64/libz.so.1 (lazy)
   1377:      6546:	relocation processing: /usr/lib64/libsasl2.so.2 (lazy)
   4158:      6546:	relocation processing: /lib64/libresolv.so.2 (lazy)
   4369:      6546:	relocation processing: /usr/lib64/libpath_utils.so.1 (lazy)
   4506:      6546:	relocation processing: /lib64/libcrypt.so.1 (lazy)
   4705:      6546:	relocation processing: /usr/lib64/samba/libreplace-samba4.so
   5058:      6546:	relocation processing: /usr/lib64/samba/libsocket-blocking-samba4.so
   5312:      6546:	relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so

Curiously, it's not showing relocation processing for libc.so.6.  Something could be wrong with the DSO dependency sorting (although libc.so.6 does come last on the search path).  The lack of libc.so.6 relocation processing explains why GLRO(dl_sysinfo_map) symbol has not been relocated when the IFUNC resolver runs (see comment 9).

Comment 32 Florian Weimer 2017-06-30 21:54:52 UTC
The order of relocation processing may have been fixed upstream using this commit:

commit 2bc174332ba6ddbd1b855dced33889bef56e8ba3
Author: Andreas Schwab <schwab>
Date:   Tue Aug 30 15:37:54 2011 +0200

    Relocate objects in dependency order

Prelink works around this issue, which explains the timing dependency.

Comment 39 Florian Weimer 2017-07-14 05:47:26 UTC
*** Bug 1425587 has been marked as a duplicate of this bug. ***

Comment 40 Florian Weimer 2017-07-14 05:54:47 UTC
It is likely that *all* binaries which are linked with -z now (BIND_NOW) and use gettimeofday trigger this issue.  This means that using BIND_NOW on Red Hat Enterprise Linux 6 cannot be recommended until this bug is fixed.

Comment 41 Josh Stone 2017-07-14 16:20:47 UTC
Do you recommend against BIND_NOW for all arches, or just ppc64?
(I see that the commit in comment 32 is not arch-specific.)

Comment 43 Florian Weimer 2017-07-24 11:13:07 UTC
(In reply to Josh Stone from comment #41)
> Do you recommend against BIND_NOW for all arches, or just ppc64?
> (I see that the commit in comment 32 is not arch-specific.)

gettimeofday is not an IFUNC on many architectures, so that part of comment 40 is architecture-specific.  In addition, I couldn't find any IFUNC resolvers on x86-64 which require relocations, so using BIND_NOW on x86-64 should be fairly safe (by accident).  However, the generic issue affects all architectures, and could also occur with other libraries which use IFUNC resolvers (not just glibc).

Comment 46 Florian Weimer 2017-09-18 15:37:31 UTC
Red Hat Enterprise Linux is currently in Production Phase 3 which means that
only urgent priority bug fixes will be considered.

We will not be enhancing the dynamic loader to support BIND_NOW with IFUNC relocations.  (A key part of this work has not yet been implemented upstream yet.)  Instead, we have suggested to the Samba developers to rebuild Samba without BIND_NOW.  See bug 1492780.