Bug 1398716
Summary: | glibc: ld.so relocation processing in forward dependency order breaks IFUNC resolvers
---|---
Product: | Red Hat Enterprise Linux 6
Reporter: | Patrik Kis <pkis>
Component: | glibc
Assignee: | glibc team <glibc-bugzilla>
Status: | CLOSED WONTFIX
QA Contact: | qe-baseos-tools-bugs
Severity: | medium
Priority: | high
Version: | 6.9
CC: | aheverle, ashankar, codonell, dlavu, fweimer, grajaiya, jhrozek, jistone, kcleveng, lslebodn, mkosek, mnewsome, mzidek, pbrezina, pfrankli, pkis, sbose
Target Milestone: | rc
Keywords: | Patch
Hardware: | ppc64
OS: | Linux
Last Closed: | 2017-09-18 15:37:31 UTC
Type: | Bug
Description
Patrik Kis
2016-11-25 16:34:20 UTC
If I read the backtrace correctly, the crash happened while ld64.so was preparing the runtime environment. Reassigning to glibc.

Is there a simple command I can run to reproduce this issue?

I would like to see the output with the LD_DEBUG=all environment variable set.

Created attachment 1225227
ld debug of sssd start

(In reply to Florian Weimer from comment #6)
> Is there a simple command I can run to reproduce this issue?

I can reproduce it only with my test and only on a freshly installed machine.

> I would like to see the output with the LD_DEBUG=all environment variable
> set.

Attached are the debug files from sssd start. The PID of the crashed process was 6546.
Let me know if you need other traces.

The crash looks like it happens when libsamba-debug-samba4.so is being relocated:

    6546: relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so
    ...
    6546: binding file /usr/lib64/samba/libsamba-debug-samba4.so [0] to /lib64/libc.so.6 [0]: normal symbol `gettimeofday' [GLIBC_2.3]

This should continue on via the IFUNC resolver to look up __kernel_gettimeofday and bind that to the kernel vDSO version of the function, but it doesn't seem like we get that far. In fact, from the line numbers, we crash referencing GLRO(dl_sysinfo_map), which should be the link map given to us by the kernel.

If I had to guess, you're running an old kernel with broken vDSO support, but you did say "install fresh", so I assume this is a rhel-6.9 system running a rhel-6.9 kernel.

We will need the machine that reproduces the issue set up again.

(In reply to Patrik Kis from comment #12)
> I might have found another test where this issue appears.
>
> It's a test where slapd from openldap is executed with valgrind. I see
> illegal opcode errors in libpthread. The reason why I think it may be
> related is that it also happens only on ppc64 and the problem disappears
> after about an hour.
>
> I hope it can help to find the root cause. Please let me know if you need
> more info.
> ==9135== valgrind: Unrecognised instruction at address 0x507a858.
> ==9135==    at 0x507A858: ??? (in /lib64/libpthread-2.12.so)
> ==9135==    by 0x507A86F: nptl_freeres (in /lib64/libpthread-2.12.so)
> ==9135==    by 0x5251633: freeres_libptread (in /lib64/libc-2.12.so)
> ==9135==    by 0x52512EF: __libc_freeres (in /lib64/libc-2.12.so)
> ==9135==    by 0x48509C3: _vgnU_freeres (vg_preloaded.c:62)
> ==9135==    by 0x5138F03: exit (exit.c:93)
> ==9135==    by 0x511C0EF: (below main) (libc-start.c:258)

Everything in this sequence is normal except for the fact that nptl_freeres appears to jump into the middle of nowhere. This is exceptionally odd; even odder is that it goes away, which is indicative of a concurrency issue. I would initially suspect a faulty test case that ends up calling exit() concurrently from two threads, but that's just a guess.

Can you get me access to the box, the exact commands that reproduce the issue, and the sources for those commands?

(In reply to Patrik Kis from comment #20)
> I'm providing here the reproducer for both cases I have found so far. All
> cases are reproducible for 1-2 hours after the machine was provisioned.
> Sorry I have not posted it here earlier.

Patrik, could you check if disabling prelinking makes a difference to this issue? If it's somehow prelink-related, that would explain the timing aspect.

(In reply to Florian Weimer from comment #23)
> (In reply to Patrik Kis from comment #20)
> > I'm providing here the reproducer for both cases I have found so far. All
> > cases are reproducible for 1-2 hours after the machine was provisioned.
> > Sorry I have not posted it here earlier.
>
> Patrik, could you check if disabling prelinking makes a difference to this
> issue? If it's somehow prelink-related, that would explain the timing
> aspect.

Can't see any difference, provided that I correctly disabled prelink, which I'm not sure about.
What I did is:

    # vim /etc/sysconfig/prelink
    # grep PRELINKING /etc/sysconfig/prelink
    PRELINKING=no
    # prelink -ua

I even rebooted the machine and my test is still failing the same way as before.

There is a report on the freeipa-users list where this issue was seen while upgrading from RHEL-6.6 to RHEL-6.7, please see https://lists.fedoraproject.org/archives/list/freeipa-users@lists.fedorahosted.org/message/PJ2EDIPJ4EW2XNVRFORZBZPHVXINMYR4/ for details.

*** Bug 1466897 has been marked as a duplicate of this bug. ***

(In reply to Patrik Kis from comment #24)
> Can't see any difference, provided that I correctly disabled prelink, which
> I'm not sure about. What I did is:
>
> # vim /etc/sysconfig/prelink
> # grep PRELINKING /etc/sysconfig/prelink
> PRELINKING=no
> # prelink -ua
>
> I even rebooted the machine and my test is still failing the same way as
> before.

I was wondering if disabling prelink makes the test fail consistently. You previously said that it miraculously started working after a while.

Thanks.

(In reply to Patrik Kis from comment #8)
> Attached are the debug files from sssd start. The PID of the crashed process was
> 6546.

Here are all the “relocation processing” lines from that file.

    855: 6546: relocation processing: /lib64/libfreebl3.so (lazy)
    995: 6546: relocation processing: /lib64/libz.so.1 (lazy)
    1377: 6546: relocation processing: /usr/lib64/libsasl2.so.2 (lazy)
    4158: 6546: relocation processing: /lib64/libresolv.so.2 (lazy)
    4369: 6546: relocation processing: /usr/lib64/libpath_utils.so.1 (lazy)
    4506: 6546: relocation processing: /lib64/libcrypt.so.1 (lazy)
    4705: 6546: relocation processing: /usr/lib64/samba/libreplace-samba4.so
    5058: 6546: relocation processing: /usr/lib64/samba/libsocket-blocking-samba4.so
    5312: 6546: relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so

Curiously, it's not showing relocation processing for libc.so.6. Something could be wrong with the DSO dependency sorting (although libc.so.6 does come last on the search path).
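As a rough aid for repeating this check on another LD_DEBUG log, the following Python sketch extracts the relocation order for one PID and reports whether a given library shows up in it. The helper names (`relocation_order`, `was_relocated`) are hypothetical, the line pattern is an assumption based on the excerpt quoted in this bug, and the sample input is a subset of those quoted lines rather than the full attachment.

```python
import re

# Assumed line shape, based on the excerpt quoted in this bug:
#   6546: relocation processing: /lib64/libz.so.1 (lazy)
RELOC_LINE = re.compile(
    r"^\s*(\d+):\s+relocation processing:\s+(\S+)(?:\s+\((lazy)\))?"
)


def relocation_order(lines, pid):
    """Return (path, is_lazy) tuples for one PID, in the order they were logged."""
    order = []
    for line in lines:
        match = RELOC_LINE.match(line)
        if match and int(match.group(1)) == pid:
            order.append((match.group(2), match.group(3) == "lazy"))
    return order


def was_relocated(order, soname):
    """True if an object whose file name equals soname appears in the order."""
    return any(path.rsplit("/", 1)[-1] == soname for path, _ in order)


# Subset of the lines quoted in this bug (grep line numbers removed).
SAMPLE_LOG = """\
6546: relocation processing: /lib64/libz.so.1 (lazy)
6546: relocation processing: /lib64/libcrypt.so.1 (lazy)
6546: relocation processing: /usr/lib64/samba/libreplace-samba4.so
6546: relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so
6546: binding file /usr/lib64/samba/libsamba-debug-samba4.so [0] to /lib64/libc.so.6 [0]: normal symbol `gettimeofday' [GLIBC_2.3]
"""

order = relocation_order(SAMPLE_LOG.splitlines(), 6546)
for path, lazy in order:
    print(("lazy " if lazy else "now  ") + path)
print("libc.so.6 relocated:", was_relocated(order, "libc.so.6"))
```

In the quoted excerpt, the Samba libraries are the only entries without the `(lazy)` marker, which would be consistent with them being processed with immediate binding, although the log alone does not prove how they were linked.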
The lack of libc.so.6 relocation processing explains why the GLRO(dl_sysinfo_map) symbol has not been relocated when the IFUNC resolver runs (see comment 9).

The order of relocation processing may have been fixed upstream using this commit:

    commit 2bc174332ba6ddbd1b855dced33889bef56e8ba3
    Author: Andreas Schwab <schwab>
    Date:   Tue Aug 30 15:37:54 2011 +0200

        Relocate objects in dependency order

Prelink works around this issue, which explains the timing dependency.

*** Bug 1425587 has been marked as a duplicate of this bug. ***

It is likely that *all* binaries which are linked with -z now (BIND_NOW) and use gettimeofday trigger this issue. This means that using BIND_NOW on Red Hat Enterprise Linux 6 cannot be recommended until this bug is fixed.

Do you recommend against BIND_NOW for all arches, or just ppc64?
(I see that the commit in comment 32 is not arch-specific.)

(In reply to Josh Stone from comment #41)
> Do you recommend against BIND_NOW for all arches, or just ppc64?
> (I see that the commit in comment 32 is not arch-specific.)

gettimeofday is not an IFUNC on many architectures, so that part of comment 40 is architecture-specific. In addition, I couldn't find any IFUNC resolvers on x86-64 which require relocations, so using BIND_NOW on x86-64 should be fairly safe (by accident). However, the generic issue affects all architectures, and could also occur with other libraries which use IFUNC resolvers (not just glibc).

Red Hat Enterprise Linux is currently in Production Phase 3, which means that only urgent priority bug fixes will be considered. We will not be enhancing the dynamic loader to support BIND_NOW with IFUNC relocations. (A key part of this work has not yet been implemented upstream.)

Instead, we have suggested that the Samba developers rebuild Samba without BIND_NOW. See bug 1492780.
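To illustrate why the relocation order matters for IFUNC resolvers, here is a toy Python model of "relocate objects in dependency order". It is only a sketch under stated assumptions: the dependency edges are hypothetical (not taken from the attached log), the function names are invented for this example, and the post-order traversal is a generic technique, not the dynamic loader's actual implementation.

```python
def dependency_first_order(deps, root):
    """Post-order depth-first walk: each object is listed after its dependencies."""
    order, seen = [], set()

    def visit(obj):
        if obj in seen:
            return
        seen.add(obj)
        for dep in deps.get(obj, []):
            visit(dep)
        order.append(obj)

    visit(root)
    return order


def unrelocated_dependencies(order, deps):
    """List (object, dependency) pairs where the dependency is relocated too late."""
    done, problems = set(), []
    for obj in order:
        for dep in deps.get(obj, []):
            if dep not in done:
                problems.append((obj, dep))
        done.add(obj)
    return problems


# Hypothetical dependency edges, for illustration only.
DEPS = {
    "sssd": ["libsamba-debug-samba4.so", "libc.so.6"],
    "libsamba-debug-samba4.so": ["libreplace-samba4.so", "libc.so.6"],
    "libreplace-samba4.so": ["libc.so.6"],
    "libc.so.6": [],
}

good = dependency_first_order(DEPS, "sssd")
bad = list(reversed(good))  # dependents first, dependencies last

print(good)
print(unrelocated_dependencies(good, DEPS))
print(unrelocated_dependencies(bad, DEPS))
```

In the dependents-first order, libsamba-debug-samba4.so is processed while libc.so.6 is still unrelocated. With immediate binding, that is the situation in which an IFUNC resolver living in libc.so.6 could run before its own data has been relocated. The `seen` set only prevents infinite recursion on dependency cycles; it does not define a meaningful order for them.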