Hide Forgot
Description of problem: Could you please help investigate the following case? The issue is that sssd does not start when ipa-client-install is executed just after ppc64 machine install. The reason why it does not start is that sssd_pac crashes. Abrt caught the core dump. I also managed to start sssd with debug bug there is nothing for sssd_pac. The issue happen only on ppc64 machines, however I'm not entirely sure it's architecture specific, the core dump suggests it's again time sync related. It does not happen when the ipa-client-install is executed after some time (after an hour or two it is always passing), that also supports the theory about time diff. Version-Release number of selected component (if applicable): sssd-1.13.3-45.el6.ppc64 How reproducible: always Steps to Reproduce: 1. install a fresh ppc64 machine 2. run the test /CoreOS/authconfig/Regression/bz975203-Authconfig-does-not-restart-SSSD-or-Winbind-when Actual results: sssd_pac crash Expected results: sssd start Additional info: The backtrace of the core file: (gdb) bt #0 _dl_vdso_vsym (name=0x191e80 <Address 0x191e80 out of bounds>, vers=0xfffd1c6eb80) at ../sysdeps/unix/sysv/linux/dl-vdso.c:28 #1 0x00000fffa17237cc in gettimeofday_ifunc () at ../sysdeps/unix/sysv/linux/powerpc/gettimeofday.c:41 #2 0x00000fffa2404ef4 in resolve_ifunc (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:551 #3 elf_machine_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:587 #4 elf_dynamic_do_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at do-rel.h:120 #5 _dl_relocate_object (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at dl-reloc.c:268 #6 0x00000fffa23f8a34 in dl_main (phdr=<value optimized out>, phnum=<value optimized out>, user_entry=<value optimized out>, auxv=<value optimized out>) at rtld.c:2262 #7 0x00000fffa2411418 in _dl_sysdep_start (start_argptr=<value optimized out>, dl_main=@0xfffa2430148: 0xfffa23f76e0 <dl_main>) at ../elf/dl-sysdep.c:244 #8 0x00000fffa23f5e64 in _dl_start_final (arg=0xfffd1c6f630, info=<value optimized out>) at rtld.c:335 #9 0x00000fffa23f614c in _dl_start (arg=0xfffd1c6f630) at rtld.c:563 #10 0x00000fffa23f5510 in ._start () from /lib64/ld64.so.1 (gdb) bt f #0 _dl_vdso_vsym (name=0x191e80 <Address 0x191e80 out of bounds>, vers=0xfffd1c6eb80) at ../sysdeps/unix/sysv/linux/dl-vdso.c:28 map = <value optimized out> value = <value optimized out> #1 0x00000fffa17237cc in gettimeofday_ifunc () at ../sysdeps/unix/sysv/linux/powerpc/gettimeofday.c:41 linux2615 = {name = 0x191e70 <Address 0x191e70 out of bounds>, hash = 123718565, hidden = 1, filename = 0x0} vdso_gettimeofday = <value optimized out> #2 0x00000fffa2404ef4 in resolve_ifunc (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:551 opd = {fd_func = 17590599694224, fd_toc = 17590600809344, fd_aux = 866336} #3 elf_machine_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at ../sysdeps/powerpc/powerpc64/dl-machine.h:587 refsym = 0xfffa1010610 value = <value optimized out> reloc_addr = 0xfffa1030890 r_type = 21 sym_map = <value optimized out> #4 elf_dynamic_do_rela (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at do-rel.h:120 ndx = <value optimized out> version = 0xfffa1010e76 symtab = 0xfffa10102e0 relative = <value optimized out> r = 0xfffa1012130 end = 0xfffa1012298 l_addr = 17590592274432 #5 _dl_relocate_object (scope=0xfffa243fc98, reloc_mode=<value optimized out>, consider_profiling=<value optimized out>) at dl-reloc.c:268 ranges_index = <value optimized out> ranges = {{start = 17590592278536, size = 4752, lazy = 0}, {start = 17590591492304, size = 0, lazy = 1}} edr_lazy = <value optimized out> textrels = 0x0 errstring = 0x0 lazy = <value optimized out> #6 0x00000fffa23f8a34 in dl_main (phdr=<value optimized out>, phnum=<value optimized out>, user_entry=<value optimized out>, auxv=<value optimized out>) at rtld.c:2262 lnp = <value optimized out> consider_profiling = 0 start = 21545157465003374 stop = <value optimized out> l = 0xfffa243f928 ph = <value optimized out> mode = normal main_map = 0xfffa2432a70 file_size = 0 file = <value optimized out> has_interp = <value optimized out> i = <value optimized out> prelinked = <value optimized out> rtld_is_main = <value optimized out> start = <value optimized out> stop = <value optimized out> diff = <value optimized out> tcbp = 0xfffa10081b0 first_preload = 0xfffa2433018 r = 0x4c8b34e154d56e rtld_ehdr = <value optimized out> rtld_phdr = <value optimized out> cnt = <value optimized out> count_modids = 0 preloads = <value optimized out> npreloads = <value optimized out> preload_file = "/etc/ld.so.preload" rtld_multiple_ref = true was_tls_init_tp_called = false #7 0x00000fffa2411418 in _dl_sysdep_start (start_argptr=<value optimized out>, dl_main=@0xfffa2430148: 0xfffa23f76e0 <dl_main>) at ../elf/dl-sysdep.c:244 phdr = <value optimized out> phnum = <value optimized out> user_entry = 268573936 av = <value optimized out> #8 0x00000fffa23f5e64 in _dl_start_final (arg=0xfffd1c6f630, info=<value optimized out>) at rtld.c:335 ---Type <return> to continue, or q <return> to quit--- start_addr = <value optimized out> rtld_total_time = <value optimized out> #9 0x00000fffa23f614c in _dl_start (arg=0xfffd1c6f630) at rtld.c:563 info = {l = {l_addr = 17590613114880, l_name = 0x0, l_ld = 0xfffa242fe60, l_next = 0x0, l_prev = 0x0, l_real = 0x0, l_ns = 0, l_libname = 0x0, l_info = {0x0, 0x0, 0xfffa242fee0, 0xfffa242fed0, 0xfffa242fe70, 0xfffa242fe90, 0xfffa242fea0, 0xfffa242ff20, 0xfffa242ff30, 0xfffa242ff40, 0xfffa242feb0, 0xfffa242fec0, 0x0, 0x0, 0xfffa242fe60, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffa242fef0, 0x0, 0x0, 0xfffa242ff00, 0xfffa242ff70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffa242ff10, 0x0, 0x0, 0x0, 0x0, 0xfffa242ff60, 0xfffa242ff50, 0xfffa242ff80, 0x0, 0xfffa242ffa0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffa242ff90, 0x0 <repeats 25 times>, 0xfffa242fe80}, l_phdr = 0x0, l_entry = 0, l_phnum = 0, l_ldnum = 0, l_searchlist = {r_list = 0x0, r_nlist = 0}, l_symbolic_searchlist = {r_list = 0x0, r_nlist = 0}, l_loader = 0x0, l_versions = 0x0, l_nversions = 0, l_nbuckets = 0, l_gnu_bitmask_idxbits = 0, l_gnu_shift = 0, l_gnu_bitmask = 0x0, {l_gnu_buckets = 0x0, l_chain = 0x0}, {l_gnu_chain_zero = 0x0, l_buckets = 0x0}, l_direct_opencount = 0, l_type = lt_executable, l_relocated = 1, l_init_called = 0, l_global = 0, l_reserved = 0, l_phdr_allocated = 0, l_soname_added = 0, l_faked = 0, l_need_tls_init = 0, l_auditing = 0, l_audit_any_plt = 0, l_removed = 0, l_contiguous = 0, l_symbolic_in_local_scope = 0, l_free_initfini = 0, l_rpath_dirs = {dirs = 0x0, malloced = 0}, l_reloc_result = 0x0, l_versyms = 0x0, l_origin = 0x0, l_map_start = 0, l_map_end = 0, l_text_end = 0, l_scope_mem = {0x0, 0x0, 0x0, 0x0}, l_scope_max = 0, l_scope = 0x0, l_local_scope = {0x0, 0x0}, l_dev = 0, l_ino = 0, l_runpath_dirs = {dirs = 0x0, malloced = 0}, l_initfini = 0x0, l_reldeps = 0x0, l_reldepsmax = 0, l_used = 0, l_feature_1 = 0, l_flags_1 = 0, l_flags = 0, l_idx = 0, l_mach = {<No data fields>}, l_lookup_cache = {sym = 0x0, type_class = 0, value = 0x0, ret = 0x0}, l_tls_initimage = 0x0, l_tls_initimage_size = 0, l_tls_blocksize = 0, l_tls_align = 0, l_tls_firstbyte_offset = 0, l_tls_offset = -1, l_tls_modid = 0, l_relro_addr = 0, l_relro_size = 0, l_serial = 0, l_audit = 0xfffd1c6f0f0}, start_time = 21545157452809918} #10 0x00000fffa23f5510 in ._start () from /lib64/ld64.so.1 No symbol table info available. (gdb)
If I read the backtrace correctly the crash happened while ld64.so is preparing the runtime environment. Reassigning to glibc.
Is there a simple command I can run to reproduce this issue? I would like to see the output with the LD_DEBUG=all environment variable set.
Created attachment 1225227 [details] ld debug of sssd start (In reply to Florian Weimer from comment #6) > Is there a simple command I can run to reproduce this issue? > I can reproduce it only with my test and only on a freshly installed machine. > I would like to see the output with the LD_DEBUG=all environment variable > set. Attached are the debug files from sssd start. The PID of crashed process was 6546. Let me know if you need other traces.
The crash looks like it happens when libsamba-debug-samba4.so is being relocated 6546: relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so ... 6546: binding file /usr/lib64/samba/libsamba-debug-samba4.so [0] to /lib64/libc.so.6 [0]: normal symbol `gettimeofday' [GLIBC_2.3] This should continue on via the IFUNC resolver to lookup __kernel_gettimeofday and bind that to the kernel vDSO version of the function, but it doesn't seem like we get that far. In fact from the line nubmers we crash referencing GLRO(dl_sysinfo_map), which should be the link map given to us by the kernel. If I had to guess you're running an old kernel with broken vDSO support, but you did say "install fresh" so I assume this is a rhel-6.9 system running a rhel-6.9 kernel. We will need the machine setup again that reproduces the issue.
(In reply to Patrik Kis from comment #12) > I might found another test where this issue appears. > > It's a test where slapd from openldap is executed with valgrind. I see > illegal opcode errors in libpthread. The reason why I think it may be > related is that it also happens only on ppc64 and the problem disappears > after about a hour. > > I hope it can help to find the root cause. Please let me know if you need > more info. > ==9135== valgrind: Unrecognised instruction at address 0x507a858. > ==9135== at 0x507A858: ??? (in /lib64/libpthread-2.12.so) > ==9135== by 0x507A86F: nptl_freeres (in /lib64/libpthread-2.12.so) > ==9135== by 0x5251633: freeres_libptread (in /lib64/libc-2.12.so) > ==9135== by 0x52512EF: __libc_freeres (in /lib64/libc-2.12.so) > ==9135== by 0x48509C3: _vgnU_freeres (vg_preloaded.c:62) > ==9135== by 0x5138F03: exit (exit.c:93) > ==9135== by 0x511C0EF: (below main) (libc-start.c:258) Everything in this sequence is normal except for the fact that nptl_freeres appears to jump into the middle of nowhere. This is exceptionally odd, even odder is that it goes away, indicative of a concurrency issue. I would initially suspect a faulty test case that ends up calling exit() concurrently from two threads, but that's just a guess. Can you get me access to the box and the exact commands that reproduce the issue and the sources for those commands?
(In reply to Patrik Kis from comment #20) > I'm providing here the reproducer for booth cases I have found so far. All > cases are reproducible for 1-2 hours after the machine was provisioned. > Sorry I have not posted it here earlier. Patrik, could you check if disabling prelinking makes a difference to this issue? If it's somehow prelink-related, that would explain the timing aspect.
(In reply to Florian Weimer from comment #23) > (In reply to Patrik Kis from comment #20) > > I'm providing here the reproducer for booth cases I have found so far. All > > cases are reproducible for 1-2 hours after the machine was provisioned. > > Sorry I have not posted it here earlier. > > Patrik, could you check if disabling prelinking makes a difference to this > issue? If it's somehow prelink-related, that would explain the timing > aspect. Can't see any difference, provided that I correctly disabled prelink, what I'm not sure. What I did is: # vim /etc/sysconfig/prelink # grep PRELINKING /etc/sysconfig/prelink PRELINKING=no # prelink -ua even rebooted the machine and my test is still failing the same way as before.
There is a report on the freeipa-users list where this issue was seen while upgrading from RHEL-6.6 to RHEL-6.7, please see https://lists.fedoraproject.org/archives/list/freeipa-users@lists.fedorahosted.org/message/PJ2EDIPJ4EW2XNVRFORZBZPHVXINMYR4/ for details.
*** Bug 1466897 has been marked as a duplicate of this bug. ***
(In reply to Patrik Kis from comment #24) > Can't see any difference, provided that I correctly disabled prelink, what > I'm not sure. What I did is: > > # vim /etc/sysconfig/prelink > # grep PRELINKING /etc/sysconfig/prelink > PRELINKING=no > # prelink -ua > > even rebooted the machine and my test is still failing the same way as > before. I was wondering if disabling prelink makes the test fail consistently. You previously said that it miraculously started working after a while. Thanks.
(In reply to Patrik Kis from comment #8) > Attached are the debug files from sssd start. The PID of crashed process was > 6546. Here are all the “relocation processing” lines from that file. 855: 6546: relocation processing: /lib64/libfreebl3.so (lazy) 995: 6546: relocation processing: /lib64/libz.so.1 (lazy) 1377: 6546: relocation processing: /usr/lib64/libsasl2.so.2 (lazy) 4158: 6546: relocation processing: /lib64/libresolv.so.2 (lazy) 4369: 6546: relocation processing: /usr/lib64/libpath_utils.so.1 (lazy) 4506: 6546: relocation processing: /lib64/libcrypt.so.1 (lazy) 4705: 6546: relocation processing: /usr/lib64/samba/libreplace-samba4.so 5058: 6546: relocation processing: /usr/lib64/samba/libsocket-blocking-samba4.so 5312: 6546: relocation processing: /usr/lib64/samba/libsamba-debug-samba4.so Curiously, it's not showing relocation processing for libc.so.6. Something could be wrong with the DSO dependency sorting (although libc.so.6 does come last on the search path). The lack of libc.so.6 relocation processing explains why GLRO(dl_sysinfo_map) symbol has not been relocated when the IFUNC resolver runs (see comment 9).
The order of relocation processing may have been fixed upstream using this commit: commit 2bc174332ba6ddbd1b855dced33889bef56e8ba3 Author: Andreas Schwab <schwab> Date: Tue Aug 30 15:37:54 2011 +0200 Relocate objects in dependency order Prelink works around this issue, which explains the timing dependency.
*** Bug 1425587 has been marked as a duplicate of this bug. ***
It is likely that *all* binaries which are linked with -z now (BIND_NOW) and use gettimeofday trigger this issue. This means that using BIND_NOW on Red Hat Enterprise Linux 6 cannot be recommended until this bug is fixed.
Do you recommend against BIND_NOW for all arches, or just ppc64? (I see that the commit in comment 32 is not arch-specific.)
(In reply to Josh Stone from comment #41) > Do you recommend against BIND_NOW for all arches, or just ppc64? > (I see that the commit in comment 32 is not arch-specific.) gettimeofday is not an IFUNC on many architectures, so that part of comment 40 is architecture-specific. In addition, I couldn't find any IFUNC resolvers on x86-64 which require relocations, so using BIND_NOW on x86-64 should be fairly safe (by accident). However, the generic issue affects all architectures, and could also occur with other libraries which use IFUNC resolvers (not just glibc).
Red Hat Enterprise Linux is currently in Production Phase 3 which means that only urgent priority bug fixes will be considered. We will not be enhancing the dynamic loader to support BIND_NOW with IFUNC relocations. (A key part of this work has not yet been implemented upstream yet.) Instead, we have suggested to the Samba developers to rebuild Samba without BIND_NOW. See bug 1492780.