Bug 1377895
Summary: glibc: [LLNL 7.4 Bug] Relocation dependency on symbol, but no DT_NEEDED, causes incorrect startup sequence.
Product: Red Hat Enterprise Linux 8
Reporter: Ben Woodard <woodard>
Component: glibc
Assignee: glibc team <glibc-bugzilla>
Status: CLOSED MIGRATED
QA Contact: qe-baseos-tools-bugs
Severity: high
Priority: medium
Version: 8.2
CC: ashankar, balay, bloch, codonell, dj, foraker1, fweimer, kenneth.hoste, mnewsome, pasteur, pfrankli, tdhooge, tgummels, woodard
Target Milestone: rc
Keywords: MigratedToJIRA, Reopened, Triaged
Target Release: 8.0
Hardware: x86_64
OS: Linux
Last Closed: 2023-11-28 15:49:44 UTC
Type: Bug
Bug Blocks: 1599298
Description
Ben Woodard
2016-09-20 23:12:31 UTC
16:14 foraker: processor : 0
16:14 foraker: vendor_id : GenuineIntel
16:14 foraker: cpu family : 6
16:14 foraker: model : 79
16:14 foraker: model name : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
16:14 foraker: stepping : 1
16:14 foraker: microcode : 0xb00001d
16:14 foraker: cpu MHz : 2101.000
16:14 foraker: cache size : 46080 KB
16:14 foraker: physical id : 0
16:14 foraker: siblings : 36
16:14 foraker: core id : 0
16:14 foraker: cpu cores : 18
16:14 foraker: apicid : 0
16:14 foraker: initial apicid : 0
16:14 foraker: fpu : yes
16:14 foraker: fpu_exception : yes
16:14 foraker: cpuid level : 20
16:14 foraker: wp : yes
16:14 foraker: flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
16:15 foraker: bogomips : 4190.23
16:15 foraker: clflush size : 64
16:15 foraker: cache_alignment : 64
16:15 foraker: address sizes : 46 bits physical, 48 bits virtual
16:15 foraker: power management:

Some missing context from our conversation is here:

15:22 foraker: #0 0x00002aaaabfb065e in __libc_memmove () at ../sysdeps/x86_64/multiarch/memmove.c:52
15:22 foraker: #1 0x00002aaaaaab7675 in elf_machine_rela (reloc=0x2aaaada69e38, reloc=0x2aaaada69e38, skip_ifunc=<optimized out>, reloc_addr_arg=0x2aaaadccd8f0, version=0x0, sym=0x2aaaada64f58, map=0x2aaaaab0a548) at ../sysdeps/x86_64/dl-machine.h:288
15:22 foraker: #2 elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0x2aaaaab0a548) at do-rel.h:170
...

Which shows that the SIGSEGV occurs while running the IFUNC resolver for memmove, which is odd. On x86_64 the IFUNC resolver should simply be looking up entries in the runtime list of cpu-specific features (`GLRO(dl_x86_cpu_features)`) to determine which routine to pick based on the hardware.

We will need to reproduce this locally if we are going to make any informed comment about the icc-compiled binary. Are you able to get us a copy of the binary and associated libraries so we can run this in rhel-7 and debug the failure?

The only possible theory I have is that your libc.so.6 and ld.so are out of sync with each other, since their definition of the `GLRO(dl_x86_cpu_features)` structure is shared via a private interface, i.e. __get_cpu_features@@GLIBC_PRIVATE. This might result in libc.so.6 indexing outside of the size of the structure returned by ld.so.

So if I had to look for something right now, it would be to verify that the target environment is correctly configured, i.e. that /lib64/ld* and /lib64/libc.so.6 match.

Created attachment 1203100 [details]
mpicc and associated openmpi/icc libraries that exhibit the issue
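The resolver behaviour described in the analysis above can be pictured with a small conceptual model. This is only a sketch in Python, not glibc code: the variant labels, the `resolve_memmove` function and the use of `None` for a table that is not ready yet are all assumptions made for illustration.

```python
from typing import FrozenSet, Optional

# Hypothetical variant labels; these are NOT glibc's real symbol names.
GENERIC = "memmove_generic"
SSSE3 = "memmove_ssse3"


def resolve_memmove(cpu_features: Optional[FrozenSet[str]]) -> str:
    """Model of an IFUNC resolver: called once, when the relocation is
    processed, to choose which implementation the symbol binds to."""
    if cpu_features is None:
        # Stand-in for the failure in this bug: the resolver runs before
        # the data it reads has been set up by the loader.
        raise RuntimeError("resolver ran before its data was initialised")
    return SSSE3 if "ssse3" in cpu_features else GENERIC
```

In this model, a call such as `resolve_memmove(frozenset({"ssse3", "avx2"}))` picks the SSSE3 label, while passing `None` models a resolver that is invoked too early in the startup sequence.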
libc.so.6 and ld.so appear to match:

quartz2{foraker1}27: rpm -qf /lib64/libc.so.6 /lib64/ld-2.17.so
glibc-2.17-157.el7.x86_64
glibc-2.17-157.el7.x86_64
quartz2{foraker1}28: ls -l /lib64/ld-linux-x86-64.so.2
lrwxrwxrwx. 1 root root 10 Sep 1 17:49 /lib64/ld-linux-x86-64.so.2 -> ld-2.17.so

(In reply to Jim Foraker from comment #4)
> Created attachment 1203100 [details]
> mpicc and associated openmpi/icc libraries that exhibit the issue

The mpicc binary does not appear to be included in the tarball.

mpicc is in practice a symlink to opal_wrapper:

$ tar ztvf broken-mpicc.tar.gz | head -2
lrwxrwxrwx root/root 0 2016-09-01 18:04 opt/openmpi/1.10/intel/bin/mpicc -> opal_wrapper
-rwxr-xr-x root/root 150599 2016-06-28 14:13 opt/openmpi/1.10/intel/bin/opal_wrapper

This is standard behavior; many of OpenMPI's commands are symlinks to a small handful of binaries.

(In reply to Jim Foraker from comment #7)
> mpicc is in practice a symlink to opal_wrapper:
>
> $ tar ztvf broken-mpicc.tar.gz | head -2
> lrwxrwxrwx root/root 0 2016-09-01 18:04
> opt/openmpi/1.10/intel/bin/mpicc -> opal_wrapper
> -rwxr-xr-x root/root 150599 2016-06-28 14:13
> opt/openmpi/1.10/intel/bin/opal_wrapper
>
> This is standard behavior; many of OpenMPI's commands are symlinks to a
> small handful of binaries.

Ah, I had missed that.

I have installed these packages:

openmpi-1.10.3-3.el7.x86_64
openmpi-devel-1.10.3-3.el7.x86_64
glibc-2.17-157.el7.x86_64

And I still cannot reproduce the issue. All I get is this:

$ LD_LIBRARY_PATH=/usr/lib64/openmpi/lib /opt/openmpi/1.10/intel/bin/mpicc
gcc: fatal error: no input files
compilation terminated.

I don't know anything about mpicc or OpenMPI, so I'd appreciate precise reproduction instructions.

We are not using the RHEL-supplied OpenMPI RPMs. We compile our own MPIs.

mpicc is a compiler wrapper that is required to compile MPI-enabled code. It in turn calls the compiler it was compiled against to generate the actual object code.

Since your mpicc called gcc, I don't believe your environment is set up correctly: either your mpicc symlink is pointing to the wrong place, or your opal_wrapper binary is not the one provided. If operating correctly, it would attempt to run the Intel compiler rather than the GNU compiler (icc, not gcc). Since presumably you don't have icc installed, you should instead see a message like this (from a 7.2-based machine):

quartz187{foraker1}37: mpicc
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler icc in your PATH.
Note that this compiler was either specified at configure time or in one of several possible environment variables.
--------------------------------------------------------------------------

To run the provided binary, you will need to set your LD_LIBRARY_PATH to point at the libraries provided, NOT the ones from the RHEL OpenMPI RPMs:

sh-4.2$ LD_LIBRARY_PATH=/opt/openmpi/1.10/intel/lib:/opt/intel/16.0/compiler/lib/intel64 ldd /opt/openmpi/1.10/intel/bin/mpicc
linux-vdso.so.1 => (0x00002aaaaaaab000)
libopen-pal.so.13 => /opt/openmpi/1.10/intel/lib/libopen-pal.so.13 (0x00002aaaaaaae000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaaadf3000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00002aaaab0f5000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab301000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaab506000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab70e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaab911000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaabb28000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaabd44000)
libimf.so => /opt/intel/16.0/compiler/lib/intel64/libimf.so (0x00002aaaac105000)
libsvml.so => /opt/intel/16.0/compiler/lib/intel64/libsvml.so (0x00002aaaac604000)
libirng.so => /opt/intel/16.0/compiler/lib/intel64/libirng.so (0x00002aaaad510000)
libintlc.so.5 => /opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 (0x00002aaaad882000)
/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

That should produce the segfault.

Created attachment 1203471 [details]
Missing file
This was the missing file from the already attached reproducer tar file.
The steps to reproduce are:
[root@intel-wildcatpass-07 ~]# tar xvf broken-mpicc.tar
[root@intel-wildcatpass-07 ~]# mv libopen-pal.so.13.0.3 opt/openmpi/1.10/intel/lib/
[root@intel-wildcatpass-07 ~]# LD_LIBRARY_PATH=opt/openmpi/1.10/intel/lib:opt/intel/16.0/compiler/lib/intel64 opt/openmpi/1.10/intel/bin/mpicc
Segmentation fault (core dumped)
To verify that it was the same problem I compared it to the previous backtrace:
[root@intel-wildcatpass-07 ~]# gdb opt/openmpi/1.10/intel/bin/mpicc
<snip>
(gdb) set env LD_LIBRARY_PATH opt/openmpi/1.10/intel/lib:opt/intel/16.0/compiler/lib/intel64
(gdb) r
Starting program: /root/opt/openmpi/1.10/intel/bin/mpicc
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff684b65e in ?? ()
(gdb) bt
#0 0x00007ffff684b65e in ?? ()
#1 0x00007ffff7de9675 in elf_machine_rela (reloc=0x7ffff4ddfe38, reloc=0x7ffff4ddfe38, skip_ifunc=<optimized out>,
reloc_addr_arg=0x7ffff50438f0, version=0x0, sym=0x7ffff4ddaf58, map=0x7ffff7fe2af8)
at ../sysdeps/x86_64/dl-machine.h:288
#2 elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>,
relsize=<optimized out>, reladdr=<optimized out>, map=0x7ffff7fe2af8) at do-rel.h:170
#3 _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>,
consider_profiling@entry=0) at dl-reloc.c:259
#4 0x00007ffff7de0792 in dl_main (phdr=<optimized out>, phdr@entry=0x400040, phnum=<optimized out>, phnum@entry=9,
user_entry=user_entry@entry=0x7fffffffe0b8, auxv=<optimized out>) at rtld.c:2192
#5 0x00007ffff7df3e36 in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffe170,
dl_main=dl_main@entry=0x7ffff7dde820 <dl_main>) at ../elf/dl-sysdep.c:244
#6 0x00007ffff7de1a31 in _dl_start_final (arg=0x7fffffffe170) at rtld.c:318
#7 _dl_start (arg=0x7fffffffe170) at rtld.c:544
#8 0x00007ffff7dde1e8 in _start () from /lib64/ld-linux-x86-64.so.2
#9 0x0000000000000001 in ?? ()
#10 0x00007fffffffe40e in ?? ()
#11 0x0000000000000000 in ?? ()
opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 is not linked against libc.so.6, but uses symbols from libc.so.6.

This looks like swbz#20019.

We could perhaps avoid the crash, but the libintlc.so.5 object file is simply invalid. The static linker did not process the undefined libc.so.6 symbol references, and as a result, all these symbol references are unversioned, and will not bind against the correct libc.so.6 symbols.

Potential workaround: Preload libc.so.6 and the offending library. This alters the resolution order and may get things to work:

LD_PRELOAD=/lib64/libc.so.6:/opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 mpicc

However, this does not address the missing symbol versions.

I don't think that LD_PRELOAD is the direction that LLNL wants to take, but it may be an acceptable workaround for the time being. I recommended that they bounce this to their Intel compiler support person and file a bug against the Intel compiler.

I think that to resolve this issue Intel will need to change their build scripts so that when they build libintlc.so.5 they make sure that they link against libc. This should do two things:
1) it will add a DT_NEEDED entry to the ELF file so that ld.so will pull libc in first
2) it will properly version the builtin memmove

Also, I just heard that the LD_PRELOAD trick does not work for LLNL.

Scratch that last comment; the LD_PRELOAD trick does work after all.

(In reply to Florian Weimer from comment #11)
> opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 is not linked against
> libc.so.6, but uses symbols from libc.so.6.
>
> This looks like swbz#20019.
>
> We could perhaps avoid the crash, but the libintlc.so.5 object file is
> simply invalid. The static linker did not process the undefined libc.so.6
> symbol references, and as a result, all these symbol references are
> unversioned, and will not bind against the correct libc.so.6 symbols.

It's not clear to me that libintlc.so.5 is invalid. If the shared object is not built against libc.so.6, then the author of the shared object accepts that all unversioned references will bind to the latest versioned references found first during resolution, similar to dlopen/dlsym.

I updated swbz#20019 to mention that perhaps glibc's dynamic loader should have used the relocation dependencies to correctly sort libc.so.6's initialization first.

I'm playing devil's advocate here, since libintlc.so.5 is a really weird object, and there isn't any serious justification for building it the way it is, but at the same time glibc's dynamic loader has enough information to do better.

This issue appears to only impact ICC 16, which is missing the DT_NEEDED on libc.so.6. The ICC 17 libintlc.so.5 has a DT_NEEDED on libc.so.6, which ensures the right ordering with respect to STT_GNU_IFUNC initialization.

I'm lowering the priority of this issue to medium since the workaround is to upgrade to ICC 17 or preload libc.so.6 (see comment #12).

The correct solution for this issue is as noted in the upstream bug, which is to make symbol dependency sorting a first-class solution and thus have the topological sort of library initialization set up libc.so.6 first. This is not without some perils and needs quite a bit of upstream work first.

After we (HPC-UGent) contacted Intel about this and put some pressure on it, they have issued a support article acknowledging the problem that includes an alternate (more feasible imho) workaround, which is to overwrite the libintlc.so.5 in the Intel v16 installation with a copy from the Intel v17 compilers, see https://software.intel.com/en-us/articles/intel-compiler-version-16-not-compatible-with-recent-libcso6 .
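Whether a given copy of a library carries the DT_NEEDED entry discussed above can be checked from the output of `readelf -d`. The following is a minimal sketch, assuming the usual binutils layout of `(NEEDED)` lines; the helper names and the sample text are made up for illustration and are not output captured from libintlc.so.5.

```python
import re
from typing import List

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libm.so.6]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_dynamic_output: str) -> List[str]:
    """Return the DT_NEEDED names found in `readelf -d` output."""
    return NEEDED_RE.findall(readelf_dynamic_output)


def has_libc_dependency(readelf_dynamic_output: str) -> bool:
    """True if any DT_NEEDED entry names libc.so.<n>."""
    return any(name.startswith("libc.so.")
               for name in needed_libraries(readelf_dynamic_output))


# Illustrative sample only: a library that lists libm but not libc.
SAMPLE = """
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x000000000000000e (SONAME)             Library soname: [libexample.so.1]
"""
```

To use it on a real file, the text passed in would be the output of `readelf -d` run on the library in question, for example captured with `subprocess.run([...], capture_output=True, text=True).stdout`.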
(In reply to Kenneth Hoste from comment #26)
> After we (HPC-UGent) contacted Intel about this and put some pressure on it,
> they have issued a support article acknowledging the problem that includes
> an alternate (more feasible imho) workaround, which is to overwrite the
> libintlc.so.5 in the Intel v16 installation with a copy from the Intel v17
> compilers, see
> https://software.intel.com/en-us/articles/intel-compiler-version-16-not-compatible-with-recent-libcso6 .

Thanks for passing this along.

It turns out that this issue can also appear with completely valid binaries. I have attempted to fix this upstream, but there does not appear to be a convincing general fix, as I explained in this message:

<https://sourceware.org/ml/libc-alpha/2017-01/msg00468.html>

We'll see if anyone else comes up with a better solution with less performance impact.

*** Bug 1410576 has been marked as a duplicate of this bug. ***

With no direct progress on this issue upstream, and with RHEL 7 entering Maintenance Phase 1 at the end of 2019, I'm moving this issue to RHEL 8 for further consideration. It is an interesting issue that should be fixed, but it requires some serious upstream work.

Carlos and I have investigated the current state here.

(a) If a shared object uses glibc string functions, it must have a DT_NEEDED reference on libc.so.6.

(b) With the DT_NEEDED reference, libc.so.6 is always relocated first, before that object, so the relocation dependency on an IFUNC resolver is not a problem.

(c) There are other IFUNC resolvers in glibc (outside libc.so.6) which do not necessarily have this property because they interpose symbols in libc.so.6 (e.g., vfork in libpthread). Their removal is tracked in bug 1748197. (This aspect covers the discussion referenced in comment 27; upstream chose a different resolution, not delayed IFUNC processing.)

We cannot fix binaries which violate (a), so there is nothing to do for this bug here.

We now have the machinery to fix this. Reopening.
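Points (a) and (b) above come down to an ordering constraint: an object must be relocated after every object whose IFUNC resolvers its relocations call. The sketch below is a simplified model of that constraint, not glibc's actual sorting code. The three-object graph, the name `app` and the helper names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]


def dependency_first_order(objects: List[str], edges: Graph) -> List[str]:
    """Depth-first post-order: each object is emitted after the objects
    it points to. Cycles are not handled specially in this sketch."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in edges.get(name, []):
            visit(dep)
        order.append(name)

    for name in objects:
        visit(name)
    return order


def merge(a: Graph, b: Graph) -> Graph:
    """Combine two edge sets, e.g. DT_NEEDED plus relocation dependencies."""
    return {k: a.get(k, []) + b.get(k, []) for k in set(a) | set(b)}


def order_violations(order: List[str], reloc_deps: Graph) -> List[Tuple[str, str]]:
    """List (user, provider) pairs where the user is processed first."""
    pos = {name: i for i, name in enumerate(order)}
    return [(user, provider)
            for user, providers in reloc_deps.items()
            for provider in providers
            if pos[user] < pos[provider]]


# Hypothetical graph: the library has no DT_NEEDED on libc.so.6 ...
needed = {"app": ["libc.so.6", "libintlc.so.5"]}
# ... but its relocations refer to memmove, which libc.so.6 provides.
reloc_deps = {"libintlc.so.5": ["libc.so.6"]}
seed = ["libintlc.so.5", "libc.so.6", "app"]

without_edge = dependency_first_order(seed, needed)
with_edge = dependency_first_order(seed, merge(needed, reloc_deps))
```

In this model nothing in the DT_NEEDED edges alone stops the library from being processed before libc.so.6, so `order_violations(without_edge, reloc_deps)` reports the pair. Once the dependency is present as an edge, whether through a DT_NEEDED entry or through relocation-dependency sorting, libc.so.6 comes first and the list of violations is empty.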
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug. |