Bug 1377895

Summary: glibc: [LLNL 7.4 Bug] Relocation dependency on symbol, but no DT_NEEDED, causes incorrect startup sequence.
Product: Red Hat Enterprise Linux 8
Reporter: Ben Woodard <woodard>
Component: glibc
Assignee: glibc team <glibc-bugzilla>
Status: CLOSED MIGRATED
QA Contact: qe-baseos-tools-bugs
Severity: high
Priority: medium
Version: 8.2
CC: ashankar, balay, bloch, codonell, dj, foraker1, fweimer, kenneth.hoste, mnewsome, pasteur, pfrankli, tdhooge, tgummels, woodard
Target Milestone: rc
Keywords: MigratedToJIRA, Reopened, Triaged
Target Release: 8.0
Hardware: x86_64
OS: Linux
Last Closed: 2023-11-28 15:49:44 UTC
Type: Bug
Bug Blocks: 1599298    
Attachments:
  Description: mpicc and associated openmpi/icc libraries that exhibit the issue (Flags: none)
  Description: Missing file (Flags: none)

Description Ben Woodard 2016-09-20 23:12:31 UTC
Description of problem:
The mpicc binary from our local OpenMPI installation segfaults when run if OpenMPI was compiled with ICC 16.0 (but not with ICC 15.0). This appears to be an ABI break between the glibc in RHEL 7.2 and the one in RHEL 7.3: when we revert glibc to the 7.2 version, the problem goes away.

15:22 foraker: #2  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>,  nrelative=<optimized out>, relsize=<optimized out>,  reladdr=<optimized out>, map=0x2aaaaab0a548) at do-rel.h:170
15:22 foraker: #3  _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>,  consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:259
15:22 foraker: #4  0x00002aaaaaaae792 in dl_main (phdr=<optimized out>, phdr@entry=0x400040,  phnum=<optimized out>, phnum@entry=9,  user_entry=user_entry@entry=0x7fffffffd428, auxv=<optimized out>) at rtld.c:2192
15:22 foraker: #5  0x00002aaaaaac1e36 in _dl_sysdep_start ( start_argptr=start_argptr@entry=0x7fffffffd4e0,  dl_main=dl_main@entry=0x2aaaaaaac820 <dl_main>) at ../elf/dl-sysdep.c:244
15:22 foraker: #6  0x00002aaaaaaafa31 in _dl_start_final (arg=0x7fffffffd4e0) at rtld.c:318
15:22 foraker: #7  _dl_start (arg=0x7fffffffd4e0) at rtld.c:544
15:22 foraker: #8  0x00002aaaaaaac1e8 in _start () from /lib64/ld-linux-x86-64.so.2
15:22 foraker: #9  0x0000000000000001 in ?? ()
15:22 foraker: #10 0x00007fffffffd8e7 in ?? ()
15:22 foraker: #11 0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
glibc-2.17-157.el7.x86_64

How reproducible:

Steps to Reproduce:
15:02 foraker: quartz2{foraker1}:module load intel openmpi-intel/1.10
15:02 foraker: quartz2{foraker1}:mpicc
15:02 foraker: Segmentation fault

Additional info:
Carlos asked me to file a bug:
15:50 codonell: neb, File a bug please. Include /proc/cpuinfo please.
15:51 codonell: neb, And include exactly what the crash looks like.
15:51 codonell: neb, e.g. SIGILL, SIGSEGV...
15:51 codonell: neb, etc. etc.
15:51 neb: file a bug?
15:51 codonell: neb, We've made some changes in rhel-7.3 for Intel Purley hardware so this area has new code.
15:51 codonell: neb, Yes please.
15:52 neb: Like what is the proximate cause, I'm still trying to make heads or tails of it.
15:52 neb: can you give me a hand-wavy explanation of what may be going on?
15:53 neb: could it be fixed with a recompile?
15:53 codonell: neb, No ABI should be broken.
15:53 codonell: neb, So it's not about recompiling.
15:54 codonell: neb, It's about the hardware you're running on.
15:54 codonell: neb, The particular line you quote is checking to see if AVX512F is usable, but that should be a quick look into a feature table.
15:54 codonell: neb, Nothing should ever crash there.

Comment 1 Ben Woodard 2016-09-20 23:16:27 UTC
16:14 foraker: processor       : 0
16:14 foraker: vendor_id       : GenuineIntel
16:14 foraker: cpu family      : 6
16:14 foraker: model           : 79
16:14 foraker: model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
16:14 foraker: stepping        : 1
16:14 foraker: microcode       : 0xb00001d
16:14 foraker: cpu MHz         : 2101.000
16:14 foraker: cache size      : 46080 KB
16:14 foraker: physical id     : 0
16:14 foraker: siblings        : 36
16:14 foraker: core id         : 0
16:14 foraker: cpu cores       : 18
16:14 foraker: apicid          : 0
16:14 foraker: initial apicid  : 0
16:14 foraker: fpu             : yes
16:14 foraker: fpu_exception   : yes
16:14 foraker: cpuid level     : 20
16:14 foraker: wp              : yes
16:14 foraker: flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
16:15 foraker: bogomips        : 4190.23
16:15 foraker: clflush size    : 64
16:15 foraker: cache_alignment : 64
16:15 foraker: address sizes   : 46 bits physical, 48 bits virtual
16:15 foraker: power management:

Comment 3 Carlos O'Donell 2016-09-21 02:09:05 UTC
Some missing context from our conversation is here:

15:22 foraker: #0  0x00002aaaabfb065e in __libc_memmove () at ../sysdeps/x86_64/multiarch/memmove.c:52
15:22 foraker: #1  0x00002aaaaaab7675 in elf_machine_rela (reloc=0x2aaaada69e38,  reloc=0x2aaaada69e38, skip_ifunc=<optimized out>,  reloc_addr_arg=0x2aaaadccd8f0, version=0x0, sym=0x2aaaada64f58,  map=0x2aaaaab0a548) at ../sysdeps/x86_64/dl-machine.h:288
15:22 foraker: #2  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>,  nrelative=<optimized out>, relsize=<optimized out>,  reladdr=<optimized out>, map=0x2aaaaab0a548) at do-rel.h:170
...

This shows that the SIGSEGV occurs while running the IFUNC resolver for memmove, which is odd. On x86_64 the IFUNC resolver should simply look up entries in the runtime list of CPU-specific features (`GLRO(dl_x86_cpu_features)`) to determine which routine to pick based on the hardware.
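
As a rough illustration of what such a resolver does, here is a minimal Python sketch (not glibc code). The variant names and their priority order are hypothetical; only the idea of picking a routine from a table of CPU features comes from the description above. The feature set is a subset of the flags reported in comment 1.

```python
# Toy model of an IFUNC-style resolver: choose an implementation from a
# table of CPU features. Variant names and priority order are hypothetical,
# not the ones glibc actually uses.

CPU_FEATURES = {"sse2", "ssse3", "avx", "avx2"}  # subset of flags from comment 1

# (required feature, hypothetical implementation name), most preferred first
MEMMOVE_VARIANTS = [
    ("avx512f", "memmove_avx512"),
    ("avx", "memmove_avx"),
    ("ssse3", "memmove_ssse3"),
]
BASELINE = "memmove_sse2"


def resolve_memmove(features):
    """Return the first variant whose required feature is available."""
    for required, name in MEMMOVE_VARIANTS:
        if required in features:
            return name
    return BASELINE


print(resolve_memmove(CPU_FEATURES))
```

The point of the sketch is only that the lookup is a cheap table query, which is why a crash at this step was unexpected.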

We will need to reproduce this locally if we are going to make any informed comment about the icc-compiled binary.

Are you able to get us a copy of the binary and associated libraries so we can run this in rhel-7 and debug the failure?

The only theory I have is that your libc.so.6 and ld.so are out of sync with each other, since their definition of the `GLRO(dl_x86_cpu_features)` structure is shared via a private interface, i.e. __get_cpu_features@@GLIBC_PRIVATE. This might result in libc.so.6 indexing outside the bounds of the structure returned by ld.so. So if I had to look for something right now, it would be to verify that the target environment is correctly configured, i.e. that /lib64/ld* and /lib64/libc.so.6 match.

Comment 4 Jim Foraker 2016-09-21 02:38:11 UTC
Created attachment 1203100
mpicc and associated openmpi/icc libraries that exhibit the issue

Comment 5 Jim Foraker 2016-09-21 02:39:43 UTC
libc.so.6 and ld.so appear to match:

quartz2{foraker1}27: rpm -qf /lib64/libc.so.6 /lib64/ld-2.17.so
glibc-2.17-157.el7.x86_64
glibc-2.17-157.el7.x86_64
quartz2{foraker1}28: ls -l /lib64/ld-linux-x86-64.so.2 
lrwxrwxrwx. 1 root root 10 Sep  1 17:49 /lib64/ld-linux-x86-64.so.2 -> ld-2.17.so

Comment 6 Florian Weimer 2016-09-21 12:59:35 UTC
(In reply to Jim Foraker from comment #4)
> Created attachment 1203100 [details]
> mpicc and associated openmpi/icc libraries that exhibit the issue

The mpicc binary does not appear to be included in the tarball.

Comment 7 Jim Foraker 2016-09-21 15:25:35 UTC
mpicc is in practice a symlink to opal_wrapper:

$ tar ztvf broken-mpicc.tar.gz | head -2
lrwxrwxrwx root/root         0 2016-09-01 18:04 opt/openmpi/1.10/intel/bin/mpicc -> opal_wrapper
-rwxr-xr-x root/root    150599 2016-06-28 14:13 opt/openmpi/1.10/intel/bin/opal_wrapper

This is standard behavior; many of OpenMPI's commands are symlinks to a small handful of binaries.

Comment 8 Florian Weimer 2016-09-21 16:06:26 UTC
(In reply to Jim Foraker from comment #7)
> mpicc is in practice a symlink to opal_wrapper:
> 
> $ tar ztvf broken-mpicc.tar.gz | head -2
> lrwxrwxrwx root/root         0 2016-09-01 18:04
> opt/openmpi/1.10/intel/bin/mpicc -> opal_wrapper
> -rwxr-xr-x root/root    150599 2016-06-28 14:13
> opt/openmpi/1.10/intel/bin/opal_wrapper
> 
> This is standard behavior; many of OpenMPI's commands are symlinks to a
> small handful of binaries.

Ah, I had missed that.

I have installed these packages:

openmpi-1.10.3-3.el7.x86_64
openmpi-devel-1.10.3-3.el7.x86_64
glibc-2.17-157.el7.x86_64

And still cannot reproduce the issue.  All I get is this:

$ LD_LIBRARY_PATH=/usr/lib64/openmpi/lib /opt/openmpi/1.10/intel/bin/mpicc
gcc: fatal error: no input files
compilation terminated.

I don't know anything about mpicc or OpenMPI, so I'd appreciate precise reproduction instructions.

Comment 9 Jim Foraker 2016-09-21 16:41:52 UTC
We are not using the RHEL-supplied OpenMPI RPMs.  We compile our own MPIs.  mpicc is a compiler wrapper that is required to compile MPI-enabled code.  It calls the compiler it was compiled against in turn to generate the actual object code.

Since your mpicc called gcc, I don't believe your environment is set up correctly; either your mpicc symlink is pointing to the wrong place, or your opal_wrapper binary is not the one provided; if operating correctly, it would attempt to run the Intel not GNU compiler (icc not gcc).  Since presumably you don't have icc installed, you should instead see a message like this (from a 7.2-based machine):

quartz187{foraker1}37: mpicc
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
icc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------

To run the provided binary, you will need to set your LD_LIBRARY_PATH to point at the libraries provided, NOT the ones out of the RHEL OpenMPI RPMs:

sh-4.2$ LD_LIBRARY_PATH=/opt/openmpi/1.10/intel/lib:/opt/intel/16.0/compiler/lib/intel64 ldd /opt/openmpi/1.10/intel/bin/mpicc
	linux-vdso.so.1 =>  (0x00002aaaaaaab000)
	libopen-pal.so.13 => /opt/openmpi/1.10/intel/lib/libopen-pal.so.13 (0x00002aaaaaaae000)
	libm.so.6 => /lib64/libm.so.6 (0x00002aaaaadf3000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00002aaaab0f5000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab301000)
	librt.so.1 => /lib64/librt.so.1 (0x00002aaaab506000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab70e000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaab911000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaabb28000)
	libc.so.6 => /lib64/libc.so.6 (0x00002aaaabd44000)
	libimf.so => /opt/intel/16.0/compiler/lib/intel64/libimf.so (0x00002aaaac105000)
	libsvml.so => /opt/intel/16.0/compiler/lib/intel64/libsvml.so (0x00002aaaac604000)
	libirng.so => /opt/intel/16.0/compiler/lib/intel64/libirng.so (0x00002aaaad510000)
	libintlc.so.5 => /opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 (0x00002aaaad882000)
	/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

That should produce the segfault.

Comment 10 Ben Woodard 2016-09-21 20:08:26 UTC
Created attachment 1203471
Missing file

This was the missing file from the already attached reproducer tar file.

The steps to reproduce are:
[root@intel-wildcatpass-07 ~]# tar xvf broken-mpicc.tar
[root@intel-wildcatpass-07 ~]# mv libopen-pal.so.13.0.3 opt/openmpi/1.10/intel/lib/
[root@intel-wildcatpass-07 ~]# LD_LIBRARY_PATH=opt/openmpi/1.10/intel/lib:opt/intel/16.0/compiler/lib/intel64 opt/openmpi/1.10/intel/bin/mpicc
Segmentation fault (core dumped)

To verify that it was the same problem I compared it to the previous backtrace:
[root@intel-wildcatpass-07 ~]# gdb opt/openmpi/1.10/intel/bin/mpicc
<snip>
(gdb) set env LD_LIBRARY_PATH opt/openmpi/1.10/intel/lib:opt/intel/16.0/compiler/lib/intel64 
(gdb) r
Starting program: /root/opt/openmpi/1.10/intel/bin/mpicc 

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff684b65e in ?? ()
(gdb) bt
#0  0x00007ffff684b65e in ?? ()
#1  0x00007ffff7de9675 in elf_machine_rela (reloc=0x7ffff4ddfe38, reloc=0x7ffff4ddfe38, skip_ifunc=<optimized out>, 
    reloc_addr_arg=0x7ffff50438f0, version=0x0, sym=0x7ffff4ddaf58, map=0x7ffff7fe2af8)
    at ../sysdeps/x86_64/dl-machine.h:288
#2  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, 
    relsize=<optimized out>, reladdr=<optimized out>, map=0x7ffff7fe2af8) at do-rel.h:170
#3  _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, 
    consider_profiling@entry=0) at dl-reloc.c:259
#4  0x00007ffff7de0792 in dl_main (phdr=<optimized out>, phdr@entry=0x400040, phnum=<optimized out>, phnum@entry=9, 
    user_entry=user_entry@entry=0x7fffffffe0b8, auxv=<optimized out>) at rtld.c:2192
#5  0x00007ffff7df3e36 in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffe170, 
    dl_main=dl_main@entry=0x7ffff7dde820 <dl_main>) at ../elf/dl-sysdep.c:244
#6  0x00007ffff7de1a31 in _dl_start_final (arg=0x7fffffffe170) at rtld.c:318
#7  _dl_start (arg=0x7fffffffe170) at rtld.c:544
#8  0x00007ffff7dde1e8 in _start () from /lib64/ld-linux-x86-64.so.2
#9  0x0000000000000001 in ?? ()
#10 0x00007fffffffe40e in ?? ()
#11 0x0000000000000000 in ?? ()

Comment 11 Florian Weimer 2016-09-21 20:31:11 UTC
opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 is not linked against libc.so.6, but uses symbols from libc.so.6.

This looks like swbz#20019.

We could perhaps avoid the crash, but the libintlc.so.5 object file is simply invalid.  The static linker did not process the undefined libc.so.6 symbol references, and as a result, all these symbol references are unversioned, and will not bind against the correct libc.so.6 symbols.
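
A minimal sketch of how one might check for this condition, assuming the text output of `readelf -d` is available. The sample excerpts below are made up for illustration (including the `libexample.so.1` soname); they are not the dynamic sections of the attached libraries.

```python
import re

# Illustrative `readelf -d` excerpts; made-up samples, not taken from the
# libraries attached to this bug.
SAMPLE_WITH_LIBC = """\
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libexample.so.1]
"""
SAMPLE_WITHOUT_LIBC = """\
 0x000000000000000e (SONAME)             Library soname: [libexample.so.1]
 0x000000000000000c (INIT)               0x1000
"""

NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Extract the DT_NEEDED sonames from `readelf -d` text output."""
    return NEEDED_RE.findall(readelf_output)


def links_against_libc(readelf_output):
    """True if libc.so.6 appears among the DT_NEEDED entries."""
    return "libc.so.6" in needed_libraries(readelf_output)


print(links_against_libc(SAMPLE_WITH_LIBC))     # True
print(links_against_libc(SAMPLE_WITHOUT_LIBC))  # False
```

On a real system the input would come from running `readelf -d` on the shared object in question.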

Comment 12 Florian Weimer 2016-09-21 20:48:40 UTC
Potential workaround: Preload libc.so.6 and the offending library.  This alters the resolution order and may get things to work:

LD_PRELOAD=/lib64/libc.so.6:/opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 mpicc

However, this does not address the missing symbol versions.

Comment 13 Ben Woodard 2016-09-21 21:09:13 UTC
I don't think that LD_PRELOAD is the direction LLNL wants to take, but it may be an acceptable workaround for the time being.

I recommended that they bounce this to their Intel compiler support person and file a bug against the Intel compiler. I think that to resolve this issue Intel will need to change their build scripts so that libintlc.so.5 is linked against libc when it is built.
This should do two things:
1) it will add a DT_NEEDED entry to the ELF file so that ld.so will pull in libc first
2) it will properly version the builtin memmove

Also I just heard that the LD_PRELOAD trick does not work for LLNL.

Comment 14 Ben Woodard 2016-09-21 21:10:29 UTC
Scratch that last comment: the LD_PRELOAD trick does work after all.

Comment 16 Carlos O'Donell 2016-09-22 02:24:28 UTC
(In reply to Florian Weimer from comment #11)
> opt/intel/16.0/compiler/lib/intel64/libintlc.so.5 is not linked against
> libc.so.6, but uses symbols from libc.so.6.
> 
> This looks like swbz#20019.
> 
> We could perhaps avoid the crash, but the libintlc.so.5 object file is
> simply invalid.  The static linker did not process the undefined libc.so.6
> symbol references, and as a result, all these symbol references are
> unversioned, and will not bind against the correct libc.so.6 symbols.

It's not clear to me that libintlc.so.5 is invalid. If the shared object is not built against libc.so.6, then the author of the shared object accepts that all unversioned references will bind to the latest versioned references found first during resolution, similar to dlopen/dlsym.

I updated swbz#20019 to mention that perhaps glibc's dynamic loader should have used the relocation dependencies to correctly sort libc.so.6's initialization first.

I'm playing devil's advocate here, since libintlc.so.5 is a really weird object, and there isn't any serious justification for building it like it is, but at the same time glibc's dynamic loader has enough information to do better.

Comment 23 Carlos O'Donell 2016-10-25 15:18:05 UTC
This issue appears to impact only ICC 16, whose libintlc.so.5 is missing the DT_NEEDED on libc.so.6. The ICC 17 libintlc.so.5 has a DT_NEEDED on libc.so.6, which ensures the right ordering with respect to STT_GNU_IFUNC initialization. I'm lowering the priority of this issue to medium since the workaround is to upgrade to ICC 17 or preload libc.so.6 (see comment #12).

Comment 24 Carlos O'Donell 2016-10-25 15:19:29 UTC
The correct solution for this issue is as noted in the upstream bug: make symbol dependency sorting a first-class part of the loader, so that the topological sort of library initialization sets up libc.so.6 first. This is not without some perils and needs quite a bit of upstream work first.
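
The idea can be sketched as follows, assuming symbol (relocation) dependencies are simply added as extra edges next to the DT_NEEDED edges before sorting. This is a Python illustration using the standard `graphlib` module, not glibc's actual sorting code; the dependency graph for mpicc is a simplified, hypothetical one.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def relocation_order(needed, symbol_deps):
    """Order objects so that every dependency precedes its dependents.

    needed:      {object: set of DT_NEEDED sonames}
    symbol_deps: {object: set of objects whose symbols it references
                  without having a DT_NEEDED entry for them}
    """
    graph = {}
    for obj in set(needed) | set(symbol_deps):
        graph[obj] = set(needed.get(obj, ())) | set(symbol_deps.get(obj, ()))
    # TopologicalSorter takes {node: predecessors}; static_order() yields
    # predecessors before the nodes that depend on them.
    return list(TopologicalSorter(graph).static_order())


# Simplified, hypothetical graph: libintlc.so.5 has no DT_NEEDED on
# libc.so.6 but references its symbols.
needed = {
    "mpicc": {"libopen-pal.so.13", "libc.so.6", "libintlc.so.5"},
    "libopen-pal.so.13": {"libc.so.6"},
    "libintlc.so.5": set(),
}
symbol_deps = {"libintlc.so.5": {"libc.so.6"}}

order = relocation_order(needed, symbol_deps)
print(order.index("libc.so.6") < order.index("libintlc.so.5"))  # True
```

Real dependency graphs can contain cycles, which this sketch does not handle (`graphlib` raises `CycleError` in that case).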

Comment 26 Kenneth Hoste 2017-01-19 11:11:25 UTC
After we (HPC-UGent) contacted Intel about this and put some pressure on it, they have issued a support article acknowledging the problem that includes an alternate (more feasible imho) workaround, which is to overwrite the libintlc.so.5 in the Intel v16 installation with a copy from the Intel v17 compilers, see https://software.intel.com/en-us/articles/intel-compiler-version-16-not-compatible-with-recent-libcso6 .

Comment 27 Florian Weimer 2017-01-25 14:09:46 UTC
(In reply to Kenneth Hoste from comment #26)
> After we (HPC-UGent) contacted Intel about this and put some pressure on it,
> they have issued a support article acknowledging the problem that includes
> an alternate (more feasible imho) workaround, which is to overwrite the
> libintlc.so.5 in the Intel v16 installation with a copy from the Intel v17
> compilers, see
> https://software.intel.com/en-us/articles/intel-compiler-version-16-not-compatible-with-recent-libcso6 .

Thanks for passing this along.

It turns out that the issue can also appear with completely valid binaries.  I have attempted to fix this upstream, but there does not appear to be a convincing general fix, as I explained in this message:

  <https://sourceware.org/ml/libc-alpha/2017-01/msg00468.html>

We'll see if anyone else comes up with a better solution with less performance impact.

Comment 28 Florian Weimer 2017-01-25 14:13:26 UTC
*** Bug 1410576 has been marked as a duplicate of this bug. ***

Comment 31 Carlos O'Donell 2019-06-07 03:36:42 UTC
With no direct progress on this issue in upstream, and with RHEL 7 entering Maintenance Phase 1 at the end of 2019, I'm moving this issue to RHEL 8 for further consideration. It is an interesting issue that should be fixed, but requires some serious upstream work.

Comment 32 Florian Weimer 2020-03-02 15:49:00 UTC
Carlos and I have investigated the current state here.

(a) If a shared object uses glibc string functions, it must have a DT_NEEDED reference on libc.so.6.

(b) With the DT_NEEDED reference, libc.so.6 is always relocated first, before that object, so the relocation dependency on an IFUNC resolver is not a problem.

(c) There are other IFUNC resolvers in glibc (outside libc.so.6) which do not necessarily have this property because they interpose symbols in libc.so.6 (e.g., vfork in libpthread).  Their removal is tracked in bug 1748197. (This aspect covers the discussion referenced in comment 27; upstream chose a different resolution, not delayed IFUNC processing.)

We cannot fix binaries which violate (a), so there is nothing to do for this bug here.
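
Condition (a) can be expressed as a simple check. The following Python sketch assumes the DT_NEEDED list and the undefined dynamic symbols of an object have already been extracted by some other means; the list of string functions is a hypothetical, incomplete sample.

```python
# Hypothetical, incomplete sample of glibc string functions.
STRING_FUNCTIONS = {"memcpy", "memmove", "memset", "strlen", "strcmp"}


def violates_rule_a(needed, undefined_symbols):
    """True if the object references glibc string functions but has no
    DT_NEEDED entry for libc.so.6, i.e. it violates condition (a)."""
    uses_string_functions = bool(STRING_FUNCTIONS & set(undefined_symbols))
    return uses_string_functions and "libc.so.6" not in needed


# Modelled on the description of the ICC 16 libintlc.so.5 in this bug;
# the exact symbol list is an assumption.
print(violates_rule_a(needed=[], undefined_symbols=["memmove"]))             # True
print(violates_rule_a(needed=["libc.so.6"], undefined_symbols=["memmove"]))  # False
```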

Comment 33 Florian Weimer 2023-11-24 12:38:41 UTC
We now have the machinery to fix this. Reopening.

Comment 36 RHEL Program Management 2023-11-24 12:40:17 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.