Bug 1499012 - glibc: Restore compatibility with compilers which use vector registers for argument passing against x86-64 ABI
Status: CLOSED DUPLICATE of bug 1504969
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glibc   
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: glibc team
QA Contact: qe-baseos-tools
URL:
Whiteboard:
Keywords:
Depends On: 1504969
Blocks:
 
Reported: 2017-10-05 19:30 UTC by Deepu K S
Modified: 2017-12-21 19:45 UTC

Fixed In Version: glibc-2.17-217.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-03 08:24:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
'tinyfail.tgz', a small piece of code (attached) that illustrates the problem. (626 bytes, application/x-gzip)
2017-10-05 19:34 UTC, Deepu K S
compiled binaries (5.83 MB, application/x-gzip)
2017-10-09 09:16 UTC, Deepu K S


External Trackers
Tracker ID Priority Status Summary Last Updated
Sourceware 21236 None None None 2017-10-05 19:40 UTC
Sourceware 21265 None None None 2017-11-03 08:24 UTC
Red Hat Bugzilla 1377895 None None None Never
Red Hat Bugzilla 1421155 None None None Never
Red Hat Bugzilla 1527904 None None None Never
Red Hat Bugzilla 1528418 None None None Never

Internal Trackers: 1377895 1421155 1527904 1528418

Description Deepu K S 2017-10-05 19:30:13 UTC
Description of problem:
We have some problems running our software on the newest Red Hat Enterprise Linux version, 7.4.
Our software consists of a rather large number of executables and dynamic libraries, compiled mostly with the Intel icpc and Intel ifort compilers.
The problems usually manifest themselves as NaN (not-a-number) values appearing in various branches of the code.
The latest example fails when built and run on 7.4. However, that same executable (built on 7.4) runs fine on 7.3.

Please see the attached small reproducer 'tinyfail.tgz'.

The issue seems to be discussed in upstream BZ #21236 and Red Hat Bugzilla #1421155
Bug 21236 - NaN generation by optimized math functions
https://sourceware.org/bugzilla/show_bug.cgi?id=21236

The upstream conclusion is that ICC violates the x86-64 psABI and that there is no fault in glibc. As a workaround, it is suggested to set the environment variable LD_BIND_NOW=1.

However, the workaround does not seem to work for all applications.
It mostly works for our software, but it does not help with the small reproducer attached to this request. It also makes it impossible to use our version of Python, which either crashes or exits with "unresolved symbol" messages when run with LD_BIND_NOW=1 in the environment (we build and distribute Python ourselves, and it is built with the Intel compilers).
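To try the workaround on one program without exporting the variable for the whole session, something like the following sketch could be used. The helper name run_with_bind_now is made up for illustration, and the child command here is only a stand-in that echoes the variable; in practice it would be replaced with the real executable (for example ./DynamicSF).

```python
import os
import subprocess
import sys


def run_with_bind_now(cmd, enabled=True):
    """Run cmd with LD_BIND_NOW=1 set (or removed) for that process only.

    Hypothetical helper: the parent environment is left untouched.
    """
    env = dict(os.environ)
    if enabled:
        env["LD_BIND_NOW"] = "1"
    else:
        env.pop("LD_BIND_NOW", None)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process that prints what it sees in its environment.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('LD_BIND_NOW', 'unset'))",
    ]
    print("with workaround:   ", run_with_bind_now(child).stdout.strip())
    print("without workaround:", run_with_bind_now(child, enabled=False).stdout.strip())
```

Running the same binary both ways makes it easy to compare lazy binding against immediate binding for a single executable.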

Existing analysis from Intel shows that, for the small reproducer, the code lands in different routines on RHEL 7.3 and 7.4:
0x00007ffff7df0984 in _dl_runtime_resolve_avx_slow () from /lib64/ld-linux-x86-64.so.2  (rh7.4)
0x00007ffff7df2070 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2 (rh7.3)

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 7.4
glibc-2.17-196.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
Reproducer 'tinyfail.tgz' attached to the bug.

Actual results:
Software giving NaNs unexpectedly.

Expected results:
No NaN errors.

Additional info:

Comment 2 Deepu K S 2017-10-05 19:34 UTC
Created attachment 1334984
'tinyfail.tgz', a small piece of code (attached) that illustrates the problem.

The shell script makeSymmetryFinder, which can be found inside the archive I just attached (and may need some editing in its first line to set the location of the Intel compiler), creates two executables out of the two C++ files:

./DynamicSF, where the object files are linked dynamically, and ./StaticSF, where the same object files are linked statically. The compilation options in the script replicate the options used in our production build.

In our tests we have done the following:
1) Compiled the code with this Intel compiler version:
>/opt/intel/compilers_and_libraries/linux/bin/intel64/icpc -v
icpc version 16.0.1 (gcc version 4.4.7 compatibility)
on the RHEL 6.5 machine:

>uname -a
Linux ms-rh6-cam2 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
>cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
>cat /proc/cpuinfo
..
model name      : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

2) Ran the executables on the RHEL 7.4 machine:
>cat /proc/cpuinfo
..
model name      : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

>cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

./StaticSF runs as expected and produces
##  6.516050 <<<
while
./DynamicSF fails on RHEL 7.4 (however, it works correctly on any other Linux OS, including earlier RHEL versions) and produces
##  -nan <<<

3) The failure can be eliminated if any of the following compilation options is chosen:
a) optimization: -O1 instead of -O2
b) floating-point model: set to, e.g., double instead of the default fast (-fp-model double)
c) target architecture: the options passed to -ax include at least one of CORE-AVX-I, SSE4.2, or AVX

Comment 5 Carlos O'Donell 2017-10-05 20:35:59 UTC
Could you please compile tinyfail.tgz sources and provide the compiled binaries?

Could the customer please try running with LD_BIND_NOW=1?

Comment 6 Deepu K S 2017-10-09 09:16 UTC
Created attachment 1336261
compiled binaries

When you unpack the files on an old machine, you can do the following in the directory containing the files to demonstrate that the dynamically linked executable built with Intel 16 produces NaN, and crashes when LD_BIND_NOW is set.

Yours

arts-rhel74> ll
total 18088
-rwxr--r-- 1 vmn1 users       83 Sep 28 14:48 a.cpp*
-rwxr--r-- 1 vmn1 users    10952 Sep 29 13:02 a.o*
-rwxr--r-- 1 vmn1 users    31800 Sep 29 13:02 DynamicSF*
-rwxr--r-- 1 vmn1 users  3390120 Oct  5 11:22 libimf.so*
-rwxr--r-- 1 vmn1 users   468217 Oct  5 11:22 libintlc.so.5*
-rwxr--r-- 1 vmn1 users    36375 Oct  5 11:22 libirng.so*
-rwxr--r-- 1 vmn1 users 14331673 Oct  5 11:22 libsvml.so*
-rwxr--r-- 1 vmn1 users    36232 Sep 29 13:02 libSymmetryFinder.so*
-rwxr--r-- 1 vmn1 users      480 Sep 29 13:02 makeSymmetryFinder*
-rwxr--r-- 1 vmn1 users    58808 Sep 29 13:02 StaticSF*
-rwxr--r-- 1 vmn1 users      319 Sep 28 15:41 symm_cell.cpp*
-rwxr--r-- 1 vmn1 users    32216 Sep 29 13:02 symm_cell.o*
arts-rhel74> setenv LD_LIBRARY_PATH .
arts-rhel74> DynamicSF
##  -nan <<<
arts-rhel74> StaticSF
##  6.516050 <<<
arts-rhel74> setenv LD_BIND_NOW 1
arts-rhel74> DynamicSF
Segmentation fault (core dumped)
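The three outcomes seen in this transcript (a correct value, a NaN, and a crash) could be told apart automatically with a small sketch like the one below. The function name classify_run and the result labels are made up for illustration. It assumes the program is started through Python's subprocess module, where a process killed by a signal is reported with a negative return code.

```python
import math
import signal


def classify_run(returncode, stdout):
    """Label one run as 'ok', 'nan', or 'killed by <SIGNAL>'.

    Hypothetical helper: returncode follows the subprocess convention,
    where a negative value means the process died from that signal.
    """
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    for token in stdout.split():
        try:
            value = float(token)  # accepts "nan" and "-nan" as well
        except ValueError:
            continue  # decoration such as "##" or "<<<"
        if math.isnan(value):
            return "nan"
    return "ok"


if __name__ == "__main__":
    # Output lines copied from the transcript above.
    print(classify_run(0, "##  -nan <<<"))
    print(classify_run(0, "##  6.516050 <<<"))
    print(classify_run(-signal.SIGSEGV, ""))
```

Parsing each token as a float, rather than searching for the substring "nan", avoids false matches inside unrelated words.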

Comment 7 Florian Weimer 2017-10-09 09:57:08 UTC
The problem here is that libintlc.so.5 is linked incorrectly because it references libc.so.6 symbols, but is not linked against it:

$ eu-readelf -d libintlc.so.5
Dynamic segment contains 21 entries:
 Addr: 0x0000000000269588  Offset: 0x069588  Link to section: [ 3] '.dynstr'
  Type              Value
  SONAME            Library soname: [libintlc.so.5]
  INIT              0x0000000000007058
  FINI              0x000000000005f498
  HASH              0x0000000000000158
  STRTAB            0x0000000000003450
  SYMTAB            0x0000000000000c30
  STRSZ             6604 (bytes)
  SYMENT            24 (bytes)
  PLTGOT            0x0000000000269798
  PLTRELSZ          5160 (bytes)
  PLTREL            RELA
  JMPREL            0x0000000000005c30
  RELA              0x0000000000004e20
  RELASZ            3600 (bytes)
  RELAENT           24 (bytes)
  RELACOUNT         137
  NULL              
…
$ eu-readelf --symbols=.dynsym libintlc.so.5 | grep UNDEF | head
    0: 0000000000000000      0 NOTYPE  LOCAL  DEFAULT    UNDEF 
   12: 0000000000000000      0 FUNC    WEAK   DEFAULT    UNDEF _Unwind_GetIP
   15: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF __strcpy_chk
   30: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF __xpg_basename
   31: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF memmove
   33: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF snprintf
   45: 0000000000000000      0 FUNC    WEAK   DEFAULT    UNDEF _Unwind_GetRegionStart
   48: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF getenv
   59: 0000000000000000      0 FUNC    WEAK   DEFAULT    UNDEF _Unwind_Backtrace
   66: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF getpid

See bug 1377895.  This is a different Intel toolchain bug.  I believe Intel has released a compiler update which resolves this issue.  Switching to the fixed libintlc.so.5 library will resolve the crash without having to recompile any applications.
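The check described above (undefined symbols but no dependency on the library that provides them) can be roughly automated by scanning saved eu-readelf output. This is only a sketch: the function names are made up, the symbol-line layout is taken from the listing above, and the NEEDED line shape is an assumption, since this library has no such entries to show. It does not determine which library should provide each symbol.

```python
import re

# Assumed shape of a NEEDED entry in `eu-readelf -d` output; the dump in
# this report contains none, so this pattern is a guess at the format.
NEEDED_RE = re.compile(r"^\s*NEEDED\s+Shared library:\s*\[([^\]]+)\]", re.MULTILINE)


def needed_libraries(dynamic_dump):
    """Return the library names listed as NEEDED in the dynamic segment dump."""
    return NEEDED_RE.findall(dynamic_dump)


def undefined_globals(symbol_dump):
    """Return names of GLOBAL symbols whose section index is UNDEF."""
    names = []
    for line in symbol_dump.splitlines():
        fields = line.split()
        # Columns: Num: Value Size Type Bind Vis Ndx Name
        if len(fields) >= 8 and fields[4] == "GLOBAL" and fields[6] == "UNDEF":
            names.append(fields[7])
    return names


# Excerpts copied from the output above.
DYNAMIC_SAMPLE = """\
  Type              Value
  SONAME            Library soname: [libintlc.so.5]
  INIT              0x0000000000007058
  FINI              0x000000000005f498
"""

SYMBOL_SAMPLE = """\
    0: 0000000000000000      0 NOTYPE  LOCAL  DEFAULT    UNDEF
   12: 0000000000000000      0 FUNC    WEAK   DEFAULT    UNDEF _Unwind_GetIP
   15: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF __strcpy_chk
   31: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF memmove
   48: 0000000000000000      0 NOTYPE  GLOBAL DEFAULT    UNDEF getenv
"""

if __name__ == "__main__":
    needed = needed_libraries(DYNAMIC_SAMPLE)
    undefined = undefined_globals(SYMBOL_SAMPLE)
    print("NEEDED entries:", needed)
    print("undefined global symbols:", undefined)
    if undefined and not needed:
        print("suspicious: undefined symbols but no NEEDED entries")
```

Weak undefined symbols are skipped on purpose, because they are allowed to stay unresolved at run time.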

Comment 11 Florian Weimer 2017-11-03 08:24:34 UTC
We plan to deliver a fix for this issue as part of the fix for bug 1504969, which went into glibc-2.17-217.el7.  We are backporting these upstream commits:

commit b52b0d793dcb226ecb0ecca1e672ca265973233c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Oct 20 11:00:08 2017 -0700

    x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265]

commit 0ac8ee53e8efbfd6e1c37094b4653f5c2dad65b5
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Aug 26 08:57:42 2016 -0700

    X86-64: Correct CFA in _dl_runtime_resolve
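The first commit title names fxsave, xsave and xsavec. Whether a given machine advertises the matching CPU features can be read from the flags line of /proc/cpuinfo, as in the sketch below. The flag names fxsr, xsave, xsavec and avx are the ones Linux normally reports on x86; the function names are made up, and which instruction glibc actually selects is decided inside glibc, not by this script.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def save_features(cpuinfo_text):
    """Report which state-saving related features the CPU advertises."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("fxsr", "xsave", "xsavec", "avx")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Invented sample line, used only when /proc/cpuinfo is unavailable.
        text = "flags\t\t: fpu fxsr sse sse2 avx xsave\n"
    print(save_features(text))
```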

*** This bug has been marked as a duplicate of bug 1504969 ***

Comment 12 Florian Weimer 2017-11-03 16:58:19 UTC
An unsupported, untested public preview build with this fix is available here:

  https://copr.fedorainfracloud.org/coprs/fweimer/glibc-rhel-7.5/

Comment 13 Florian Weimer 2017-11-30 13:59:30 UTC
Updated glibc packages for this issue have been released for Red Hat Enterprise Linux 7.4:

https://access.redhat.com/errata/RHBA-2017:3296

If the solution does not work for you, please open a new bug report.

