Bug 1670620
Summary: glibc: segv allocating TLS in runtime linker on ppc64le

Product: Red Hat Enterprise Linux 7
Component: glibc
Version: 7.6-Alt
Hardware: ppc64le
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Ben Woodard <woodard>
Assignee: glibc team <glibc-bugzilla>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, dj, foraker1, fweimer, mnewsome, pfrankli, tgummels, woodard
Target Milestone: rc
Target Release: ---
Doc Type: If docs needed, set a value
Environment: CORAL
Type: Bug
Regression: ---
Bug Blocks: 1599298
Last Closed: 2020-01-10 14:53:07 UTC
Description
Ben Woodard
2019-01-29 22:17:55 UTC
Thanks for the initial analysis. How can we get you a new glibc to test? If we put together a testfix, I assume we should build it for rhel-7.6 and ppc64le? Can we send you RPMs to install and test?

Because of the machine I need to test it on, it would be easiest if you just made a git branch that I could pull, build, and run like any normal glibc test build. If that is too difficult, I can arrange to get root on the system, take a few nodes out of the cluster, and install a custom system image with test glibc RPMs on them. That is fine, it is just more work for me.

(In reply to Ben Woodard from comment #6)
> Because of the machine that I need to test it on, it would be easiest if you
> just made a git branch that I could pull build and run that like any normal
> glibc test build.

I'm worried this will not yield a correct result at the customer site, e.g. running with the wrong libraries.

> If that is too difficult, I can arrange to get root on the system and then
> take a few nodes out of the cluster and install a custom system image with
> test glibc RPMs on them. That is fine, it is just more work for me.

This is what I strongly recommend. We really, really want 100% bullet-proof assurance that you're using all parts of the new runtime. To avoid any mistakes it's best to install a testfix glibc. Can you set up those nodes and verify that you can reproduce the problem on them? I'm building you a testfix with assertions enabled to see if the TLS assertions trigger.

While waiting for the affected team to assemble a reproducer for me, I compiled the affected code and looked at what parts of it use TLS, using this to gather the data:

$ ldd laghos | sed -e 's/.*>//' -e 's/.0x.*//' | while read i;do echo --- $i; eu-readelf -S $i;done | egrep tbss\|---\|tdata

Then I annotated the data from ldd (there is no difference between the T and the t -- I just bumped the caps-lock key while copying from one window to another).
$ ldd laghos
	linux-vdso.so.1 (0x00007ffd1efeb000)
	libHYPRE-2.15.1.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/hypre-2.15.1-dzfmkgkwd3zakwp5p4y4i33j7qxfdeop/lib/libHYPRE-2.15.1.so (0x00007f22c72f4000)
	libopenblas.so.0 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openblas-0.3.5-2xivefu4hjfalpivsdto7iqndctk2jxo/lib/libopenblas.so.0 (0x00007f22c673e000)
T	libmetis.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/metis-5.1.0-z2scdq3fgdep4v7e5ivoukxl5ismdua3/lib/libmetis.so (0x00007f22c66ce000)
	librt.so.1 => /usr/lib64/librt.so.1 (0x00007f22c66c4000)
	libz.so.1 => /usr/lib64/libz.so.1 (0x00007f22c66aa000)
	libmpi_cxx.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libmpi_cxx.so.40 (0x00007f22c668d000)
	libmpi.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libmpi.so.40 (0x00007f22c6423000)
T	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f22c628b000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00007f22c6105000)
	libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f22c60ea000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f22c60c8000)
t	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f22c5f02000)
	libgfortran.so.5 => /usr/lib64/libgfortran.so.5 (0x00007f22c5c85000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f22c7704000)
	libopen-rte.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libopen-rte.so.40 (0x00007f22c5b55000)
	libopen-pal.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libopen-pal.so.40 (0x00007f22c5948000)
	libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f22c5943000)
	libhwloc.so.5 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/hwloc-1.11.11-42ceqbpi2stihgk4eqhcemifgcsvkjxa/lib/libhwloc.so.5 (0x00007f22c5902000)
t	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f22c58f4000)
t	libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f22c58c9000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f22c58be000)
	libxml2.so.2 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/libxml2-2.9.8-uypnlww3lv5zp4qqu2l5bsbwlx3lpe2c/lib/libxml2.so.2 (0x00007f22c5759000)
	libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f22c5753000)
	liblzma.so.5 => /usr/lib64/liblzma.so.5 (0x00007f22c572a000)
	libiconv.so.2 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/libiconv-1.15-wu2oqeyswzh3wq6pkwyjqmm5vdln23qy/lib/libiconv.so.2 (0x00007f22c562b000)
	libquadmath.so.0 => /usr/lib64/libquadmath.so.0 (0x00007f22c55e6000)
t	libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f22c5589000)
t	libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f22c5536000)
t	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f22c552d000)
t	libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f22c54fe000)
	libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f22c5478000)

So there appears to be a considerable amount of TLS.
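As a cross-check, the same question (which loaded objects actually carry TLS) can be asked at run time with dl_iterate_phdr, by looking for a PT_TLS program header. This is just an illustrative sketch, not part of the reproducer or of hpctoolkit:

~~~
/* List every loaded object that has a PT_TLS segment (illustrative only). */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int report_tls(struct dl_phdr_info *info, size_t size, void *data)
{
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_TLS)
            printf("%s: TLS segment, memsz=%lu\n",
                   info->dlpi_name[0] ? info->dlpi_name : "[main executable]",
                   (unsigned long) info->dlpi_phdr[i].p_memsz);
    return 0;   /* keep iterating over the remaining objects */
}

int main(void)
{
    dl_iterate_phdr(report_tls, NULL);
    return 0;
}
~~~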
[ben@Mustang Work]$ grep ^[tT] tls-bug.txt
T	libmetis.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/metis-5.1.0-z2scdq3fgdep4v7e5ivoukxl5ismdua3/lib/libmetis.so (0x00007f22c66ce000)
T	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f22c628b000)
t	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f22c5f02000)
t	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f22c58f4000)
t	libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f22c58c9000)
t	libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f22c5589000)
t	libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f22c5536000)
t	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f22c552d000)
t	libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f22c54fe000)

This was on the x86_64 version. The original version uses OpenMPI, hpctoolkit, and CUPTI, so it is probably considerably more complicated. I was going to look into it on a bigger system to get a handle on the overall startup behavior with regard to threading.

Ben, I have a rhel-7.6 build with asserts enabled. Create the following /etc/yum.repos.d/rhbz1670620.repo

~~~
[rhbz1670620]
name=RHEL 7.6 testfix for bug 1670620
baseurl=https://people.redhat.com/codonell/rhel-7.6-rhbz1670620
enabled=1
gpgcheck=0
protect=1
~~~

You should be able to upgrade to the testfix glibc. As we test things I'll just keep bumping the testfix number, and yum upgrade should work. I tested the assert-enabled RPMs on a POWER8 VM system by installing them and rebooting, and I didn't see any functional problems, so it should be safe to use on another ppc64le system.

I got the instructions from J M-C on how to reproduce this. I'll work on trying to reproduce it on LLNL's system.

Keren and I finally have a simple-to-build-and-run TLS bug reproducer that you can use at LLNL. It took a little longer to integrate our hpctoolkit GPU prototype into spack, but it will help us immensely going forward.

git clone https://github.com/jmellorcrummey/bugs

Follow the simple directions in bugs/tls-bug-reproducer/README.md

The only thing that I didn't properly account for in the repository is that it assumes my spack compiler settings in ./spack/linux/compilers.yaml include the following. Starting from a basic spack repository, I used this to get my ./spack world set up:

module load gcc/7.3.1
spack compiler find

When you follow the directions in the repository I have provided, it will download and build a custom spack repository from github.com/jmellorcrummey/spack. This repository includes some private modifications to several packages to build our GPU prototype. Let us know if you have any questions.

When you run

make tls-bug

you will see that using hpcrun to monitor a Laghos execution dies. The Makefile in the tls-bug directory also supports

make inspect

which will run gdb on the Laghos binary (supplying its obscure path from my build world) and the corefile, letting you inspect the wreckage after the bug triggers. On rzansel, debug symbols are apparently available, so I see the failed execution in the loop where listp == NULL as it tries to dereference listp->len. You should be able to download and build this on any LLNL P9 system and replicate the bug.
I added a "debug" target to the Makefile in the tls-bug directory. See https://github.com/jmellorcrummey/bugs/blob/master/tls-bug-reproducer/tls-bug/Makefile

The README.md file in that directory describes how to use gdb with hpcrun. See https://github.com/jmellorcrummey/bugs/blob/master/tls-bug-reproducer/tls-bug/README.md

Confirmed that J M-C's reproducer works for me.

$ LD_DEBUG=all LD_DEBUG_OUTPUT=ldout !!
LD_DEBUG=all LD_DEBUG_OUTPUT=ldout mpirun -np 1 hpcrun -e nvidia-cuda ../laghos/Laghos/cuda/laghos -p 0 -m ../laghos/Laghos/data/square01_quad.mesh -rs 3 -tf 0.75 -pa

(Laghos ASCII-art banner)

Options used: --mesh ../laghos/Laghos/data/square01_quad.mesh --refine-serial 3 --refine-parallel 0 --problem 0 --order-kinematic 2 --order-thermo 1 --ode-solver 4 --t-final 0.75 --cfl 0.5 --cg-tol 1e-08 --cg-max-steps 300 --max-steps -1 --partial-assembly --no-visualization --visualization-steps 5 --no-visit --no-print --outputfilename results/Laghos --no-uvm --no-aware --no-hcpo --no-sync --no-share

[laghos] MPI is NOT CUDA aware
[laghos] CUDA device count: 4
[laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0)
[laghos] Cartesian partitioning will be used
[laghos] pmesh->GetNE()=256
Number of kinematic (position, velocity) dofs: 2178
Number of specific internal energy dofs: 1024
[lassen708:108589] *** Process received signal ***
[lassen708:108589] Signal: Segmentation fault (11)
[lassen708:108589] Signal code: Address not mapped (1)
[lassen708:108589] Failing at address: (nil)
[lassen708:108589] [ 0] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(+0x80f8)[0x2000001180f8]
[lassen708:108589] [ 1] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[lassen708:108589] [ 2] /lib64/ld64.so.2(_dl_allocate_tls+0x100)[0x20000001a440]
[lassen708:108589] [ 3] /lib64/libpthread.so.0(pthread_create+0x9b0)[0x2000014b9b00]
[lassen708:108589] [ 4] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(pthread_create+0x2a4)[0x200000126414]
[lassen708:108589] [ 5] /usr/lib64/nvidia/libcuda.so.1(+0x238008)[0x200000428008]
[lassen708:108589] [ 6] /usr/lib64/nvidia/libcuda.so.1(+0x434440)[0x200000624440]
[lassen708:108589] [ 7] /usr/lib64/nvidia/libcuda.so.1(+0x3e3c4c)[0x2000005d3c4c]
[lassen708:108589] [ 8] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x12f424)[0x200001cff424]
[lassen708:108589] [ 9] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x112bdc)[0x200001ce2bdc]
[lassen708:108589] [10] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x117578)[0x200001ce7578]
[lassen708:108589] [11] /usr/lib64/nvidia/libcuda.so.1(+0x3d3b0c)[0x2000005c3b0c]
[lassen708:108589] [12] /usr/lib64/nvidia/libcuda.so.1(+0x1ef3fc)[0x2000003df3fc]
[lassen708:108589] [13] /usr/lib64/nvidia/libcuda.so.1(+0x392f54)[0x200000582f54]
[lassen708:108589] [14] /usr/lib64/nvidia/libcuda.so.1(+0xe5588)[0x2000002d5588]
[lassen708:108589] [15] /usr/lib64/nvidia/libcuda.so.1(+0xe5728)[0x2000002d5728]
[lassen708:108589] [16] /usr/lib64/nvidia/libcuda.so.1(cuLaunchKernel+0x24c)[0x2000004805ec]
[lassen708:108589] [17] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(+0xe4c4)[0x200000fce4c4]
[lassen708:108589] [18] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(cudaLaunchKernel+0x230)[0x20000101be20]
[lassen708:108589] [19] ../laghos/Laghos/cuda/laghos[0x10064238]
[lassen708:108589] [20] ../laghos/Laghos/cuda/laghos[0x1002f178]
[lassen708:108589] [21] ../laghos/Laghos/cuda/laghos[0x1002b438]
[lassen708:108589] [22] ../laghos/Laghos/cuda/laghos[0x100188c0]
[lassen708:108589] [23] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(monitor_main+0x128)[0x200000122da8]
[lassen708:108589] [24] /lib64/libc.so.6(+0x25100)[0x200001515100]
[lassen708:108589] [25] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000015152f4]
[lassen708:108589] [26] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(__libc_start_main+0xf0)[0x200000121e30]
[lassen708:108589] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node lassen708 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I haven't had a chance to test with the new glibc provided above.

Attached is the strace.out.

Created attachment 1527716 [details]
the strace for the crash
This can be used to see how the processes are started and related.
The output of the run above with LD_DEBUG is too big to attach to the bug, so I put it at: http://ssh.bencoyote.net/~ben/ldout.tar.gz

I'll spend some time looking at the output to try to grok what is going on -- things like how many threads there are and how they start. One observation I can make even this early in the analysis is that this feels more like a corruption issue than a data race. With most data races the problem is highly sensitive to interruption, and any jostling of the timing moves the problem around. This particular problem didn't seem affected by either LD_DEBUG=all (which generated about 1.9 GB of data) or strace.

Some additional notes from the original reporters: some variable values inside _dl_allocate_tls_init were visible; others were optimized out. For the call where _dl_allocate_tls_init failed, the dtv had 69 entries in it. I looked at /proc/<pid>/maps and as I recall there were 98 entries that had executable code, i.e. their line in maps had 'r-x' in it, and I thought they were all unique. I didn't track down why 98 != 69. Anyway, I don't understand all of the pieces in the _dl_allocate_tls_init code.

libhpcrun.so.0.0.0 has thread-local data. The libmonitor library is a preloaded library that wraps pthread_create. The problem appears on the fourth call to _dl_allocate_tls_init, so there are only a few threads involved. You can watch each get created with a breakpoint in pthread_create and see how they are created. The problematic thread, the one that causes the error when it is initialized, is created by NVIDIA's cuLaunchKernel, which is in a closed-source library. I believe that cuLaunchKernel only creates a thread if NVIDIA's CUPTI library is involved to monitor GPU activity. Interestingly, the whole setup works fine when profiling LULESH, but both the RAJA and CUDA versions of Laghos fail. Both of these are designed as test apps to model real HPC applications.
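For context on how libmonitor sits in the pthread_create path: an LD_PRELOAD interposer works roughly like the sketch below. This is an illustrative stand-in, not libmonitor's actual code; the real library does far more bookkeeping.

~~~
/* Minimal sketch of a pthread_create interposer in the spirit of libmonitor
 * (illustrative only).  Build with: gcc -shared -fPIC -o wrap.so wrap.c -ldl
 * and load it via LD_PRELOAD. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
{
    /* Resolve the real pthread_create on first use. */
    static pthread_create_fn real_create;
    if (real_create == NULL)
        real_create = (pthread_create_fn) dlsym(RTLD_NEXT, "pthread_create");

    fprintf(stderr, "[wrap] pthread_create called, start_routine=%p\n",
            (void *) start_routine);

    /* The real call is where the new thread's DTV is set up via
     * _dl_allocate_tls, which is where the crash in this bug happens. */
    return real_create(thread, attr, start_routine, arg);
}
~~~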
[butte5:47379] *** Process received signal *** [butte5:47379] Signal: Segmentation fault (11) [butte5:47379] Signal code: Address not mapped (1) [butte5:47379] Failing at address: (nil) [butte5:47379] [ 0] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(+0x80f8)[0x2000001180f8] [butte5:47379] [ 1] [0x2000000504d8] [butte5:47379] [ 2] /lib64/ld64.so.2(_dl_allocate_tls+0x100)[0x20000001b480] [butte5:47379] [ 3] /lib64/libpthread.so.0(pthread_create+0x9b0)[0x2000014b9ba0] [butte5:47379] [ 4] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(pthread_create+0x2a4)[0x200000126414] [butte5:47379] [ 5] /usr/lib64/nvidia/libcuda.so.1(+0x238008)[0x200000428008] [butte5:47379] [ 6] /usr/lib64/nvidia/libcuda.so.1(+0x434440)[0x200000624440] [butte5:47379] [ 7] /usr/lib64/nvidia/libcuda.so.1(+0x3e3c4c)[0x2000005d3c4c] [butte5:47379] [ 8] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x12f424)[0x200001cff424] [butte5:47379] [ 9] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x112bdc)[0x200001ce2bdc] [butte5:47379] [10] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x117578)[0x200001ce7578] [butte5:47379] [11] /usr/lib64/nvidia/libcuda.so.1(+0x3d3b0c)[0x2000005c3b0c] [butte5:47379] [12] /usr/lib64/nvidia/libcuda.so.1(+0x1ef3fc)[0x2000003df3fc] [butte5:47379] [13] /usr/lib64/nvidia/libcuda.so.1(+0x392f54)[0x200000582f54] [butte5:47379] [14] /usr/lib64/nvidia/libcuda.so.1(+0xe5588)[0x2000002d5588] [butte5:47379] [15] /usr/lib64/nvidia/libcuda.so.1(+0xe5728)[0x2000002d5728] [butte5:47379] [16] /usr/lib64/nvidia/libcuda.so.1(cuLaunchKernel+0x24c)[0x2000004805ec] [butte5:47379] [17] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(+0xe4c4)[0x200000fce4c4] [butte5:47379] [18] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(cudaLaunchKernel+0x230)[0x20000101be20] [butte5:47379] [19] ../laghos/Laghos/cuda/laghos[0x10064238] [butte5:47379] [20] ../laghos/Laghos/cuda/laghos[0x1002f178] [butte5:47379] [21] ../laghos/Laghos/cuda/laghos[0x1002b438] [butte5:47379] [22] ../laghos/Laghos/cuda/laghos[0x100188c0] [butte5:47379] [23] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(monitor_main+0x128)[0x200000122da8] [butte5:47379] [24] /lib64/libc.so.6(+0x25100)[0x200001515100] [butte5:47379] [25] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000015152f4] [butte5:47379] [26] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(__libc_start_main+0xf0)[0x200000121e30] [butte5:47379] *** End of error message *** [ben@butte5:tls-bug]$ gdb ../laghos/Laghos/cuda/laghos butte5-laghos-47379.core GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. 
This GDB was configured as "ppc64le-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos...(no debugging symbols found)...done. [New LWP 47379] [New LWP 47426] [New LWP 47427] [New LWP 47428] [New LWP 47449] [New LWP 47450] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `../laghos/Laghos/cuda/laghos -p 0 -m ../laghos/Laghos/data/square01_quad.mesh -'. Program terminated with signal 11, Segmentation fault. #0 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002261a140) at dl-tls.c:471 471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) warning: File "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py:/usr/lib/golang/src/pkg/runtime/runtime-gdb.py". To enable execution of this file add add-auto-load-safe-path /usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6.0.20-gdb.py line to your configuration file "/g/g0/ben/.gdbinit". To completely disable this security protection add set auto-load safe-path / line to your configuration file "/g/g0/ben/.gdbinit". For more information about this security protection see the Missing separate debuginfos, use: debuginfo-install libibumad-43.1.1.MLNX20171122.0eb0969-0.1.43401.1.ppc64le libibverbs-41mlnx1-OFED.4.3.2.1.6.43401.1.ppc64le libmlx4-41mlnx1-OFED.4.1.0.1.0.43401.1.ppc64le libmlx5-41mlnx1-OFED.4.3.4.0.3.43401.1.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.43401.1.ppc64le numactl-libs-2.0.9-7.el7.ppc64le opensm-libs-5.0.0.MLNX20180219.c610c42-0.1.43401.1.ppc64le openssl-libs-1.0.2k-12.el7.ppc64le ---Type <return> to continue, or q <return> to quit--- "Auto-loading safe path" section in the GDB manual. 
E.g., run from the shell: info "(gdb)Auto-loading safe path"

(gdb) set pagination off
(gdb) bt
#0  0x000020000001b434 in _dl_allocate_tls_init (result=0x20002261a140) at dl-tls.c:471
#1  __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
#2  0x00002000014b9ba0 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffffff4980) at allocatestack.c:539
#3  __pthread_create_2_1 (newthread=0x7fffffff4940, attr=0x7fffffff4980, start_routine=0x200000097ec0 <finalize_all_thread_data>, arg=0x2000083c70f0) at pthread_create.c:447
#4  0x000020000012628c in pthread_create () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#5  0x0000200000098640 in hpcrun_threadMgr_data_fini () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#6  0x0000200000084bdc in hpcrun_fini_internal () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#7  0x0000200000085558 in monitor_fini_process () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#8  0x00002000001229c0 in monitor_end_process_fcn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#9  0x00002000001181b4 in monitor_signal_handler () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#10 <signal handler called>
#11 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471
#12 __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
#13 0x00002000014b9ba0 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffffff5ed0) at allocatestack.c:539
#14 __pthread_create_2_1 (newthread=0x153721d8, attr=0x7fffffff5ed0, start_routine=0x200000124ba0 <monitor_begin_thread>, arg=0x200000146410 <monitor_init_tn_array+400>) at pthread_create.c:447
#15 0x0000200000126414 in pthread_create () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#16 0x0000200000428008 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x0000200000624440 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#18 0x00002000005d3c4c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#19 0x0000200001cff424 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#20 0x0000200001ce2bdc in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#21 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#22 0x00002000005c3b0c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#23 0x00002000003df3fc in ?? () from /usr/lib64/nvidia/libcuda.so.1
#24 0x0000200000582f54 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#25 0x00002000002d5588 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#26 0x00002000002d5728 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#27 0x00002000004805ec in cuLaunchKernel () from /usr/lib64/nvidia/libcuda.so.1
#28 0x0000200000fce4c4 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2
#29 0x000020000101be20 in cudaLaunchKernel () from /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2
#30 0x0000000010064238 in vector_op_eq(int, double, double*) ()
#31 0x000000001002f178 in mfem::CudaVector::CudaVector(unsigned long, double) ()
#32 0x000000001002b438 in mfem::hydrodynamics::LagrangianHydroOperator::LagrangianHydroOperator(int, mfem::CudaFiniteElementSpace&, mfem::CudaFiniteElementSpace&, mfem::Array<int>&, mfem::CudaGridFunction&, int, double, mfem::Coefficient*, bool, bool, double, int) ()
#33 0x00000000100188c0 in main ()

(gdb) p listp
$1 = (struct dtv_slotinfo_list *) 0x0
(gdb) p result
$2 = (void *) 0x20002261a140
(gdb) p dtv
$3 = (dtv_t *) 0x11c93610
(gdb) p dtv[-1]
$4 = {counter = 106, pointer = {val = 0x6a, is_static = false}}

Created attachment 1534261 [details]
core file running under the new glibc packages
Created attachment 1534262 [details]
actual executable
Here is the perplexing part:

466       listp = GL(dl_tls_dtv_slotinfo_list);

(gdb) p *_rtld_local._dl_tls_dtv_slotinfo_list
$12 = {len = 69, next = 0x0, slotinfo = 0x2000022d4058}

471       for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)

So the for loop will go through 68 times, leaving cnt at 69 when the loop terminates. This cnt gets added to total:

514       total += cnt;

(gdb) p total
$18 = 69

but the termination condition is:

515       if (total >= GL(dl_tls_max_dtv_idx))
516         break;

(gdb) p _rtld_local._dl_tls_max_dtv_idx
$19 = 74

so we don't break out, which puts us in the case where we advance listp, whose next pointer is NULL.

(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[0].map.l_name
$36 = 0x2000000281a0 ""
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[1].map.l_name
$37 = 0x2000000281a0 ""
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[2].map.l_name
$29 = 0x200000043748 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[3].map.l_name
$30 = 0x200000046e38 "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[4].map.l_name
$31 = 0x200000048208 "/lib64/libc.so.6"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[5].map.l_name
$32 = 0x20000004bf60 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/elfutils-0.174-figlq6trgfl7hv3trmucrlmah6myu3yu/lib/libelf.so.1"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[6].map.l_name
$33 = 0x20000004dc10 "/usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[7].map.l_name
$34 = 0x11843ce0 "/lib64/libnuma.so.1"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[8].map.l_name
$35 = 0x1192a7f0 "/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map.l_name
Cannot access memory at address 0x8
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map
$38 = (struct link_map *) 0x0
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9]
$39 = {gen = 0, map = 0x0}
(gdb) p _rtld_local._dl_tls_max_dtv_idx
$40 = 74
(gdb) p _rtld_local._dl_tls_dtv_gaps
$41 = false
So somewhere dl_tls_max_dtv_idx is getting corrupted.
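To make the failure mode concrete, the sketch below paraphrases the slotinfo walk in _dl_allocate_tls_init. It is simplified from glibc's dl-tls.c and uses the values from the gdb session above; it is illustrative code, not the actual glibc source.

~~~
/* Simplified sketch of the dtv slotinfo walk in _dl_allocate_tls_init.
 * With the values seen in the core file (len = 69, next = NULL,
 * dl_tls_max_dtv_idx = 74) the loop walks off the end of the list. */
#include <stddef.h>

struct link_map;                        /* opaque here */
struct dtv_slotinfo { size_t gen; struct link_map *map; };
struct dtv_slotinfo_list {
    size_t len;
    struct dtv_slotinfo_list *next;
    struct dtv_slotinfo *slotinfo;
};

void walk_slotinfo(struct dtv_slotinfo_list *listp, size_t max_dtv_idx)
{
    size_t total = 0;
    while (1) {
        size_t cnt;
        for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) {
            /* ... set up the TLS block for slotinfo[cnt].map ... */
        }
        total += cnt;                /* total becomes 69 in the core file   */
        if (total >= max_dtv_idx)    /* but max_dtv_idx is 74, so no break  */
            break;
        listp = listp->next;         /* next is NULL here ...               */
        /* ... so the next iteration reads listp->len: the SIGSEGV at
         * dl-tls.c:471 */
    }
}
~~~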
Breakpoint 1 at 0x20000001b118: _dl_allocate_tls_init. (2 locations)
Missing separate debuginfos, use: debuginfo-install openssl-libs-1.0.2k-12.el7.ppc64le
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>p _rtld_local._dl_tls_max_dtv_idx
>c
>end
(gdb) c
Continuing.
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$1 = 6
[New Thread 0x2000034299f0 (LWP 66689)]
<snip>
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$2 = 6
[New Thread 0x200003e999f0 (LWP 66691)]
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$3 = 6
[New Thread 0x200008b599f0 (LWP 66692)]
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$4 = 8
[New Thread 0x200020a999f0 (LWP 66705)]
[laghos] MPI is NOT CUDA aware
[laghos] CUDA device count: 4
[laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0)
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$5 = 8
[New Thread 0x2000213b99f0 (LWP 66706)]
[laghos] Cartesian partitioning will be used
[laghos] pmesh->GetNE()=256
Number of kinematic (position, velocity) dofs: 2178
Number of specific internal energy dofs: 1024
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$6 = 74
Program received signal SIGSEGV, Segmentation fault.
0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471
471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)
Missing separate debuginfos, use: debuginfo-install libibumad-43.1.1.MLNX20171122.0eb0969-0.1.43401.1.ppc64le libibverbs-41mlnx1-OFED.4.3.2.1.6.43401.1.ppc64le libmlx4-41mlnx1-OFED.4.1.0.1.0.43401.1.ppc64le libmlx5-41mlnx1-OFED.4.3.4.0.3.43401.1.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.43401.1.ppc64le numactl-libs-2.0.9-7.el7.ppc64le opensm-libs-5.0.0.MLNX20180219.c610c42-0.1.43401.1.ppc64le
(gdb)
Hmm this looks suspicious (gdb) c Continuing. [New Thread 0x200020a999f0 (LWP 70543)] [laghos] MPI is NOT CUDA aware [laghos] CUDA device count: 4 [laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0) Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL (gdb) watch _rtld_local._dl_tls_max_dtv_idx Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx (gdb) c Continuing. [New Thread 0x2000213b99f0 (LWP 71117)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx Old value = 8 New value = 9 _dl_next_tls_modid () at dl-tls.c:104 104 } (gdb) bt #0 _dl_next_tls_modid () at dl-tls.c:104 #1 0x0000200000008044 in _dl_map_object_from_fd (name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", origname=0x0, fd=<optimized out>, fbp=0x7fffffff34c0, realname=0x1530a370 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fffffff3820, nsid=0) at dl-load.c:1199 #2 0x000020000000be6c in _dl_map_object (loader=0x0, name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", type=<optimized out>, trace_mode=<optimized out>, mode=<optimized out>, nsid=<optimized out>) at dl-load.c:2400 #3 0x000020000001d5b0 in dl_open_worker (a=0x7fffffff3dd0) at dl-open.c:231 #4 0x00002000000170d0 in _dl_catch_error (objname=0x7fffffff3e30, errstring=0x7fffffff3e20, mallocedp=0x7fffffff3e40, operate=0x20000001d0b0 <dl_open_worker>, args=0x7fffffff3dd0) at dl-error.c:177 #5 0x000020000001ca0c in _dl_open (file=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", mode=<optimized out>, caller_dlopen=0x200000115a20 <dlopen+128>, nsid=-2, argc=<optimized out>, argv=0x7fffffffa118, env=0x11bca9a0) at dl-open.c:649 #6 0x00002000016e1138 in dlopen_doit (a=0x7fffffff4270) at dlopen.c:66 #7 0x00002000000170d0 in _dl_catch_error (objname=0x1177f990, errstring=0x1177f998, mallocedp=0x1177f988, operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dl-error.c:177 #8 0x00002000016e1c18 in _dlerror_run (operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dlerror.c:163 #9 0x00002000016e1238 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so #11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #12 0x00002000000bbbdc in cupti_callstack_ignore_map_ignore () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #13 0x00002000000b8240 in cupti_correlation_callback_cuda () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #14 0x00002000000b9418 in cupti_subscriber_callback () from 
/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #15 0x0000200001cb4218 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #16 0x0000200001ce4778 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #17 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #18 0x00002000005c3b0c in ?? () from /usr/lib64/nvidia/libcuda.so.1 #19 0x000020000047aae0 in cuMemcpyHtoD_v2 () from /usr/lib64/nvidia/libcuda.so.1 #20 0x0000000010073b40 in mfem::rmemcpy::rHtoD(void*, void const*, unsigned long, bool) () #21 0x0000000010068ae4 in mfem::CudaFiniteElementSpace::CudaFiniteElementSpace(mfem::Mesh*, mfem::FiniteElementCollection const*, int, mfem::Ordering::Type) () #22 0x0000000010017f90 in main ()

Putting it all together, I sent this to the customer:

-------------------

When digging around in _dl_allocate_tls_init you see that listp comes from:

466       listp = GL(dl_tls_dtv_slotinfo_list);

(gdb) p *_rtld_local._dl_tls_dtv_slotinfo_list
$12 = {len = 69, next = 0x0, slotinfo = 0x2000022d4058}

471       for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)

So the for loop will go through 68 times, leaving cnt at 69 when the loop terminates. This cnt gets added to total:

514       total += cnt;

(gdb) p total
$18 = 69

but the termination condition is:

515       if (total >= GL(dl_tls_max_dtv_idx))
516         break;

(gdb) p _rtld_local._dl_tls_max_dtv_idx
$19 = 74

so we don't break out, which puts us in the case where we advance listp, whose next pointer is NULL. But when you look at the libraries that are actually loaded, only a few of them actually use TLS, so the 74 seemed weird, especially since the vector only has 69 slots. At first I thought all the shared libs, even the ones without TLS, were inserted into this array, but when you look at other places in the dynamic linker's code where it iterates through the ELF section headers, you can see that an object is only inserted into this array when it does have TLS, which makes more sense.
This is easy to confirm: (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[0].map.l_name $36 = 0x2000000281a0 "" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[1].map.l_name $37 = 0x2000000281a0 "" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[2].map.l_name $29 = 0x200000043748 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[3].map.l_name $30 = 0x200000046e38 "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[4].map.l_name $31 = 0x200000048208 "/lib64/libc.so.6" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[5].map.l_name $32 = 0x20000004bf60 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/elfutils-0.174-figlq6trgfl7hv3trmucrlmah6myu3yu/lib/libelf.so.1" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[6].map.l_name $33 = 0x20000004dc10 "/usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[7].map.l_name $34 = 0x11843ce0 "/lib64/libnuma.so.1" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[8].map.l_name $35 = 0x1192a7f0 "/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map.l_name Cannot access memory at address 0x8 Also I can see that the rest of the entries are also empty. So the question is why is _rtld_local._dl_tls_max_dtv_idx 74 then? Taking a look at the variable it appears sensible up to a point before it goes off the rails.: Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $3 = 6 [New Thread 0x200008b599f0 (LWP 66692)] Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $4 = 8 [New Thread 0x200020a999f0 (LWP 66705)] [laghos] MPI is NOT CUDA aware [laghos] CUDA device count: 4 [laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0) Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $5 = 8 [New Thread 0x2000213b99f0 (LWP 66706)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Number of kinematic (position, velocity) dofs: 2178 Number of specific internal energy dofs: 1024 Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $6 = 74 Program received signal SIGSEGV, Segmentation fault. 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471 471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) Watching that variable at the place where it goes off the rails, I run into: (gdb) c Continuing. 
[New Thread 0x2000213b99f0 (LWP 71117)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx Old value = 8 New value = 9 _dl_next_tls_modid () at dl-tls.c:104 104 } (gdb) bt #0 _dl_next_tls_modid () at dl-tls.c:104 #1 0x0000200000008044 in _dl_map_object_from_fd (name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", origname=0x0, fd=<optimized out>, fbp=0x7fffffff34c0, realname=0x1530a370 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fffffff3820, nsid=0) at dl-load.c:1199 #2 0x000020000000be6c in _dl_map_object (loader=0x0, name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", type=<optimized out>, trace_mode=<optimized out>, mode=<optimized out>, nsid=<optimized out>) at dl-load.c:2400 #3 0x000020000001d5b0 in dl_open_worker (a=0x7fffffff3dd0) at dl-open.c:231 #4 0x00002000000170d0 in _dl_catch_error (objname=0x7fffffff3e30, errstring=0x7fffffff3e20, mallocedp=0x7fffffff3e40, operate=0x20000001d0b0 <dl_open_worker>, args=0x7fffffff3dd0) at dl-error.c:177 #5 0x000020000001ca0c in _dl_open (file=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", mode=<optimized out>, caller_dlopen=0x200000115a20 <dlopen+128>, nsid=-2, argc=<optimized out>, argv=0x7fffffffa118, env=0x11bca9a0) at dl-open.c:649 #6 0x00002000016e1138 in dlopen_doit (a=0x7fffffff4270) at dlopen.c:66 #7 0x00002000000170d0 in _dl_catch_error (objname=0x1177f990, errstring=0x1177f998, mallocedp=0x1177f988, operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dl-error.c:177 #8 0x00002000016e1c18 in _dlerror_run (operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dlerror.c:163 #9 0x00002000016e1238 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so #11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #12 0x00002000000bbbdc in cupti_callstack_ignore_map_ignore () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #13 0x00002000000b8240 in cupti_correlation_callback_cuda () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #14 0x00002000000b9418 in cupti_subscriber_callback () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #15 0x0000200001cb4218 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #16 0x0000200001ce4778 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #17 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #18 0x00002000005c3b0c in ?? 
() from /usr/lib64/nvidia/libcuda.so.1 #19 0x000020000047aae0 in cuMemcpyHtoD_v2 () from /usr/lib64/nvidia/libcuda.so.1 #20 0x0000000010073b40 in mfem::rmemcpy::rHtoD(void*, void const*, unsigned long, bool) () #21 0x0000000010068ae4 in mfem::CudaFiniteElementSpace::CudaFiniteElementSpace(mfem::Mesh*, mfem::FiniteElementCollection const*, int, mfem::Ordering::Type) () #22 0x0000000010017f90 in main ()

The most interesting things in there for me are:

#10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so

CUPTI is calling dlopen through libmonitor.so and something goes wrong: we end up in _dlerror_run and _dl_catch_error, and in the process _dl_tls_max_dtv_idx gets incremented even though nothing is added to _rtld_local._dl_tls_dtv_slotinfo_list. Oddly, it seems like what is being dlopened over and over is:

/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos

Anyway, that is as far as I've gotten. I'm out of the TLS code now and looking at the dlopen error handling code. I haven't yet figured out what the precise problem is, but I wanted to give you an update so far and ask if there is any possibility that something libmonitor is doing may have triggered the problem.

-ben

----------------

He wrote back:

Thanks so much for your help. I think that your work has paid off! I think that we have two problems: a bug in hpctoolkit and a bug in the loader.

Based on your detective work below, we looked at the implementation of cupti_callstack_ignore_map_ignore and the functions it calls. We want to ignore procedure frames that belong to any of three NVIDIA load modules: libcuda, libcupti, and libcuda_rt. We ask whether a procedure frame belongs to one of NVIDIA's load modules by checking whether it is in the same shared library where some known NVIDIA functions are defined. (We don't want to depend on the names of the libraries or the paths to them since these are not standard.) If a load module belongs to NVIDIA, we remember it in a data structure called cupti_callstack_ignore_map.

In looking at our code for cupti_lm_contains_fn, in the following file:

https://github.com/HPCToolkit/hpctoolkit/blob/ompt-device-cfg-master/src/tool/hpcrun/sample-sources/nvidia/cupti-api.c

we found that it contains a dlopen without a matching dlclose. That clearly seems to be a bug in our code. The issue for the loader might be that when the same load module is opened several times, the TLS data structures get corrupted.

We are swamped at present but will make the change to our code, test it, and see if it resolves our problem. I think there is still a bug in the loader that our code trips. We can improve our code to not test the load map.

---------------

I'll keep looking to see if I can figure out why dlopen()'ing an already open file increments _rtld_local._dl_tls_max_dtv_idx even when nothing new is added. That seems like a much simpler problem.
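For reference, the pattern under suspicion boils down to something like the hypothetical sketch below: repeatedly dlopen()'ing an object that is already loaded (here the running executable itself, mirroring what the traces showed) without a matching dlclose(). This is an illustrative reproduction attempt, not the hpctoolkit code, and as noted below it was not sufficient on its own to trigger the crash.

~~~
/* Hypothetical minimal attempt to exercise the suspected path: dlopen an
 * already-loaded object repeatedly and never dlclose it.
 * Build with: gcc -o redlopen redlopen.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    (void) argc;
    const char *target = argv[0];  /* the already-loaded main executable */
    for (int i = 0; i < 100; ++i) {
        void *h = dlopen(target, RTLD_NOW | RTLD_GLOBAL);
        if (h == NULL)
            /* For a plain executable this dlopen fails; the question is
             * whether each failed load leaks a TLS module id. */
            fprintf(stderr, "dlopen attempt %d failed: %s\n", i, dlerror());
        /* deliberately no dlclose(h) */
    }
    return 0;
}
~~~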
It occurs to me that this could easily be a cross-platform latent bug that no one uncovered before, because what you would need to do is trigger the accounting bug (probably opening an already-open object that uses TLS) more than 70-<number of libs that use TLS> times. If a program did it fewer times, the code in _dl_allocate_tls_init is robust enough that it will just skip over those unused entries in the slotinfo table without causing a problem.

Ben,

The team and I went through the logs for this bug again and we believe this might have been fixed in rhel-7.8: https://bugzilla.redhat.com/show_bug.cgi?id=1740039

The reason we think it's this bug is because we see:

open("/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", O_RDONLY|O_CLOEXEC) = 59

in the traces. This looks like the application is trying to dlopen copies of itself, so we think this particular failure might be fixed. In the case of bug 1740039 we had a good reproducer we could use to track down the failure. Are you able to try this again with rhel-7.8 and see if you can reproduce the instability?

We got this one. The customer was VERY happy. Yes, the problem was related to dlopen'ing copies of itself.

"Thanks so much for your help. I think that your work has paid off! I think that we have two problems: a bug in hpctoolkit and a bug in the loader. Based on your detective work below, we looked at the implementation of cupti_callstack_ignore_map_ignore and the functions it calls. We want to ignore procedure frames that belong to any of three NVIDIA load modules: libcuda, libcupti, and libcuda_rt. We ask whether a procedure frame belongs to one of NVIDIA's load modules by checking whether it is in the same shared library where some known NVIDIA functions are defined. (We don't want to depend on the names of the libraries or the paths to them since these are not standard.) If a load module belongs to NVIDIA, we remember it in a data structure called cupti_callstack_ignore_map. In looking at our code for cupti_lm_contains_fn, in the following file: https://github.com/HPCToolkit/hpctoolkit/blob/ompt-device-cfg-master/src/tool/hpcrun/sample-sources/nvidia/cupti-api.c we found that it contains a dlopen without a matching dlclose. That clearly seems to be a bug in our code. The issue for the loader might be that when the same load module is opened several times, the TLS data structures get corrupted. We are swamped at present but will make the change to our code, test it, and see if it resolves our problem. I think there is still a bug in the loader that our code trips. We can improve our code to not test the load map."

Even at that time, my attempts to trigger the problem by just repeatedly dlopening the same file over and over didn't cause the problem to reproduce. Because the customer was happy and I couldn't figure out how to reproduce the problem, I didn't come back to trying to find out exactly why this triggered a problem in the TLS code in their case. It evidently was a subtle interaction between the threads, the number of libraries with TLS, and how many times they were opened.

Ben,

We are going to consider this bug fixed in rhel-7.8 then, and if we hit this again we can open another bug with another detailed analysis of the failure. I'm marking this CLOSED/DUPLICATE.

*** This bug has been marked as a duplicate of bug 1740039 ***