Bug 1670620
Summary: glibc: segv allocating TLS in runtime linker on ppc64le

Product: Red Hat Enterprise Linux 7
Component: glibc
Version: 7.6-Alt
Hardware: ppc64le
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Ben Woodard <woodard>
Assignee: glibc team <glibc-bugzilla>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, dj, foraker1, fweimer, mnewsome, pfrankli, tgummels, woodard
Target Milestone: rc
Target Release: ---
Doc Type: If docs needed, set a value
Environment: CORAL
Type: Bug
Regression: ---
Bug Blocks: 1599298
Last Closed: 2020-01-10 14:53:07 UTC
Description
Ben Woodard
2019-01-29 22:17:55 UTC
Thanks for the initial analysis. How can we get you a new glibc to test? If we put together a testfix, I assume we should build it for rhel-7.6 and ppc64le? Can we send you RPMs to install and test?

Because of the machine I need to test it on, it would be easiest if you just made a git branch that I could pull, build, and run like any normal glibc test build. If that is too difficult, I can arrange to get root on the system, take a few nodes out of the cluster, and install a custom system image with test glibc RPMs on them. That is fine, it is just more work for me.

(In reply to Ben Woodard from comment #6)
> Because of the machine that I need to test it on, it would be easiest if you
> just made a git branch that I could pull build and run that like any normal
> glibc test build.

I'm worried this will not yield a correct result at the customer site, e.g. running with the wrong libraries.

> If that is too difficult, I can arrange to get root on the system and then
> take a few nodes out of the cluster and install a custom system image with
> test glibc RPMs on them. That is fine, it is just more work for me.

This is what I strongly recommend. We really, really want 100% bullet-proof assurance that you're using all parts of the new runtime. To avoid any mistakes it's best to install a testfix glibc. Can you set up those nodes and verify that you can reproduce the problem on them? I'm building you a testfix with assertions enabled to see if the TLS assertions trigger.

While waiting for the affected team to assemble a reproducer for me, I compiled the affected code and looked at what parts of it use TLS, using this to gather the data:

$ ldd laghos | sed -e 's/.*>//' -e 's/.0x.*//' | while read i;do echo --- $i; eu-readelf -S $i;done | egrep tbss\|---\|tdata

Then I annotated the data from ldd (there is no difference between the T and the t -- I just bumped the caps-lock key while copying from one window to another).
$ ldd laghos
	linux-vdso.so.1 (0x00007ffd1efeb000)
	libHYPRE-2.15.1.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/hypre-2.15.1-dzfmkgkwd3zakwp5p4y4i33j7qxfdeop/lib/libHYPRE-2.15.1.so (0x00007f22c72f4000)
	libopenblas.so.0 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openblas-0.3.5-2xivefu4hjfalpivsdto7iqndctk2jxo/lib/libopenblas.so.0 (0x00007f22c673e000)
T	libmetis.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/metis-5.1.0-z2scdq3fgdep4v7e5ivoukxl5ismdua3/lib/libmetis.so (0x00007f22c66ce000)
	librt.so.1 => /usr/lib64/librt.so.1 (0x00007f22c66c4000)
	libz.so.1 => /usr/lib64/libz.so.1 (0x00007f22c66aa000)
	libmpi_cxx.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libmpi_cxx.so.40 (0x00007f22c668d000)
	libmpi.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libmpi.so.40 (0x00007f22c6423000)
T	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f22c628b000)
	libm.so.6 => /usr/lib64/libm.so.6 (0x00007f22c6105000)
	libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f22c60ea000)
	libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f22c60c8000)
t	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f22c5f02000)
	libgfortran.so.5 => /usr/lib64/libgfortran.so.5 (0x00007f22c5c85000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f22c7704000)
	libopen-rte.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libopen-rte.so.40 (0x00007f22c5b55000)
	libopen-pal.so.40 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/openmpi-3.1.3-45urwiozdivamagc2h6norga22wgmr7b/lib/libopen-pal.so.40 (0x00007f22c5948000)
	libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f22c5943000)
	libhwloc.so.5 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/hwloc-1.11.11-42ceqbpi2stihgk4eqhcemifgcsvkjxa/lib/libhwloc.so.5 (0x00007f22c5902000)
t	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f22c58f4000)
t	libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f22c58c9000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f22c58be000)
	libxml2.so.2 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/libxml2-2.9.8-uypnlww3lv5zp4qqu2l5bsbwlx3lpe2c/lib/libxml2.so.2 (0x00007f22c5759000)
	libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f22c5753000)
	liblzma.so.5 => /usr/lib64/liblzma.so.5 (0x00007f22c572a000)
	libiconv.so.2 => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/libiconv-1.15-wu2oqeyswzh3wq6pkwyjqmm5vdln23qy/lib/libiconv.so.2 (0x00007f22c562b000)
	libquadmath.so.0 => /usr/lib64/libquadmath.so.0 (0x00007f22c55e6000)
t	libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f22c5589000)
t	libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f22c5536000)
t	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f22c552d000)
t	libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f22c54fe000)
	libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f22c5478000)

So there appears to be a considerable amount of TLS.
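As a cross-check, the same question (which loaded objects actually carry TLS) can be asked at run time with dl_iterate_phdr, by looking for a PT_TLS program header. This is just an illustrative sketch, not part of the reproducer or of hpctoolkit:

~~~
/* List every loaded object that has a PT_TLS segment (illustrative only). */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int report_tls(struct dl_phdr_info *info, size_t size, void *data)
{
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_TLS)
            printf("%s: TLS segment, memsz=%lu\n",
                   info->dlpi_name[0] ? info->dlpi_name : "[main executable]",
                   (unsigned long) info->dlpi_phdr[i].p_memsz);
    return 0;   /* keep iterating over the remaining objects */
}

int main(void)
{
    dl_iterate_phdr(report_tls, NULL);
    return 0;
}
~~~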
[ben@Mustang Work]$ grep ^[tT] tls-bug.txt
T	libmetis.so => /home/ben/Work/spack/opt/spack/linux-fedora29-x86_64/gcc-8.2.1/metis-5.1.0-z2scdq3fgdep4v7e5ivoukxl5ismdua3/lib/libmetis.so (0x00007f22c66ce000)
T	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f22c628b000)
t	libc.so.6 => /usr/lib64/libc.so.6 (0x00007f22c5f02000)
t	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f22c58f4000)
t	libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f22c58c9000)
t	libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f22c5589000)
t	libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f22c5536000)
t	libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f22c552d000)
t	libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f22c54fe000)

This was on the x86_64 version. The original version uses OpenMPI, hpctoolkit, and CUPTI, so it is probably considerably more complicated. I was going to look into it on a bigger system to get a handle on the overall startup behavior with regard to threading.

Ben, I have a rhel-7.6 build with asserts enabled. Create the following /etc/yum.repos.d/rhbz1670620.repo

~~~
[rhbz1670620]
name=RHEL 7.6 testfix for bug 1670620
baseurl=https://people.redhat.com/codonell/rhel-7.6-rhbz1670620
enabled=1
gpgcheck=0
protect=1
~~~

You should be able to upgrade to the testfix glibc. As we test things I'll just keep bumping the testfix number, and yum upgrade should work. I tested the assert-enabled RPMs on a POWER8 VM system by installing them and rebooting, and I didn't see any functional problems, so it should be safe to use on another ppc64le system.

I got the instructions from J M-C on how to reproduce this. I'll work on trying to reproduce it on LLNL's system.

Keren and I finally have a simple-to-build-and-run TLS bug reproducer that you can use at LLNL. It took a little longer to integrate our hpctoolkit GPU prototype into spack, but it will help us immensely going forward.

git clone https://github.com/jmellorcrummey/bugs

Follow the simple directions in bugs/tls-bug-reproducer/README.md

The only thing that I didn't properly account for in the repository is that it assumes my spack compiler settings in ./spack/linux/compilers.yaml include the following. Starting from a basic spack repository, I used this to get my ./spack world set up:

module load gcc/7.3.1
spack compiler find

When you follow the directions in the repository I have provided, it will download and build a custom spack repository from github.com/jmellorcrummey/spack. This repository includes some private modifications to several packages to build our GPU prototype. Let us know if you have any questions.

When you run

make tls-bug

you will see that using hpcrun to monitor a Laghos execution dies. The Makefile in the tls-bug directory also supports

make inspect

which will run gdb on the Laghos binary (supplying its obscure path from my build world) and the corefile, letting you inspect the wreckage after the bug triggers. On rzansel, debug symbols are apparently available, so I see the failed execution in the loop where listp == NULL as it tries to dereference listp->len. You should be able to download and build this on any LLNL P9 system and replicate the bug.
I added a "debug" target to the Makefile in the tls-bug directory. See https://github.com/jmellorcrummey/bugs/blob/master/tls-bug-reproducer/tls-bug/Makefile

The README.md file in that directory describes how to use gdb with hpcrun. See https://github.com/jmellorcrummey/bugs/blob/master/tls-bug-reproducer/tls-bug/README.md

Confirmed that J M-C's reproducer works for me.

$ LD_DEBUG=all LD_DEBUG_OUTPUT=ldout !!
LD_DEBUG=all LD_DEBUG_OUTPUT=ldout mpirun -np 1 hpcrun -e nvidia-cuda ../laghos/Laghos/cuda/laghos -p 0 -m ../laghos/Laghos/data/square01_quad.mesh -rs 3 -tf 0.75 -pa

(Laghos ASCII-art banner)

Options used: --mesh ../laghos/Laghos/data/square01_quad.mesh --refine-serial 3 --refine-parallel 0 --problem 0 --order-kinematic 2 --order-thermo 1 --ode-solver 4 --t-final 0.75 --cfl 0.5 --cg-tol 1e-08 --cg-max-steps 300 --max-steps -1 --partial-assembly --no-visualization --visualization-steps 5 --no-visit --no-print --outputfilename results/Laghos --no-uvm --no-aware --no-hcpo --no-sync --no-share

[laghos] MPI is NOT CUDA aware
[laghos] CUDA device count: 4
[laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0)
[laghos] Cartesian partitioning will be used
[laghos] pmesh->GetNE()=256
Number of kinematic (position, velocity) dofs: 2178
Number of specific internal energy dofs: 1024
[lassen708:108589] *** Process received signal ***
[lassen708:108589] Signal: Segmentation fault (11)
[lassen708:108589] Signal code: Address not mapped (1)
[lassen708:108589] Failing at address: (nil)
[lassen708:108589] [ 0] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(+0x80f8)[0x2000001180f8]
[lassen708:108589] [ 1] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[lassen708:108589] [ 2] /lib64/ld64.so.2(_dl_allocate_tls+0x100)[0x20000001a440]
[lassen708:108589] [ 3] /lib64/libpthread.so.0(pthread_create+0x9b0)[0x2000014b9b00]
[lassen708:108589] [ 4] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(pthread_create+0x2a4)[0x200000126414]
[lassen708:108589] [ 5] /usr/lib64/nvidia/libcuda.so.1(+0x238008)[0x200000428008]
[lassen708:108589] [ 6] /usr/lib64/nvidia/libcuda.so.1(+0x434440)[0x200000624440]
[lassen708:108589] [ 7] /usr/lib64/nvidia/libcuda.so.1(+0x3e3c4c)[0x2000005d3c4c]
[lassen708:108589] [ 8] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x12f424)[0x200001cff424]
[lassen708:108589] [ 9] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x112bdc)[0x200001ce2bdc]
[lassen708:108589] [10] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x117578)[0x200001ce7578]
[lassen708:108589] [11] /usr/lib64/nvidia/libcuda.so.1(+0x3d3b0c)[0x2000005c3b0c]
[lassen708:108589] [12] /usr/lib64/nvidia/libcuda.so.1(+0x1ef3fc)[0x2000003df3fc]
[lassen708:108589] [13] /usr/lib64/nvidia/libcuda.so.1(+0x392f54)[0x200000582f54]
[lassen708:108589] [14] /usr/lib64/nvidia/libcuda.so.1(+0xe5588)[0x2000002d5588]
[lassen708:108589] [15] /usr/lib64/nvidia/libcuda.so.1(+0xe5728)[0x2000002d5728]
[lassen708:108589] [16] /usr/lib64/nvidia/libcuda.so.1(cuLaunchKernel+0x24c)[0x2000004805ec]
[lassen708:108589] [17] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(+0xe4c4)[0x200000fce4c4]
[lassen708:108589] [18] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(cudaLaunchKernel+0x230)[0x20000101be20]
[lassen708:108589] [19] ../laghos/Laghos/cuda/laghos[0x10064238]
[lassen708:108589] [20] ../laghos/Laghos/cuda/laghos[0x1002f178]
[lassen708:108589] [21] ../laghos/Laghos/cuda/laghos[0x1002b438]
[lassen708:108589] [22] ../laghos/Laghos/cuda/laghos[0x100188c0]
[lassen708:108589] [23] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(monitor_main+0x128)[0x200000122da8]
[lassen708:108589] [24] /lib64/libc.so.6(+0x25100)[0x200001515100]
[lassen708:108589] [25] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000015152f4]
[lassen708:108589] [26] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(__libc_start_main+0xf0)[0x200000121e30]
[lassen708:108589] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node lassen708 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I haven't had a chance to test with the new glibc provided above.

Attached is the strace.out.

Created attachment 1527716 [details]
the strace for the crash
This can be used to see how the processes are started and related.
The output of the run above with LD_DEBUG is too big to attach to the bug, so I put it at: http://ssh.bencoyote.net/~ben/ldout.tar.gz

I'll spend some time looking at the output to try to grok what is going on -- things like how many threads there are and how they start. One observation I can make even this early in the analysis is that this feels more like a corruption issue than a data race. With most data races the problem is highly sensitive to interruption, and any jostling of the timing moves the problem around. This particular problem didn't seem affected by either LD_DEBUG=all (which generated about 1.9 GB of data) or strace.

Some additional notes from the original reporters: some variable values inside _dl_allocate_tls_init were visible; others were optimized out. For the call where _dl_allocate_tls_init failed, the dtv had 69 entries in it. I looked at /proc/<pid>/maps and as I recall there were 98 entries that had executable code, i.e. their line in maps had 'r-x' in it, and I thought they were all unique. I didn't track down why 98 != 69. Anyway, I don't understand all of the pieces in the _dl_allocate_tls_init code.

libhpcrun.so.0.0.0 has thread-local data. The libmonitor library is a preloaded library that wraps pthread_create. The problem appears on the fourth call to _dl_allocate_tls_init, so there are only a few threads involved. You can watch each get created with a breakpoint in pthread_create and see how they are created. The problematic thread, the one that causes the error when it is initialized, is created by NVIDIA's cuLaunchKernel, which is in a closed-source library. I believe that cuLaunchKernel only creates a thread if NVIDIA's CUPTI library is involved to monitor GPU activity. Interestingly, the whole setup works fine when profiling LULESH, but both the RAJA and CUDA versions of Laghos fail. Both of these are designed as test apps to model real HPC applications.
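For context on how libmonitor sits in the pthread_create path: an LD_PRELOAD interposer works roughly like the sketch below. This is an illustrative stand-in, not libmonitor's actual code; the real library does far more bookkeeping.

~~~
/* Minimal sketch of a pthread_create interposer in the spirit of libmonitor
 * (illustrative only).  Build with: gcc -shared -fPIC -o wrap.so wrap.c -ldl
 * and load it via LD_PRELOAD. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
{
    /* Resolve the real pthread_create on first use. */
    static pthread_create_fn real_create;
    if (real_create == NULL)
        real_create = (pthread_create_fn) dlsym(RTLD_NEXT, "pthread_create");

    fprintf(stderr, "[wrap] pthread_create called, start_routine=%p\n",
            (void *) start_routine);

    /* The real call is where the new thread's DTV is set up via
     * _dl_allocate_tls, which is where the crash in this bug happens. */
    return real_create(thread, attr, start_routine, arg);
}
~~~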
[butte5:47379] *** Process received signal *** [butte5:47379] Signal: Segmentation fault (11) [butte5:47379] Signal code: Address not mapped (1) [butte5:47379] Failing at address: (nil) [butte5:47379] [ 0] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(+0x80f8)[0x2000001180f8] [butte5:47379] [ 1] [0x2000000504d8] [butte5:47379] [ 2] /lib64/ld64.so.2(_dl_allocate_tls+0x100)[0x20000001b480] [butte5:47379] [ 3] /lib64/libpthread.so.0(pthread_create+0x9b0)[0x2000014b9ba0] [butte5:47379] [ 4] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(pthread_create+0x2a4)[0x200000126414] [butte5:47379] [ 5] /usr/lib64/nvidia/libcuda.so.1(+0x238008)[0x200000428008] [butte5:47379] [ 6] /usr/lib64/nvidia/libcuda.so.1(+0x434440)[0x200000624440] [butte5:47379] [ 7] /usr/lib64/nvidia/libcuda.so.1(+0x3e3c4c)[0x2000005d3c4c] [butte5:47379] [ 8] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x12f424)[0x200001cff424] [butte5:47379] [ 9] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x112bdc)[0x200001ce2bdc] [butte5:47379] [10] /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2(+0x117578)[0x200001ce7578] [butte5:47379] [11] /usr/lib64/nvidia/libcuda.so.1(+0x3d3b0c)[0x2000005c3b0c] [butte5:47379] [12] /usr/lib64/nvidia/libcuda.so.1(+0x1ef3fc)[0x2000003df3fc] [butte5:47379] [13] /usr/lib64/nvidia/libcuda.so.1(+0x392f54)[0x200000582f54] [butte5:47379] [14] /usr/lib64/nvidia/libcuda.so.1(+0xe5588)[0x2000002d5588] [butte5:47379] [15] /usr/lib64/nvidia/libcuda.so.1(+0xe5728)[0x2000002d5728] [butte5:47379] [16] /usr/lib64/nvidia/libcuda.so.1(cuLaunchKernel+0x24c)[0x2000004805ec] [butte5:47379] [17] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(+0xe4c4)[0x200000fce4c4] [butte5:47379] [18] /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2(cudaLaunchKernel+0x230)[0x20000101be20] [butte5:47379] [19] ../laghos/Laghos/cuda/laghos[0x10064238] [butte5:47379] [20] ../laghos/Laghos/cuda/laghos[0x1002f178] [butte5:47379] [21] ../laghos/Laghos/cuda/laghos[0x1002b438] [butte5:47379] [22] ../laghos/Laghos/cuda/laghos[0x100188c0] [butte5:47379] [23] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(monitor_main+0x128)[0x200000122da8] [butte5:47379] [24] /lib64/libc.so.6(+0x25100)[0x200001515100] [butte5:47379] [25] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000015152f4] [butte5:47379] [26] /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so(__libc_start_main+0xf0)[0x200000121e30] [butte5:47379] *** End of error message *** [ben@butte5:tls-bug]$ gdb ../laghos/Laghos/cuda/laghos butte5-laghos-47379.core GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. 
This GDB was configured as "ppc64le-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos...(no debugging symbols found)...done. [New LWP 47379] [New LWP 47426] [New LWP 47427] [New LWP 47428] [New LWP 47449] [New LWP 47450] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `../laghos/Laghos/cuda/laghos -p 0 -m ../laghos/Laghos/data/square01_quad.mesh -'. Program terminated with signal 11, Segmentation fault. #0 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002261a140) at dl-tls.c:471 471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) warning: File "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6.0.20-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py:/usr/lib/golang/src/pkg/runtime/runtime-gdb.py". To enable execution of this file add add-auto-load-safe-path /usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6.0.20-gdb.py line to your configuration file "/g/g0/ben/.gdbinit". To completely disable this security protection add set auto-load safe-path / line to your configuration file "/g/g0/ben/.gdbinit". For more information about this security protection see the Missing separate debuginfos, use: debuginfo-install libibumad-43.1.1.MLNX20171122.0eb0969-0.1.43401.1.ppc64le libibverbs-41mlnx1-OFED.4.3.2.1.6.43401.1.ppc64le libmlx4-41mlnx1-OFED.4.1.0.1.0.43401.1.ppc64le libmlx5-41mlnx1-OFED.4.3.4.0.3.43401.1.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.43401.1.ppc64le numactl-libs-2.0.9-7.el7.ppc64le opensm-libs-5.0.0.MLNX20180219.c610c42-0.1.43401.1.ppc64le openssl-libs-1.0.2k-12.el7.ppc64le ---Type <return> to continue, or q <return> to quit--- "Auto-loading safe path" section in the GDB manual. 
E.g., run from the shell: info "(gdb)Auto-loading safe path"

(gdb) set pagination off
(gdb) bt
#0  0x000020000001b434 in _dl_allocate_tls_init (result=0x20002261a140) at dl-tls.c:471
#1  __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
#2  0x00002000014b9ba0 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffffff4980) at allocatestack.c:539
#3  __pthread_create_2_1 (newthread=0x7fffffff4940, attr=0x7fffffff4980, start_routine=0x200000097ec0 <finalize_all_thread_data>, arg=0x2000083c70f0) at pthread_create.c:447
#4  0x000020000012628c in pthread_create () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#5  0x0000200000098640 in hpcrun_threadMgr_data_fini () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#6  0x0000200000084bdc in hpcrun_fini_internal () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#7  0x0000200000085558 in monitor_fini_process () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so
#8  0x00002000001229c0 in monitor_end_process_fcn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#9  0x00002000001181b4 in monitor_signal_handler () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#10 <signal handler called>
#11 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471
#12 __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
#13 0x00002000014b9ba0 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffffff5ed0) at allocatestack.c:539
#14 __pthread_create_2_1 (newthread=0x153721d8, attr=0x7fffffff5ed0, start_routine=0x200000124ba0 <monitor_begin_thread>, arg=0x200000146410 <monitor_init_tn_array+400>) at pthread_create.c:447
#15 0x0000200000126414 in pthread_create () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#16 0x0000200000428008 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x0000200000624440 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#18 0x00002000005d3c4c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#19 0x0000200001cff424 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#20 0x0000200001ce2bdc in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#21 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2
#22 0x00002000005c3b0c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#23 0x00002000003df3fc in ?? () from /usr/lib64/nvidia/libcuda.so.1
#24 0x0000200000582f54 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#25 0x00002000002d5588 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#26 0x00002000002d5728 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#27 0x00002000004805ec in cuLaunchKernel () from /usr/lib64/nvidia/libcuda.so.1
#28 0x0000200000fce4c4 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2
#29 0x000020000101be20 in cudaLaunchKernel () from /usr/tce/packages/cuda/cuda-9.2.148/lib64/libcudart.so.9.2
#30 0x0000000010064238 in vector_op_eq(int, double, double*) ()
#31 0x000000001002f178 in mfem::CudaVector::CudaVector(unsigned long, double) ()
#32 0x000000001002b438 in mfem::hydrodynamics::LagrangianHydroOperator::LagrangianHydroOperator(int, mfem::CudaFiniteElementSpace&, mfem::CudaFiniteElementSpace&, mfem::Array<int>&, mfem::CudaGridFunction&, int, double, mfem::Coefficient*, bool, bool, double, int) ()
#33 0x00000000100188c0 in main ()

(gdb) p listp
$1 = (struct dtv_slotinfo_list *) 0x0
(gdb) p result
$2 = (void *) 0x20002261a140
(gdb) p dtv
$3 = (dtv_t *) 0x11c93610
(gdb) p dtv[-1]
$4 = {counter = 106, pointer = {val = 0x6a, is_static = false}}

Created attachment 1534261 [details]
core file running under the new glibc packages
Created attachment 1534262 [details]
actual executable
Here is the perplexing part:

466       listp = GL(dl_tls_dtv_slotinfo_list);

(gdb) p *_rtld_local._dl_tls_dtv_slotinfo_list
$12 = {len = 69, next = 0x0, slotinfo = 0x2000022d4058}

471       for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)

So the for loop will go through 68 times, leaving cnt at 69 when the loop terminates. This cnt gets added to total:

514       total += cnt;

(gdb) p total
$18 = 69

but the termination condition is:

515       if (total >= GL(dl_tls_max_dtv_idx))
516         break;

(gdb) p _rtld_local._dl_tls_max_dtv_idx
$19 = 74

so we don't break out, which puts us in the case where we advance listp, whose next pointer is NULL.

(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[0].map.l_name
$36 = 0x2000000281a0 ""
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[1].map.l_name
$37 = 0x2000000281a0 ""
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[2].map.l_name
$29 = 0x200000043748 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[3].map.l_name
$30 = 0x200000046e38 "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[4].map.l_name
$31 = 0x200000048208 "/lib64/libc.so.6"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[5].map.l_name
$32 = 0x20000004bf60 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/elfutils-0.174-figlq6trgfl7hv3trmucrlmah6myu3yu/lib/libelf.so.1"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[6].map.l_name
$33 = 0x20000004dc10 "/usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[7].map.l_name
$34 = 0x11843ce0 "/lib64/libnuma.so.1"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[8].map.l_name
$35 = 0x1192a7f0 "/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so"
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map.l_name
Cannot access memory at address 0x8
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map
$38 = (struct link_map *) 0x0
(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9]
$39 = {gen = 0, map = 0x0}
(gdb) p _rtld_local._dl_tls_max_dtv_idx
$40 = 74
(gdb) p _rtld_local._dl_tls_dtv_gaps
$41 = false
So somewhere dl_tls_max_dtv_idx is getting corrupted.
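To make the failure mode concrete, the sketch below paraphrases the slotinfo walk in _dl_allocate_tls_init. It is simplified from glibc's dl-tls.c and uses the values from the gdb session above; it is illustrative code, not the actual glibc source.

~~~
/* Simplified sketch of the dtv slotinfo walk in _dl_allocate_tls_init.
 * With the values seen in the core file (len = 69, next = NULL,
 * dl_tls_max_dtv_idx = 74) the loop walks off the end of the list. */
#include <stddef.h>

struct link_map;                        /* opaque here */
struct dtv_slotinfo { size_t gen; struct link_map *map; };
struct dtv_slotinfo_list {
    size_t len;
    struct dtv_slotinfo_list *next;
    struct dtv_slotinfo *slotinfo;
};

void walk_slotinfo(struct dtv_slotinfo_list *listp, size_t max_dtv_idx)
{
    size_t total = 0;
    while (1) {
        size_t cnt;
        for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) {
            /* ... set up the TLS block for slotinfo[cnt].map ... */
        }
        total += cnt;                /* total becomes 69 in the core file   */
        if (total >= max_dtv_idx)    /* but max_dtv_idx is 74, so no break  */
            break;
        listp = listp->next;         /* next is NULL here ...               */
        /* ... so the next iteration reads listp->len: the SIGSEGV at
         * dl-tls.c:471 */
    }
}
~~~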
Breakpoint 1 at 0x20000001b118: _dl_allocate_tls_init. (2 locations)
Missing separate debuginfos, use: debuginfo-install openssl-libs-1.0.2k-12.el7.ppc64le
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>p _rtld_local._dl_tls_max_dtv_idx
>c
>end
(gdb) c
Continuing.
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$1 = 6
[New Thread 0x2000034299f0 (LWP 66689)]
<snip>
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$2 = 6
[New Thread 0x200003e999f0 (LWP 66691)]
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$3 = 6
[New Thread 0x200008b599f0 (LWP 66692)]
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$4 = 8
[New Thread 0x200020a999f0 (LWP 66705)]
[laghos] MPI is NOT CUDA aware
[laghos] CUDA device count: 4
[laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0)
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$5 = 8
[New Thread 0x2000213b99f0 (LWP 66706)]
[laghos] Cartesian partitioning will be used
[laghos] pmesh->GetNE()=256
Number of kinematic (position, velocity) dofs: 2178
Number of specific internal energy dofs: 1024
Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533
533 return _dl_allocate_tls_init (mem == NULL
$6 = 74
Program received signal SIGSEGV, Segmentation fault.
0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471
471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)
Missing separate debuginfos, use: debuginfo-install libibumad-43.1.1.MLNX20171122.0eb0969-0.1.43401.1.ppc64le libibverbs-41mlnx1-OFED.4.3.2.1.6.43401.1.ppc64le libmlx4-41mlnx1-OFED.4.1.0.1.0.43401.1.ppc64le libmlx5-41mlnx1-OFED.4.3.4.0.3.43401.1.ppc64le libnl3-3.2.28-4.el7.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.43401.1.ppc64le numactl-libs-2.0.9-7.el7.ppc64le opensm-libs-5.0.0.MLNX20180219.c610c42-0.1.43401.1.ppc64le
(gdb)
Hmm this looks suspicious (gdb) c Continuing. [New Thread 0x200020a999f0 (LWP 70543)] [laghos] MPI is NOT CUDA aware [laghos] CUDA device count: 4 [laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0) Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL (gdb) watch _rtld_local._dl_tls_max_dtv_idx Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx (gdb) c Continuing. [New Thread 0x2000213b99f0 (LWP 71117)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx Old value = 8 New value = 9 _dl_next_tls_modid () at dl-tls.c:104 104 } (gdb) bt #0 _dl_next_tls_modid () at dl-tls.c:104 #1 0x0000200000008044 in _dl_map_object_from_fd (name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", origname=0x0, fd=<optimized out>, fbp=0x7fffffff34c0, realname=0x1530a370 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fffffff3820, nsid=0) at dl-load.c:1199 #2 0x000020000000be6c in _dl_map_object (loader=0x0, name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", type=<optimized out>, trace_mode=<optimized out>, mode=<optimized out>, nsid=<optimized out>) at dl-load.c:2400 #3 0x000020000001d5b0 in dl_open_worker (a=0x7fffffff3dd0) at dl-open.c:231 #4 0x00002000000170d0 in _dl_catch_error (objname=0x7fffffff3e30, errstring=0x7fffffff3e20, mallocedp=0x7fffffff3e40, operate=0x20000001d0b0 <dl_open_worker>, args=0x7fffffff3dd0) at dl-error.c:177 #5 0x000020000001ca0c in _dl_open (file=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", mode=<optimized out>, caller_dlopen=0x200000115a20 <dlopen+128>, nsid=-2, argc=<optimized out>, argv=0x7fffffffa118, env=0x11bca9a0) at dl-open.c:649 #6 0x00002000016e1138 in dlopen_doit (a=0x7fffffff4270) at dlopen.c:66 #7 0x00002000000170d0 in _dl_catch_error (objname=0x1177f990, errstring=0x1177f998, mallocedp=0x1177f988, operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dl-error.c:177 #8 0x00002000016e1c18 in _dlerror_run (operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dlerror.c:163 #9 0x00002000016e1238 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so #11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #12 0x00002000000bbbdc in cupti_callstack_ignore_map_ignore () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #13 0x00002000000b8240 in cupti_correlation_callback_cuda () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #14 0x00002000000b9418 in cupti_subscriber_callback () from 
/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #15 0x0000200001cb4218 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #16 0x0000200001ce4778 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #17 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #18 0x00002000005c3b0c in ?? () from /usr/lib64/nvidia/libcuda.so.1 #19 0x000020000047aae0 in cuMemcpyHtoD_v2 () from /usr/lib64/nvidia/libcuda.so.1 #20 0x0000000010073b40 in mfem::rmemcpy::rHtoD(void*, void const*, unsigned long, bool) () #21 0x0000000010068ae4 in mfem::CudaFiniteElementSpace::CudaFiniteElementSpace(mfem::Mesh*, mfem::FiniteElementCollection const*, int, mfem::Ordering::Type) () #22 0x0000000010017f90 in main ()

Putting it all together, I sent this to the customer:

-------------------

When digging around in _dl_allocate_tls_init you see that listp comes from:

466       listp = GL(dl_tls_dtv_slotinfo_list);

(gdb) p *_rtld_local._dl_tls_dtv_slotinfo_list
$12 = {len = 69, next = 0x0, slotinfo = 0x2000022d4058}

471       for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt)

So the for loop will go through 68 times, leaving cnt at 69 when the loop terminates. This cnt gets added to total:

514       total += cnt;

(gdb) p total
$18 = 69

but the termination condition is:

515       if (total >= GL(dl_tls_max_dtv_idx))
516         break;

(gdb) p _rtld_local._dl_tls_max_dtv_idx
$19 = 74

so we don't break out, which puts us in the case where we advance listp, whose next pointer is NULL. But when you look at the libraries that are actually loaded, only a few of them actually use TLS, so the 74 seemed weird, especially since the vector only has 69 slots. At first I thought all the shared libs, even the ones without TLS, were inserted into this array, but when you look at other places in the dynamic linker's code where it iterates through the ELF section headers, you can see that an object is only inserted into this array when it does have TLS, which makes more sense.
This is easy to confirm: (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[0].map.l_name $36 = 0x2000000281a0 "" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[1].map.l_name $37 = 0x2000000281a0 "" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[2].map.l_name $29 = 0x200000043748 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[3].map.l_name $30 = 0x200000046e38 "/usr/tce/packages/gcc/gcc-4.9.3/gnu/lib64/libstdc++.so.6" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[4].map.l_name $31 = 0x200000048208 "/lib64/libc.so.6" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[5].map.l_name $32 = 0x20000004bf60 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/elfutils-0.174-figlq6trgfl7hv3trmucrlmah6myu3yu/lib/libelf.so.1" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[6].map.l_name $33 = 0x20000004dc10 "/usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[7].map.l_name $34 = 0x11843ce0 "/lib64/libnuma.so.1" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[8].map.l_name $35 = 0x1192a7f0 "/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/spectrum_mpi/mca_pml_pami.so" (gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[9].map.l_name Cannot access memory at address 0x8 Also I can see that the rest of the entries are also empty. So the question is why is _rtld_local._dl_tls_max_dtv_idx 74 then? Taking a look at the variable it appears sensible up to a point before it goes off the rails.: Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $3 = 6 [New Thread 0x200008b599f0 (LWP 66692)] Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $4 = 8 [New Thread 0x200020a999f0 (LWP 66705)] [laghos] MPI is NOT CUDA aware [laghos] CUDA device count: 4 [laghos] Rank_0 => Device_0 (Tesla V100-SXM2-16GB:sm_7.0) Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $5 = 8 [New Thread 0x2000213b99f0 (LWP 66706)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Number of kinematic (position, velocity) dofs: 2178 Number of specific internal energy dofs: 1024 Breakpoint 1, __GI__dl_allocate_tls (mem=<optimized out>) at dl-tls.c:533 533 return _dl_allocate_tls_init (mem == NULL $6 = 74 Program received signal SIGSEGV, Segmentation fault. 0x000020000001b434 in _dl_allocate_tls_init (result=0x20002220a140) at dl-tls.c:471 471 for (cnt = total == 0 ? 1 : 0; cnt < listp->len; ++cnt) Watching that variable at the place where it goes off the rails, I run into: (gdb) c Continuing. 
[New Thread 0x2000213b99f0 (LWP 71117)] [laghos] Cartesian partitioning will be used [laghos] pmesh->GetNE()=256 Hardware watchpoint 2: _rtld_local._dl_tls_max_dtv_idx Old value = 8 New value = 9 _dl_next_tls_modid () at dl-tls.c:104 104 } (gdb) bt #0 _dl_next_tls_modid () at dl-tls.c:104 #1 0x0000200000008044 in _dl_map_object_from_fd (name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", origname=0x0, fd=<optimized out>, fbp=0x7fffffff34c0, realname=0x1530a370 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fffffff3820, nsid=0) at dl-load.c:1199 #2 0x000020000000be6c in _dl_map_object (loader=0x0, name=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", type=<optimized out>, trace_mode=<optimized out>, mode=<optimized out>, nsid=<optimized out>) at dl-load.c:2400 #3 0x000020000001d5b0 in dl_open_worker (a=0x7fffffff3dd0) at dl-open.c:231 #4 0x00002000000170d0 in _dl_catch_error (objname=0x7fffffff3e30, errstring=0x7fffffff3e20, mallocedp=0x7fffffff3e40, operate=0x20000001d0b0 <dl_open_worker>, args=0x7fffffff3dd0) at dl-error.c:177 #5 0x000020000001ca0c in _dl_open (file=0x7fffffff5340 "/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", mode=<optimized out>, caller_dlopen=0x200000115a20 <dlopen+128>, nsid=-2, argc=<optimized out>, argv=0x7fffffffa118, env=0x11bca9a0) at dl-open.c:649 #6 0x00002000016e1138 in dlopen_doit (a=0x7fffffff4270) at dlopen.c:66 #7 0x00002000000170d0 in _dl_catch_error (objname=0x1177f990, errstring=0x1177f998, mallocedp=0x1177f988, operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dl-error.c:177 #8 0x00002000016e1c18 in _dlerror_run (operate=0x2000016e10a0 <dlopen_doit>, args=0x7fffffff4270) at dlerror.c:163 #9 0x00002000016e1238 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so #11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #12 0x00002000000bbbdc in cupti_callstack_ignore_map_ignore () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #13 0x00002000000b8240 in cupti_correlation_callback_cuda () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #14 0x00002000000b9418 in cupti_subscriber_callback () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so #15 0x0000200001cb4218 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #16 0x0000200001ce4778 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #17 0x0000200001ce7578 in ?? () from /usr/tce/packages/cuda/cuda-9.2.148/extras/CUPTI/lib64/libcupti.so.9.2 #18 0x00002000005c3b0c in ?? 
() from /usr/lib64/nvidia/libcuda.so.1 #19 0x000020000047aae0 in cuMemcpyHtoD_v2 () from /usr/lib64/nvidia/libcuda.so.1 #20 0x0000000010073b40 in mfem::rmemcpy::rHtoD(void*, void const*, unsigned long, bool) () #21 0x0000000010068ae4 in mfem::CudaFiniteElementSpace::CudaFiniteElementSpace(mfem::Mesh*, mfem::FiniteElementCollection const*, int, mfem::Ordering::Type) () #22 0x0000000010017f90 in main ()

The most interesting things in there for me are:

#10 0x0000200000115a20 in dlopen () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/ext-libs/libmonitor.so
#11 0x00002000000b9978 in cupti_lm_contains_fn () from /g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/hpctoolkit/spack/opt/spack/linux-rhel7-ppc64le/gcc-7.3.1/hpctoolkit-gpu-id5dbapvg77jnqydhnk5ko2j4i43unwr/lib/hpctoolkit/libhpcrun.so

CUPTI is calling dlopen through libmonitor.so and something goes wrong: we end up in _dlerror_run and _dl_catch_error, and in the process _dl_tls_max_dtv_idx gets incremented even though nothing is added to _rtld_local._dl_tls_dtv_slotinfo_list. Oddly, it seems like what is being dlopened over and over is:

/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos

Anyway, that is as far as I've gotten. I'm out of the TLS code now and looking at the dlopen error handling code. I haven't yet figured out what the precise problem is, but I wanted to give you an update so far and ask if there is any possibility that something libmonitor is doing may have triggered the problem.

-ben

----------------

He wrote back:

Thanks so much for your help. I think that your work has paid off! I think that we have two problems: a bug in hpctoolkit and a bug in the loader.

Based on your detective work below, we looked at the implementation of cupti_callstack_ignore_map_ignore and the functions it calls. We want to ignore procedure frames that belong to any of three NVIDIA load modules: libcuda, libcupti, and libcuda_rt. We ask whether a procedure frame belongs to one of NVIDIA's load modules by checking whether it is in the same shared library where some known NVIDIA functions are defined. (We don't want to depend on the names of the libraries or the paths to them since these are not standard.) If a load module belongs to NVIDIA, we remember it in a data structure called cupti_callstack_ignore_map.

In looking at our code for cupti_lm_contains_fn, in the following file:

https://github.com/HPCToolkit/hpctoolkit/blob/ompt-device-cfg-master/src/tool/hpcrun/sample-sources/nvidia/cupti-api.c

we found that it contains a dlopen without a matching dlclose. That clearly seems to be a bug in our code. The issue for the loader might be that when the same load module is opened several times, the TLS data structures get corrupted.

We are swamped at present but will make the change to our code, test it, and see if it resolves our problem. I think there is still a bug in the loader that our code trips. We can improve our code to not test the load map.

---------------

I'll keep looking to see if I can figure out why dlopen()'ing an already open file increments _rtld_local._dl_tls_max_dtv_idx even when nothing new is added. That seems like a much simpler problem.
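For reference, the pattern under suspicion boils down to something like the hypothetical sketch below: repeatedly dlopen()'ing an object that is already loaded (here the running executable itself, mirroring what the traces showed) without a matching dlclose(). This is an illustrative reproduction attempt, not the hpctoolkit code, and as noted below it was not sufficient on its own to trigger the crash.

~~~
/* Hypothetical minimal attempt to exercise the suspected path: dlopen an
 * already-loaded object repeatedly and never dlclose it.
 * Build with: gcc -o redlopen redlopen.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    (void) argc;
    const char *target = argv[0];  /* the already-loaded main executable */
    for (int i = 0; i < 100; ++i) {
        void *h = dlopen(target, RTLD_NOW | RTLD_GLOBAL);
        if (h == NULL)
            /* For a plain executable this dlopen fails; the question is
             * whether each failed load leaks a TLS module id. */
            fprintf(stderr, "dlopen attempt %d failed: %s\n", i, dlerror());
        /* deliberately no dlclose(h) */
    }
    return 0;
}
~~~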
It occurs to me that this could easily be a cross-platform latent bug that no one uncovered before, because what you would need to do is trigger the accounting bug (probably opening an already-open object that uses TLS) more than 70-<number of libs that use TLS> times. If a program did it fewer times, the code in _dl_allocate_tls_init is robust enough that it will just skip over those unused entries in the slotinfo table without causing a problem.

Ben,

The team and I went through the logs for this bug again and we believe this might have been fixed in rhel-7.8: https://bugzilla.redhat.com/show_bug.cgi?id=1740039

The reason we think it's this bug is because we see:

open("/g/g0/ben/Work/TLS-bug/bugs/tls-bug-reproducer/laghos/Laghos/cuda/laghos", O_RDONLY|O_CLOEXEC) = 59

in the traces. This looks like the application is trying to dlopen copies of itself, so we think this particular failure might be fixed. In the case of bug 1740039 we had a good reproducer we could use to track down the failure. Are you able to try this again with rhel-7.8 and see if you can reproduce the instability?

We got this one. The customer was VERY happy. Yes, the problem was related to dlopen'ing copies of itself.

"Thanks so much for your help. I think that your work has paid off! I think that we have two problems: a bug in hpctoolkit and a bug in the loader. Based on your detective work below, we looked at the implementation of cupti_callstack_ignore_map_ignore and the functions it calls. We want to ignore procedure frames that belong to any of three NVIDIA load modules: libcuda, libcupti, and libcuda_rt. We ask whether a procedure frame belongs to one of NVIDIA's load modules by checking whether it is in the same shared library where some known NVIDIA functions are defined. (We don't want to depend on the names of the libraries or the paths to them since these are not standard.) If a load module belongs to NVIDIA, we remember it in a data structure called cupti_callstack_ignore_map. In looking at our code for cupti_lm_contains_fn, in the following file: https://github.com/HPCToolkit/hpctoolkit/blob/ompt-device-cfg-master/src/tool/hpcrun/sample-sources/nvidia/cupti-api.c we found that it contains a dlopen without a matching dlclose. That clearly seems to be a bug in our code. The issue for the loader might be that when the same load module is opened several times, the TLS data structures get corrupted. We are swamped at present but will make the change to our code, test it, and see if it resolves our problem. I think there is still a bug in the loader that our code trips. We can improve our code to not test the load map."

Even at that time, my attempts to trigger the problem by just repeatedly dlopening the same file over and over didn't cause the problem to reproduce. Because the customer was happy and I couldn't figure out how to reproduce the problem, I didn't come back to trying to find out exactly why this triggered a problem in the TLS code in their case. It evidently was a subtle interaction between the threads, the number of libraries with TLS, and how many times they were opened.

Ben,

We are going to consider this bug fixed in rhel-7.8 then, and if we hit this again we can open another bug with another detailed analysis of the failure. I'm marking this CLOSED/DUPLICATE.

*** This bug has been marked as a duplicate of bug 1740039 ***