Bug 1817106
Summary: | glibc: ld.so appears to segfault when failing to load very large PT_LOAD segment. | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Jeff Bastian <jbastian> | ||||
Component: | glibc | Assignee: | glibc team <glibc-bugzilla> | ||||
Status: | CLOSED CANTFIX | QA Contact: | qe-baseos-tools-bugs | ||||
Severity: | low | Docs Contact: | |||||
Priority: | low | ||||||
Version: | 8.2 | CC: | ashankar, codonell, dj, efuller, fweimer, jhladky, mnewsome, pfrankli, sipoyare | ||||
Target Milestone: | rc | ||||||
Target Release: | 8.3 | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Last Closed: | 2020-04-02 13:41:32 UTC | Type: | Bug | ||||
Description
Jeff Bastian
2020-03-25 15:19:23 UTC
Output when using valgrind (run on https://beaker.engineering.redhat.com/view/gold-1s.tpb.lab.eng.brq.redhat.com#details with 48 GiB RAM):

    $ valgrind --log-file=ft.D.x.valgrind --tool=memcheck --leak-check=yes -v --leak-check=full --show-reachable=yes NPB_sources/bin/ft.D.x

     NAS Parallel Benchmarks (NPB3.3-OMP) - FT Benchmark

     Size : 2048x1024x1024
     Iterations : 25
     Number of available threads : 24

    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

    Backtrace for this error:

    Program received signal SIGABRT: Process abort signal.

    Backtrace for this error:

    Program received signal SIGABRT: Process abort signal.

    Backtrace for this error:
    Segmentation fault (core dumped)

I'm attaching valgrind log file as well.

gdb on valgrind.core:

    $ gdb NPB_sources/bin/ft.D.x ft.D.x.valgrind.core.4724
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x0000000005ce0c4d in bigarrays_ () from /lib64/libpthread.so.0
    [Current thread is 1 (Thread 0x8ace700 (LWP 4728))]

Created attachment 1673559
valgrind log file
$ valgrind --log-file=ft.D.x.valgrind --tool=memcheck --leak-check=yes -v --leak-check=full --show-reachable=yes NPB_sources/bin/ft.D.x
Server: gold-1s.tpb.lab.eng.brq.redhat.com
kernel 4.18.0-187.el8.x86_64
glibc-2.28-101.el8.x86_64
libgomp-8.3.1-5.el8.x86_64
See also sibling bug 1817111 about ldd misbehaving on this binary.

Jeff,

Thanks for submitting this issue.

We should not segfault here, we should gracefully exit.

~~~
writev(2, [{iov_base="bin/ft.D.x", iov_len=10}, {iov_base=": ", iov_len=2}, {iov_base="error while loading shared libra"..., iov_len=36}, {iov_base=": ", iov_len=2}, {iov_base="bin/ft.D.x", iov_len=10}, {iov_base=": ", iov_len=2}, {iov_base="cannot map zero-fill pages", iov_len=26}, {iov_base="", iov_len=0}, {iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 10bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map zero-fill pages
) = 89
exit_group(127) = ?
+++ exited with 127 +++
~~~

We'll look into this.

Is this blocking a customer issue?

Hi Carlos,

no, this is not blocking any customer issue.

Thanks
Jirka

Would you please clarify why you think that the SIGSEGV is generated by the glibc dynamic loader?

C reproducer:

    char large_data[128 * 1024LL * 1024 * 1024];
    int main (void) { }

(Adjust the array size if necessary.)

We can build a pseudo-dynamic-linker from nortld.S:

            .text
            .globl _start
    _start:
            ud2

Like this:

    gcc -shared -nostdlib -nostartfiles -o nortld.so nortld.S

This will crash with SIGILL when executed.

We can link the C reproducer against that:

    gcc -Wl,--dynamic-linker=./nortld.so reproducer.c

It still crashes with SIGSEGV, not SIGILL:

    ./a.out
    Segmentation fault

This suggests to me that the SIGSEGV is synthesized by the kernel, and the dynamic loader never starts running.

My understanding of how a binary gets loaded and starts running is a bit fuzzy, so it may not be the dynamic loader that's crashing, but I thought that was one of the first steps. If I'm wrong -- which is very likely given your example in comment 9 -- feel free to change the BZ component and $subject.

Please check also the valgrind output in comment #1 and #2.

Perhaps the problem is in libpthread?
gdb on valgrind.core

    $ gdb NPB_sources/bin/ft.D.x ft.D.x.valgrind.core.4724
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x0000000005ce0c4d in bigarrays_ () from /lib64/libpthread.so.0
    [Current thread is 1 (Thread 0x8ace700 (LWP 4728))]

(In reply to Jiri Hladky from comment #12)
> Please check also the valgrind output in comment #1 and #2.

I think this could be a different issue.

> Perhaps the problem is in libpthread?
>
> gdb on valgrind.core
> $ gdb NPB_sources/bin/ft.D.x ft.D.x.valgrind.core.4724
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000000005ce0c4d in bigarrays_ () from /lib64/libpthread.so.0
> [Current thread is 1 (Thread 0x8ace700 (LWP 4728))]

There is no bigarrays_ function in libpthread, so this looks rather iffy.

I think we will need your help to identify the right component.

valgrind shows the following (the full valgrind output is attached)[1]. Is it of any help?

Florian, could you please advise what is the correct component or how should we find it out?
[1]

    ==4724== ERROR SUMMARY: 25 errors from 3 contexts (suppressed: 0 from 0)
    ==4724==
    ==4724== 1 errors in context 1 of 3:
    ==4724== Invalid write of size 8
    ==4724==    at 0x401EA8: init_ui_._omp_fn.7 (ft.f:193)
    ==4724==    by 0x564E6A5: GOMP_parallel (parallel.c:171)
    ==4724==    by 0x40529C: init_ui_ (ft.f:190)
    ==4724==    by 0x400EB2: ft (ft.f:107)
    ==4724==    by 0x400EB2: main (ft.f:167)
    ==4724==  Address 0x801610450 is not stack'd, malloc'd or (recently) free'd
    ==4724==
    ==4724==
    ==4724== 23 errors in context 2 of 3:
    ==4724== Thread 12:
    ==4724== Invalid write of size 8
    ==4724==    at 0x401E98: init_ui_._omp_fn.7 (ft.f:192)
    ==4724==    by 0x565843D: gomp_thread_start (team.c:123)
    ==4724==    by 0x5CD62DD: start_thread (pthread_create.c:486)
    ==4724==    by 0x5FE9E82: clone (clone.S:95)
    ==4724==  Address 0x3b2d74420 is not stack'd, malloc'd or (recently) free'd
    ==4724==
    ==4724== ERROR SUMMARY: 25 errors from 3 contexts (suppressed: 0 from 0)

Some more info from Keith Seitz in gdb bug 1819001 comment 1:

A data point:

    $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
    Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/usr/lib64/ld-2.28.so.debug...done.
    done.
    Starting program: /usr/lib64/ld-linux-x86-64.so.2 bin/ft.D.x
    bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map zero-fill pages
    [Inferior 1 (process 7427) exited with code 0177]

If we attempt to catch mmap system call, we discover (or at least I do!) that we're exiting with the syscall exit_group. Catching that shows us where the problem is:

    (gdb) catch syscall exit_group
    Catchpoint 1 (syscall 'exit_group' [231])
    (gdb) r
    Starting program: /usr/lib64/ld-linux-x86-64.so.2 bin/ft.D.x
    bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map zero-fill pages

    Catchpoint 1 (call to syscall exit_group), __GI__exit (status=status@entry=127) at ../sysdeps/unix/sysv/linux/_exit.c:31
    31 INLINE_SYSCALL (exit_group, 1, status);
    (gdb) bt
    #0 __GI__exit (status=status@entry=127) at ../sysdeps/unix/sysv/linux/_exit.c:31
    #1 0x00007ffff7deee57 in fatal_error (errcode=<optimized out>, objname=<optimized out>, occasion=<optimized out>, errstring=0x7ffff7df5b00 "cannot map zero-fill pages") at dl-error-skeleton.c:78
    #2 0x00007ffff7deef1a in _dl_signal_error (errcode=errcode@entry=0, objname=objname@entry=0x7fffffffda91 "bin/ft.D.x", occation=occation@entry=0x0, errstring=errstring@entry=0x7ffff7df5b00 "cannot map zero-fill pages") at dl-error-skeleton.c:124
    #3 0x00007ffff7dda257 in lose (code=code@entry=0, fd=fd@entry=3, name=name@entry=0x7fffffffda91 "bin/ft.D.x", realname=realname@entry=0x7ffff7ffe150 "bin/ft.D.x", l=l@entry=0x7ffff7ffe160, msg=0x7ffff7df5b00 "cannot map zero-fill pages", r=0x7ffff7ffe120 <_r_debug>, nsid=0) at dl-load.c:851
    #4 0x00007ffff7ddabb2 in _dl_map_object_from_fd (name=name@entry=0x7fffffffda91 "bin/ft.D.x", origname=origname@entry=0x0, fd=<optimized out>, fbp=fbp@entry=0x7fffffffd090, realname=<optimized out>, loader=loader@entry=0x0, l_type=<optimized out>, mode=<optimized out>, stack_endp=<optimized out>, nsid=<optimized out>) at dl-load.c:888
    #5 0x00007ffff7ddd34a in _dl_map_object (loader=loader@entry=0x0, name=0x7fffffffda91 "bin/ft.D.x", type=type@entry=0, trace_mode=trace_mode@entry=0, mode=mode@entry=536870912, nsid=nsid@entry=0) at dl-load.c:2251
    #6 0x00007ffff7dd882c in dl_main (phdr=<optimized out>, phnum=8, user_entry=0x7fffffffd658, auxv=<optimized out>) at rtld.c:1061
    #7 0x00007ffff7dee11f in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffd730, dl_main=dl_main@entry=0x7ffff7dd6560 <dl_main>) at ../elf/dl-sysdep.c:253
    #8 0x00007ffff7dd6118 in _dl_start_final (arg=0x7fffffffd730) at rtld.c:413
    #9 _dl_start (arg=0x7fffffffd730) at rtld.c:520
    #10 0x00007ffff7dd5058 in _start ()

(In reply to Jeff Bastian from comment #15)
> Some more info from Keith Seitz in gdb bug 1819001 comment 1:
>
> A data point:
>
> $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
> Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from
> /usr/lib/debug/usr/lib64/ld-2.28.so.debug...done.
> done.
> Starting program: /usr/lib64/ld-linux-x86-64.so.2 bin/ft.D.x
> bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map
> zero-fill pages
> [Inferior 1 (process 7427) exited with code 0177]

This is not a sigsegv.

The loader exited correctly and gave you an appropriate error message.

The exit code was 127 e.g. command not found (not quite accurate).

The question is: What is happening during the SIGSEGV cases?

Florian's comment here:
https://bugzilla.redhat.com/show_bug.cgi?id=1817106#c9

Indicates that we never get to userspace (the illegal instruction is never run) and so it may be in the kernel's binfmt_elf support where it tries to map the object in and delivers the SIGSEGV because of the large PT_LOAD segment. Therefore there is no way for us to recover from that.

(In reply to Jiri Hladky from comment #14)
> I think we will need your help to identify the right component.
>
> valgrind shows the following (the full valgrind output is attached)[1]. Is
> it of any help?

I think all the valgrind issues reported here so far are different bugs (if they are bugs at all). The kernel does not actually run any userspace code in this case, so there can't be anything for valgrind to report.

> Florian, could you please advise what is the correct component or how should
> we find it out?
I filed kernel bug 1820095 for the confusing segfault (mentioned in the summary of this bug). Beyond that, it's not clear to me what else we can do. With an explicit loader invocation, glibc already prints a fairly accurate error message (“cannot map zero-fill pages”). And as I said, the valgrind issues discussed here are something else and do not really point to glibc problems either (the one trace which mentions libpthread seems to have hit some missing/incorrect debuginfo).

I suggest we close this bug as CANTFIX. If you want to track down the valgrind issues, you should file valgrind bugs. Sorry.

Thank you, Florian!
> I suggest we close this bug as CANTFIX. If you want to track down the valgrind issues, you should file valgrind bugs. Sorry.
OK, I understand.
@Jeff - I will leave the final decision to you.
Jirka
I expected this might be a CANTFIX bug since it's a rather odd corner case. But I appreciate everyone taking time to dig into this, and I learned a few new debugging tricks in the process.
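
For anyone who lands here after seeing the same "cannot map zero-fill pages" message: a huge uninitialized array like the one in the C reproducer typically ends up in .bss, so the PT_LOAD segment's in-memory size (p_memsz) is far larger than its size in the file (p_filesz), and the difference has to be mapped as zero-fill pages. Below is a minimal C sketch for spotting such segments before running a binary. The function names zero_fill_bytes and report_large_zero_fill are made up for illustration, and the code assumes a 64-bit ELF file in the host's byte order on a Linux system that provides <elf.h>. It is not taken from glibc or the kernel and does not reproduce their logic.

```c
#include <assert.h>
#include <elf.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: number of bytes of a PT_LOAD segment that are not
   backed by file contents and therefore need zero-fill at load time.
   Returns 0 for other segment types or when p_memsz <= p_filesz.  */
uint64_t
zero_fill_bytes (const Elf64_Phdr *ph)
{
  assert (ph != NULL);
  if (ph->p_type != PT_LOAD || ph->p_memsz <= ph->p_filesz)
    return 0;
  return ph->p_memsz - ph->p_filesz;
}

/* Hypothetical helper: print every program header of the ELF64 file at
   PATH whose zero-fill part exceeds LIMIT bytes.  Returns the number of
   such segments, or -1 if the file cannot be read or is not a 64-bit
   ELF file with the expected program header entry size.  Assumes the
   file uses the host's byte order.  */
int
report_large_zero_fill (const char *path, uint64_t limit)
{
  FILE *f = fopen (path, "rb");
  if (f == NULL)
    return -1;

  Elf64_Ehdr eh;
  int found = -1;
  if (fread (&eh, sizeof eh, 1, f) == 1
      && memcmp (eh.e_ident, ELFMAG, SELFMAG) == 0
      && eh.e_ident[EI_CLASS] == ELFCLASS64
      && eh.e_phentsize == sizeof (Elf64_Phdr))
    {
      found = 0;
      for (unsigned int i = 0; i < eh.e_phnum; i++)
        {
          Elf64_Phdr ph;
          long off = (long) (eh.e_phoff + (uint64_t) i * eh.e_phentsize);
          if (fseek (f, off, SEEK_SET) != 0
              || fread (&ph, sizeof ph, 1, f) != 1)
            {
              found = -1;   /* Truncated or unreadable header table.  */
              break;
            }
          uint64_t zf = zero_fill_bytes (&ph);
          if (zf > limit)
            {
              printf ("program header %u: %" PRIu64 " zero-fill bytes\n",
                      i, zf);
              found++;
            }
        }
    }
  fclose (f);
  return found;
}
```

Calling something like report_large_zero_fill ("bin/ft.D.x", (uint64_t) 1 << 30) from a small main would list any segment that needs more than 1 GiB of zero-fill. Whether a given size can actually be mapped depends on the machine's memory and overcommit settings, so the 1 GiB threshold here is only an illustrative choice.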