Bug 1819001
Summary: gdb cannot debug elf binary with large PT_LOAD segment
Product: Red Hat Enterprise Linux 8
Reporter: Jeff Bastian <jbastian>
Component: gdb
Assignee: Keith Seitz <keiths>
gdb sub component: system-version
QA Contact: qe-baseos-tools-bugs
Status: CLOSED UPSTREAM
Severity: low
Priority: low
CC: codonell, dsmith, efuller, gdb-bugs, jhladky, keiths, ohudlick
Version: 8.2
Keywords: Triaged
Target Milestone: rc
Target Release: 8.3
Hardware: All
OS: Linux
Last Closed: 2021-04-15 19:01:39 UTC
Type: Bug
Description
Jeff Bastian
2020-03-30 22:20:49 UTC
(In reply to Jeff Bastian from comment #0)
> Description of problem:

*Thank you* for the excellent reproducer!

> Actual results:
> gdb does not help to explain why the binary is crashing so early
>
> Expected results:
> gdb gives some helpful clues?

A data point:

$ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/usr/lib64/ld-2.28.so.debug...done.
done.
Starting program: /usr/lib64/ld-linux-x86-64.so.2 bin/ft.D.x
bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map zero-fill pages
[Inferior 1 (process 7427) exited with code 0177]

If we attempt to catch the mmap system call, we discover (or at least I do!) that we're exiting with the syscall exit_group. Catching that shows us where the problem is:

(gdb) catch syscall exit_group
Catchpoint 1 (syscall 'exit_group' [231])
(gdb) r
Starting program: /usr/lib64/ld-linux-x86-64.so.2 bin/ft.D.x
bin/ft.D.x: error while loading shared libraries: bin/ft.D.x: cannot map zero-fill pages

Catchpoint 1 (call to syscall exit_group), __GI__exit (status=status@entry=127) at ../sysdeps/unix/sysv/linux/_exit.c:31
31 INLINE_SYSCALL (exit_group, 1, status);
(gdb) bt
#0 __GI__exit (status=status@entry=127) at ../sysdeps/unix/sysv/linux/_exit.c:31
#1 0x00007ffff7deee57 in fatal_error (errcode=<optimized out>, objname=<optimized out>, occasion=<optimized out>, errstring=0x7ffff7df5b00 "cannot map zero-fill pages") at dl-error-skeleton.c:78
#2 0x00007ffff7deef1a in _dl_signal_error (errcode=errcode@entry=0, objname=objname@entry=0x7fffffffda91 "bin/ft.D.x", occation=occation@entry=0x0, errstring=errstring@entry=0x7ffff7df5b00 "cannot map zero-fill pages") at dl-error-skeleton.c:124
#3 0x00007ffff7dda257 in lose (code=code@entry=0, fd=fd@entry=3, name=name@entry=0x7fffffffda91 "bin/ft.D.x", realname=realname@entry=0x7ffff7ffe150 "bin/ft.D.x", l=l@entry=0x7ffff7ffe160, msg=0x7ffff7df5b00 "cannot map zero-fill pages", r=0x7ffff7ffe120 <_r_debug>, nsid=0) at dl-load.c:851
#4 0x00007ffff7ddabb2 in _dl_map_object_from_fd (name=name@entry=0x7fffffffda91 "bin/ft.D.x", origname=origname@entry=0x0, fd=<optimized out>, fbp=fbp@entry=0x7fffffffd090, realname=<optimized out>, loader=loader@entry=0x0, l_type=<optimized out>, mode=<optimized out>, stack_endp=<optimized out>, nsid=<optimized out>) at dl-load.c:888
#5 0x00007ffff7ddd34a in _dl_map_object (loader=loader@entry=0x0, name=0x7fffffffda91 "bin/ft.D.x", type=type@entry=0, trace_mode=trace_mode@entry=0, mode=mode@entry=536870912, nsid=nsid@entry=0) at dl-load.c:2251
#6 0x00007ffff7dd882c in dl_main (phdr=<optimized out>, phnum=8, user_entry=0x7fffffffd658, auxv=<optimized out>) at rtld.c:1061
#7 0x00007ffff7dee11f in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffd730, dl_main=dl_main@entry=0x7ffff7dd6560 <dl_main>) at ../elf/dl-sysdep.c:253
#8 0x00007ffff7dd6118 in _dl_start_final (arg=0x7fffffffd730) at rtld.c:413
#9 _dl_start (arg=0x7fffffffd730) at rtld.c:520
#10 0x00007ffff7dd5058 in _start ()

I don't think there is much gdb can do about this...

I suspected there wasn't much gdb could do, but I actually learned gdb was more capable than I realized. How did I never know gdb could catch syscalls? Thank you for teaching me a new trick!

This is useful info for bug 1817106.

(In reply to Keith Seitz from comment #1)
> $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x

When you run under the loader directly the loader is responsible for the mappings (not the kernel) and so we get a graceful exit.

> I don't think there is much gdb can do about this...

What happens when you debug the process directly?

(In reply to Carlos O'Donell from comment #3)
> (In reply to Keith Seitz from comment #1)
> > $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
>
> When you run under the loader directly the loader is responsible for the
> mappings (not the kernel) and so we get a graceful exit.
>
> > I don't think there is much gdb can do about this...
>
> What happens when you debug the process directly?

To be clear, the supposition right now is that the kernel is artificially delivering the SIGSEGV at exec time.

(In reply to Carlos O'Donell from comment #4)
> (In reply to Carlos O'Donell from comment #3)
> > (In reply to Keith Seitz from comment #1)
> > > $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
> >
> > When you run under the loader directly the loader is responsible for the
> > mappings (not the kernel) and so we get a graceful exit.
> >
> > > I don't think there is much gdb can do about this...
> >
> > What happens when you debug the process directly?
>
> To be clear, the supposition right now is that the kernel is artificially
> delivering the SIGSEGV at exec time.

And if so, how do you improve this use case?

(In reply to Carlos O'Donell from comment #3)
> (In reply to Keith Seitz from comment #1)
> > $ /usr/bin/gdb -q -ex r --args /lib64/ld-linux-x86-64.so.2 bin/ft.D.x
>
> When you run under the loader directly the loader is responsible for the
> mappings (not the kernel) and so we get a graceful exit.
>
> > I don't think there is much gdb can do about this...
>
> What happens when you debug the process directly?

Normally, we (as in "gdb") should be able to stop in the startup code, but not in this case:

(gdb) b *_start
Breakpoint 1 at 0x400af0
(gdb) r
Starting program: /home/rhel8/rhbz/1819001/NPB3.3.1/NPB3.3-OMP/bin/ft.D.x
During startup program terminated with signal SIGSEGV, Segmentation fault.

That's a pretty serious indicator that something is amiss. If we enable some debugging:

$ /usr/bin/gdb -q bin/ft.D.x
Reading symbols from bin/ft.D.x...done.
(gdb) set startup-with-shell 0
(gdb) set debug lin-lwp 1
(gdb) r
Starting program: /home/rhel8/rhbz/1819001/NPB3.3.1/NPB3.3-OMP/bin/ft.D.x
sigchld
linux_nat_wait: [process 14314], []
LLW: enter
LNW: waitpid(-1, ...) returned 14314, ERRNO-OK
LLW: waitpid 14314 received Segmentation fault (stopped)
LNW: waitpid(-1, ...) returned 0, ERRNO-OK
RSRL: NOT resuming LWP process 14314, has pending status
LLW: exit
LLR: Preparing to resume process 14314, Segmentation fault, inferior_ptid process 14314
LLR: PTRACE_CONT process 14314, Segmentation fault (resume event thread)
linux_nat_wait: [process 14314], []
RSRL: NOT resuming LWP process 14314, not stopped
LLW: enter
LNW: waitpid(-1, ...) returned 0, ERRNO-OK
RSRL: NOT resuming LWP process 14314, not stopped
linux-nat: about to sigsuspend
sigchld
LNW: waitpid(-1, ...) returned 14314, ERRNO-OK
LLW: waitpid 14314 received Segmentation fault (terminated)
LWP 14314 exited (resumed=1)
LNW: waitpid(-1, ...) returned -1, No child processes
RSRL: NOT resuming LWP process 14314, has pending status
LLW: exit
During startup program terminated with signal SIGSEGV, Segmentation fault.

The very first call to waitpid shows the process has been terminated. The returned wstatus is 0xb7f. That really doesn't tell us much other than the process stopped with a segmentation fault, as reported.

Unless there is more info to be gleaned from somewhere, I am not entirely sure what we can do other than make the error message even more verbose.

dmesg reports a little bit more about the underlying problem:

[Wed Apr 1 15:06:29 2020] ft.D.x[14038]: segfault at 7ffff4a7353b ip 00007ffff>
[Wed Apr 1 15:06:29 2020] Code: Bad RIP value.

Is that information particularly more useful, though, even to the above-average user?

(In reply to Keith Seitz from comment #6)
> During startup program terminated with signal SIGSEGV, Segmentation fault.

This isn't quite true because "startup" never happened. The message that would be more accurate is: "The operating system kernel has terminated the process *before* startup with signal SIGSEGV, segmentation fault."

Could we reliably print something like that?
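
For reference, the wstatus value 0xb7f quoted in comment #6 can be decoded with the standard wait-status macros. The following is only a minimal Python sketch, assuming the Linux status encoding; describe_wait_status is a hypothetical helper name, and 0x000b is an illustrative value, not one taken from the debug log above.

```python
import os
import signal


def describe_wait_status(wstatus: int) -> str:
    """Describe a raw waitpid() status using the standard W* macros.

    Hypothetical helper for illustration; assumes Linux status encoding.
    """
    if os.WIFSTOPPED(wstatus):
        # Low byte 0x7f: the child is stopped (e.g. under ptrace);
        # the stopping signal is in the next byte.
        sig = signal.Signals(os.WSTOPSIG(wstatus))
        return f"stopped by {sig.name}"
    if os.WIFSIGNALED(wstatus):
        # The child was killed by a signal.
        sig = signal.Signals(os.WTERMSIG(wstatus))
        return f"terminated by {sig.name}"
    if os.WIFEXITED(wstatus):
        # The child called exit(); the code is in the second byte.
        return f"exited with code {os.WEXITSTATUS(wstatus)}"
    return "unknown status"


if __name__ == "__main__":
    # Value quoted in the comment above.
    print(describe_wait_status(0xB7F))
    # Illustrative value only: what a plain SIGSEGV kill would look like.
    print(describe_wait_status(0x000B))
```

Under these assumptions, 0xb7f decodes as a stop rather than a termination: the low byte 0x7f marks a stopped child and the next byte, 0x0b (11), is SIGSEGV. That would match the "(stopped)" entry in the lin-lwp log, with the later "(terminated)" entry corresponding to a status for which WIFSIGNALED is true.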

Pedro, David, and I have discussed this, and we are not certain there is anything that can be done to improve the user experience here in the short term. Moving upstream.