Created attachment 362871 [details] the reproducer's sources Description of problem: ltrace doesn't properly handle threads, so that kernel sents SIGTRAP signal to traced process with default disposition (SIG_DFL), which leads to premature process termination. The real reason of this bug is that the linux kernel thinks that we don't want to trace a thread, because of wrong/unimplemented ptrace option (PTRACE_O_TRACECLONE). See the section `Additional info' for investigation results. Version-Release number of selected component (if applicable): RHEL5 (0.5-13.45svn.el5), also on RHEL4 and Fedora 11 How reproducible: almost always Steps to Reproduce: 1. Build the attached reproducer: $ gcc -o reproducer reproducer.c -lpthread 2. Run the reproducer (or run it repeatedly till you see the SIGTRAP): $ ltrace ./reproducer Actual results: $ ltrace ./reproducer __libc_start_main(0x4006b3, 1, 0x7fff7ce47e78, 0x400710, 0x400700 <unfinished ...> pthread_create(0x7fff7ce47d88, 0, 0x40065c, 0, 0x3959569300 <unfinished ...> +++ killed by SIGTRAP +++ $ Expected results: ltrace shouldn't be killed Additional info: If we look at ltrace's debug output using command line option `-dd': === snip === ... DEBUG: process_event.c:109: process_event(): event: syscall (SYS_clone [56]) *** 1) DEBUG: breakpoints.c:128: disable_all_breakpoints(): Disabling breakpoints for pid 32352... *** 2) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400508) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400518) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400528) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400538) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400548) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400558) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400599) DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x4006cf) DEBUG: process_event.c:115: process_event(): event: sysret (SYS_clone [56]) *** 3) DEBUG: process_event.c:109: process_event(): event: syscall (SYS_futex [202]) *** 4) DEBUG: breakpoints.c:109: enable_all_breakpoints(): Enabling breakpoints for pid 32352... *** 5) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400508) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400518) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400528) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400538) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400548) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400558) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400599) DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x4006cf) DEBUG: process_event.c:103: process_event(): event: exit signal (SIGTRAP [5]) *** 6) +++ killed by SIGTRAP +++ DEBUG: process_event.c:162: remove_proc(): Removing pid 32352 ... === end of snip === We can see in 1) entering clone syscall, thus removing all breakpoints 2) from the program, because ltrace does checking for system calls that may create childs and removes breakpoints before letting a traced program to continue. Ltrace wrongly considers there a thread as a child, but this is due to a previous implementation of threads - linuxthreads, where threads were de facto distinct processes. But this isn't the problem itself. 3) show us the return from pthread_create() and 4) calling futex syscall means waiting in pthread_join() for notiffication from kernel about thread termination. The problem arises during 5) when all breakpoints in traced program are turned on again due to previous futex syscall, because syscall isn't considered as system call that can create childs. You can find this check/bug in process_event()->process_syscall()->fork_p(). 6) The problem itself, the created thread stepped onto breakpoint, we can see this from produced coredump: Core was generated by `./reproducer'. Program terminated with signal 5, Trace/breakpoint trap. #0 0x0000000000400509 in printf@plt () (gdb) bt #0 0x0000000000400509 in printf@plt () #1 0x00000000004006a0 in print_tid () at reproducer.c:16 #2 0x0000003959e0686a in start_thread () from /lib64/libpthread.so.0 #3 0x00000039592de25d in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () (gdb) x/i 0x0000000000400509 0x400509 <printf@plt+1>: and $0x2005aa,%eax (gdb) x/i 0x0000000000400508 0x400508 <printf@plt>: int3 (gdb) disas 0x400508 Dump of assembler code for function printf@plt: 0x0000000000400508 <printf@plt+0>: int3 0x0000000000400509 <printf@plt+1>: and $0x2005aa,%eax 0x000000000040050e <printf@plt+6>: pushq $0x0 0x0000000000400513 <printf@plt+11>: jmpq 0x4004f8 End of assembler dump. (gdb) info registers rax 0x0 0 rbx 0x0 0 rcx 0x22e 558 rdx 0x3d33 15667 rsi 0x22e 558 rdi 0x4007f8 4196344 rbp 0x7f7f6ad7a100 0x7f7f6ad7a100 rsp 0x7f7f6ad7a0e8 0x7f7f6ad7a0e8 r8 0x4007fe 4196350 r9 0x1 1 r10 0x22 34 r11 0x202 514 r12 0x3959e10d80 246321057152 r13 0x0 0 r14 0x0 0 r15 0x3 3 rip 0x400509 0x400509 <printf@plt+1> eflags 0x297 [ CF PF AF SF IF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 fctrl 0x37f 895 fstat 0x0 0 ftag 0xffff 65535 fiseg 0x0 0 fioff 0x0 0 foseg 0x0 0 fooff 0x0 0 fop 0x0 0 mxcsr 0x1f80 [ IM DM ZM OM UM PM ] (gdb) If we dig into kernel sources we can see, as I mentioned earlier, that the debugged thread isn't set in kernel to be traced, so that kernel kills it with SIGTRAP and disposition SIG_DFL. In the latest ltrace version from git, this is fixed/implemented already, you can find first mention about this at: http://git.debian.org/?p=collab-maint/ltrace.git;a=commit;h=2721e6adbc7399e1b621b896b4f5b864082fbf03 but it really works from this commit: http://git.debian.org/?p=collab-maint/ltrace.git;a=commit;h=6be8028492b29f68126ce413b7ef4b875c7bb017 I have created a patch based on the diff-comits with small changes to preserve naming convention of older sources from RHEL5 (+ removing some parts of previous patches (now useless) + fixing compilation errors). The patch fixes this bug and also passes all the tests (make check). Another problem still persists even with the latest sources, the ltrace's author haven't implemented yet the support for having multiple threads sharing the same code with breakpoints inserted there: http://lists.alioth.debian.org/pipermail/ltrace-devel/2006-April/000035.html http://lists.alioth.debian.org/pipermail/ltrace-devel/2006-April/000036.html But it seems that he is planning to fix this in a near future: http://git.debian.org/?p=collab-maint/ltrace.git;a=blob;f=TODO;h=5063737726b1f1ecdfe59159b722f0807497f8f6;hb=163765561621a9127cee3243dd94011e70a2c580
Created attachment 362874 [details] temporary patch
Also occurs on Fedora Core 13
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Confirmed that the upstream version fixes this issue. I'll look into how difficult would it be to port this over to RHEL (or rather the other way around, to merge our patches upstream; note we cannot simply rebase, ltrace hasn't been able to pass its own testsuite for quite some time now).
Just a note, upstream started 0.6 release and is looking for ltrace patches laying all around. Seems like a good time to merge RHEL & Fedora patches upstream.
Created attachment 515874 [details] Source RPM with rebased patchset This src rpm includes the complete patch set. I'll spin scratch binaries next and link them here. This works on Fedora 14 on x86_64, I haven't had a chance to check this anywhere else yet. Test case is included.
Created attachment 516201 [details] Source RPM with rebased patchset On RHEL 5, process status of stopped process is always reported as "T" in /proc/pid/status, regardless of whether it's in tracing or job control stop. The status is clarified further on the line. This SRPM contains an additional patch that implements this. Multi-threaded tracing now works for me on x86_64 RHEL 5 machine. Scratch build for x86_64 here: https://brewweb.devel.redhat.com/taskinfo?taskID=3530264
(In reply to comment #45) > Created attachment 516201 [details] > Source RPM with rebased patchset > > On RHEL 5, process status of stopped process is always reported as "T" in > /proc/pid/status, regardless of whether it's in tracing or job control stop. > The status is clarified further on the line. This SRPM contains an additional > patch that implements this. Multi-threaded tracing now works for me on x86_64 > RHEL 5 machine. Dear Tanaka-san, Would you please rebuild attached SRPM on x86_64 and verify it, if possible? Best Regards, Masaki Furuta
Verified that the same SRPM works also on i386, ppc and ppc64.
Created attachment 516376 [details] Source RPM with rebased patchset This includes an extra patch that fixes a race between waiting for SIGSTOP to be delivered to all tasks, and one of those tasks exiting. Furthermore, I verified that multi-threaded racing also works on s390 and s390x.
Created attachment 516786 [details] Source RPM with rebased patchset This SRPM contains one additional correctness fix and two race condition fixes.
Created attachment 516896 [details] Additional compilation fix for ia64
Hi Furuta-san, NEC confirmed the fix with the test package. Thanks, Ken'ichi Tanaka
Created attachment 518786 [details] Partial fix This is partial fix that removes a race between obtaining list of tasks and attaching to those tasks. This fixes SIGSEGVs that ltrace was getting and copious "can't attach" messages that it was putting out when it tried to attach to processes with many (100-ish) threads. I still see a different attach problem, where the tracee dies of SIGTRAP when ltrace attempts to attach. Similarly, detach problems were not addressed yet.
Created attachment 519070 [details] Additional fix for attach This is additional patch that amends the previous "Partial fix" and fixes attach to many-threaded processes. The gist of the addressed problem is that some tasks were still running unattached while the breakpoints were already inserted.
Created attachment 519156 [details] Additional fix for detach This is additional patch that fixes a number of bugs in detach logic that would leave tracee in inconsistent state (with pending events, instruction pointers pointing mid-instruction, breakpoints left inside, and perhaps others). As a result, the tracee would be killed, or could generally change behavior as the result of tracing. I have ltrace with this fix spinning on my Fedora machine, attaching and detaching to a multi-threaded process, and it seems to be stable and well-behaved, at least on x86_64. I'll do more thorough cross-arch testing on Monday.
Created attachment 522392 [details] Additional fix Additional fix for proper syscall detection, ported from upstream.
(In reply to comment #71) > Created attachment 522392 [details] > Additional fix > > Additional fix for proper syscall detection, ported from upstream. Dear NEC, Would you please verify this as well, if possible? Best Regards, Masaki Furuta
Patch series updated for additional of tracing of ppc64 binaries. Fix in CVS.
Patch series updated for additional fixlets in test suite and error reporting, and overzealous syscall detection on s390, introduced in past fix, removed. Fix in CVS.
Dear Furuta-san, >>Dear NEC, >> >>Would you please verify this as well, if possible? I confirmed this problem is fixed on server updated ltrace-0.5-syscall_p.patch.
Dear NEC, Sorry for bothering you ,but As per suggestion from engineering, could you please verify our latest ltrace package again? I'm uploading ltrace-0.5-13.45svn.el5_7.4 at http://people.redhat.com/~mfuruta/.526007/ Best Regards, Masaki Furuta
Dear Furuta-San, >> Sorry for bothering you ,but As per suggestion from engineering, could you >> please verify our latest ltrace package again? I'm uploading >> ltrace-0.5-13.45svn.el5_7.4 at http://people.redhat.com/~mfuruta/.526007/ I confirmed this problem is fixed with ltrace-0.5-13.45svn.el5_7.4.
Patch series updated to not attempt to run a binary from unsupported architecture, which was a regression. This is just a nit related to ltrace start-up, and should have no impact to the tracing itself.
Created attachment 525218 [details] Additional fix for vfork Now that ltrace recognizes threads, it also triggers when vfork is called, but it mishandles it. This patch implements the fixes necessary for proper vfork support. Only verified on x86_64 as of now, I may update the fix if it turns out to be broken on other arches.
The above patch is racy, we need to re-enable the vfork breapoint in _parent_, not _leader_. The fix is trivial and already in CVS.
Dear Samukawa-san, Could you verify a new test package? Our engineer believes it is complete. http://people.redhat.com/myamazak/bz526007/ Best regards, M Yamazaki
Dear Yamazaki-San, >>Could you verify a new test package? Our engineer believes it is complete. >>http://people.redhat.com/myamazak/bz526007/ I confirmed this problem is fixed with ltrace-0.5-13.45svn.el5_7.7.
Dear Samukawa-san, Thank you for the confirmation and feedback. Regards, M Yamazaki
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1317.html