Bug 526007 - ltrace cannot properly handle multi-threaded processes
Summary: ltrace cannot properly handle multi-threaded processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: ltrace
Version: 5.3
Hardware: All
OS: Linux
urgent
high
Target Milestone: rc
: ---
Assignee: Petr Machata
QA Contact: Miloš Prchlík
URL:
Whiteboard:
Depends On:
Blocks: 668957 719046 733216 742340 878756 978304
TreeView+ depends on / blocked
 
Reported: 2009-09-28 09:51 UTC by Martin Osvald 🛹
Modified: 2018-11-27 21:41 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-30 22:39:01 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
the reproducer's sources (410 bytes, text/plain)
2009-09-28 09:51 UTC, Martin Osvald 🛹
no flags Details
temporary patch (25.41 KB, patch)
2009-09-28 09:53 UTC, Martin Osvald 🛹
no flags Details | Diff
Source RPM with rebased patchset (215.67 KB, application/x-rpm)
2011-07-29 11:19 UTC, Petr Machata
no flags Details
Source RPM with rebased patchset (217.33 KB, application/x-rpm)
2011-08-01 19:14 UTC, Petr Machata
no flags Details
Source RPM with rebased patchset (218.49 KB, application/x-rpm)
2011-08-02 18:16 UTC, Petr Machata
no flags Details
Source RPM with rebased patchset (219.46 KB, application/x-rpm)
2011-08-04 20:36 UTC, Petr Machata
no flags Details
Additional compilation fix for ia64 (717 bytes, patch)
2011-08-05 13:39 UTC, Petr Machata
no flags Details | Diff
Partial fix (2.68 KB, patch)
2011-08-18 02:04 UTC, Petr Machata
no flags Details | Diff
Additional fix for attach (8.10 KB, patch)
2011-08-19 17:24 UTC, Petr Machata
no flags Details | Diff
Additional fix for detach (9.10 KB, patch)
2011-08-20 21:14 UTC, Petr Machata
no flags Details | Diff
Additional fix (3.97 KB, patch)
2011-09-09 19:35 UTC, Petr Machata
no flags Details | Diff
Additional fix for vfork (26.74 KB, patch)
2011-09-27 21:38 UTC, Petr Machata
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Legacy) 42871 0 None None None Never
Red Hat Product Errata RHBA-2013:1317 0 normal SHIPPED_LIVE ltrace bug fix update 2013-09-30 21:13:19 UTC

Description Martin Osvald 🛹 2009-09-28 09:51:04 UTC
Created attachment 362871 [details]
the reproducer's sources

Description of problem:

ltrace doesn't properly handle threads, so that kernel sents SIGTRAP signal to traced process with default disposition (SIG_DFL), which leads to premature process termination.

The real reason of this bug is that the linux kernel thinks that we don't want to trace a thread, because of wrong/unimplemented ptrace option (PTRACE_O_TRACECLONE). See the section `Additional info' for investigation results.


Version-Release number of selected component (if applicable):

RHEL5 (0.5-13.45svn.el5), also on RHEL4 and Fedora 11


How reproducible:

almost always


Steps to Reproduce:

1. Build the attached reproducer:

  $ gcc -o reproducer reproducer.c -lpthread

2. Run the reproducer (or run it repeatedly till you see the SIGTRAP):

  $ ltrace ./reproducer

 
Actual results:

$ ltrace ./reproducer
__libc_start_main(0x4006b3, 1, 0x7fff7ce47e78, 0x400710, 0x400700 <unfinished ...>
pthread_create(0x7fff7ce47d88, 0, 0x40065c, 0, 0x3959569300 <unfinished ...>
+++ killed by SIGTRAP +++
$


Expected results:

ltrace shouldn't be killed


Additional info:

If we look at ltrace's debug output using command line option `-dd':

=== snip ===
...
DEBUG: process_event.c:109: process_event(): event: syscall (SYS_clone [56])     *** 1)
DEBUG: breakpoints.c:128: disable_all_breakpoints(): Disabling breakpoints for pid 32352... *** 2)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400508)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400518)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400528)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400538)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400548)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400558)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x400599)
DEBUG: breakpoint.c:55: disable_breakpoint(): disable_breakpoint(32352,0x4006cf)
DEBUG: process_event.c:115: process_event(): event: sysret (SYS_clone [56])      *** 3)
DEBUG: process_event.c:109: process_event(): event: syscall (SYS_futex [202])    *** 4)
DEBUG: breakpoints.c:109: enable_all_breakpoints(): Enabling breakpoints for pid 32352... *** 5)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400508)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400518)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400528)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400538)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400548)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400558)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x400599)
DEBUG: breakpoint.c:25: enable_breakpoint(): enable_breakpoint(32352,0x4006cf)
DEBUG: process_event.c:103: process_event(): event: exit signal (SIGTRAP [5])    *** 6)
+++ killed by SIGTRAP +++
DEBUG: process_event.c:162: remove_proc(): Removing pid 32352
...
=== end of snip ===

We can see in 1) entering clone syscall, thus removing all breakpoints 2) from the program, because ltrace does checking for system calls that may create childs and removes breakpoints before letting a traced program to continue. Ltrace wrongly considers there a thread as a child, but this is due to a previous implementation of threads - linuxthreads, where threads were de facto distinct processes. But this isn't the problem itself.

3) show us the return from pthread_create() and 4) calling futex syscall means waiting in pthread_join() for notiffication from kernel about thread termination. The problem arises during 5) when all breakpoints in traced program are turned on again due to previous futex syscall, because syscall isn't considered as system call that can create childs. You can find this check/bug in process_event()->process_syscall()->fork_p().

6) The problem itself, the created thread stepped onto breakpoint, we can see this from produced coredump:

Core was generated by `./reproducer'.         
Program terminated with signal 5, Trace/breakpoint trap.
#0  0x0000000000400509 in printf@plt ()                 

(gdb) bt                                                                 
#0  0x0000000000400509 in printf@plt ()                                  
#1  0x00000000004006a0 in print_tid () at reproducer.c:16                
#2  0x0000003959e0686a in start_thread () from /lib64/libpthread.so.0    
#3  0x00000039592de25d in clone () from /lib64/libc.so.6                 
#4  0x0000000000000000 in ?? ()                                          

(gdb) x/i 0x0000000000400509
0x400509 <printf@plt+1>:        and    $0x2005aa,%eax
(gdb) x/i 0x0000000000400508
0x400508 <printf@plt>:  int3   

(gdb) disas 0x400508
Dump of assembler code for function printf@plt:
0x0000000000400508 <printf@plt+0>:      int3   
0x0000000000400509 <printf@plt+1>:      and    $0x2005aa,%eax
0x000000000040050e <printf@plt+6>:      pushq  $0x0          
0x0000000000400513 <printf@plt+11>:     jmpq   0x4004f8
End of assembler dump.

(gdb) info registers
rax            0x0      0
rbx            0x0      0
rcx            0x22e    558
rdx            0x3d33   15667
rsi            0x22e    558
rdi            0x4007f8 4196344
rbp            0x7f7f6ad7a100   0x7f7f6ad7a100
rsp            0x7f7f6ad7a0e8   0x7f7f6ad7a0e8
r8             0x4007fe 4196350
r9             0x1      1
r10            0x22     34
r11            0x202    514
r12            0x3959e10d80     246321057152
r13            0x0      0
r14            0x0      0
r15            0x3      3
rip            0x400509 0x400509 <printf@plt+1>
eflags         0x297    [ CF PF AF SF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x37f    895
fstat          0x0      0
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1f80   [ IM DM ZM OM UM PM ]
(gdb)

If we dig into kernel sources we can see, as I mentioned earlier, that the debugged thread isn't set in kernel to be traced, so that kernel kills it with SIGTRAP and disposition SIG_DFL.


In the latest ltrace version from git, this is fixed/implemented already, you can find first mention about this at:

http://git.debian.org/?p=collab-maint/ltrace.git;a=commit;h=2721e6adbc7399e1b621b896b4f5b864082fbf03

but it really works from this commit:

http://git.debian.org/?p=collab-maint/ltrace.git;a=commit;h=6be8028492b29f68126ce413b7ef4b875c7bb017

I have created a patch based on the diff-comits with small changes to preserve naming convention of older sources from RHEL5 (+ removing some parts of previous patches (now useless) + fixing compilation errors). The patch fixes this bug and also passes all the tests (make check).


Another problem still persists even with the latest sources, the ltrace's author haven't implemented yet the support for having multiple threads sharing the same code with breakpoints inserted there:

http://lists.alioth.debian.org/pipermail/ltrace-devel/2006-April/000035.html
http://lists.alioth.debian.org/pipermail/ltrace-devel/2006-April/000036.html

But it seems that he is planning to fix this in a near future:

http://git.debian.org/?p=collab-maint/ltrace.git;a=blob;f=TODO;h=5063737726b1f1ecdfe59159b722f0807497f8f6;hb=163765561621a9127cee3243dd94011e70a2c580

Comment 1 Martin Osvald 🛹 2009-09-28 09:53:43 UTC
Created attachment 362874 [details]
temporary patch

Comment 8 Thom Carlin 2010-07-22 17:20:50 UTC
Also occurs on Fedora Core 13

Comment 11 RHEL Program Management 2010-08-09 19:50:03 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 16 Petr Machata 2010-09-01 12:21:14 UTC
Confirmed that the upstream version fixes this issue.  I'll look into how difficult would it be to port this over to RHEL (or rather the other way around, to merge our patches upstream; note we cannot simply rebase, ltrace hasn't been able to pass its own testsuite for quite some time now).

Comment 17 Michal Nowak 2010-10-08 06:52:27 UTC
Just a note, upstream started 0.6 release and is looking for ltrace patches laying all around. Seems like a good time to merge RHEL & Fedora patches upstream.

Comment 37 RHEL Program Management 2011-05-31 15:40:47 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 43 Petr Machata 2011-07-29 11:19:32 UTC
Created attachment 515874 [details]
Source RPM with rebased patchset

This src rpm includes the complete patch set.  I'll spin scratch binaries next and link them here.  This works on Fedora 14 on x86_64, I haven't had a chance to check this anywhere else yet.  Test case is included.

Comment 45 Petr Machata 2011-08-01 19:14:08 UTC
Created attachment 516201 [details]
Source RPM with rebased patchset

On RHEL 5, process status of stopped process is always reported as "T" in /proc/pid/status, regardless of whether it's in tracing or job control stop.  The status is clarified further on the line.  This SRPM contains an additional patch that implements this.  Multi-threaded tracing now works for me on x86_64 RHEL 5 machine.  Scratch build for x86_64 here:
https://brewweb.devel.redhat.com/taskinfo?taskID=3530264

Comment 46 Masaki Furuta ( RH ) 2011-08-02 03:23:25 UTC
(In reply to comment #45)
> Created attachment 516201 [details]
> Source RPM with rebased patchset
> 
> On RHEL 5, process status of stopped process is always reported as "T" in
> /proc/pid/status, regardless of whether it's in tracing or job control stop. 
> The status is clarified further on the line.  This SRPM contains an additional
> patch that implements this.  Multi-threaded tracing now works for me on x86_64
> RHEL 5 machine.

Dear Tanaka-san,

Would you please rebuild attached SRPM on x86_64 and verify it, if possible?

Best Regards,
Masaki Furuta

Comment 47 Petr Machata 2011-08-02 11:59:07 UTC
Verified that the same SRPM works also on i386, ppc and ppc64.

Comment 48 Petr Machata 2011-08-02 18:16:29 UTC
Created attachment 516376 [details]
Source RPM with rebased patchset

This includes an extra patch that fixes a race between waiting for SIGSTOP to be delivered to all tasks, and one of those tasks exiting.

Furthermore, I verified that multi-threaded racing also works on s390 and s390x.

Comment 49 Petr Machata 2011-08-04 20:36:27 UTC
Created attachment 516786 [details]
Source RPM with rebased patchset

This SRPM contains one additional correctness fix and two race condition fixes.

Comment 55 Petr Machata 2011-08-05 13:39:08 UTC
Created attachment 516896 [details]
Additional compilation fix for ia64

Comment 56 Kenichi TANAKA 2011-08-08 05:50:46 UTC
Hi Furuta-san,

NEC confirmed the fix with the test package. 

Thanks,
Ken'ichi Tanaka

Comment 64 Petr Machata 2011-08-18 02:04:02 UTC
Created attachment 518786 [details]
Partial fix

This is partial fix that removes a race between obtaining list of tasks and attaching to those tasks.  This fixes SIGSEGVs that ltrace was getting and copious "can't attach" messages that it was putting out when it tried to attach to processes with many (100-ish) threads.

I still see a different attach problem, where the tracee dies of SIGTRAP when ltrace attempts to attach.  Similarly, detach problems were not addressed yet.

Comment 65 Petr Machata 2011-08-19 17:24:48 UTC
Created attachment 519070 [details]
Additional fix for attach

This is additional patch that amends the previous "Partial fix" and fixes attach to many-threaded processes.  The gist of the addressed problem is that some tasks were still running unattached while the breakpoints were already inserted.

Comment 66 Petr Machata 2011-08-20 21:14:49 UTC
Created attachment 519156 [details]
Additional fix for detach

This is additional patch that fixes a number of bugs in detach logic that would leave tracee in inconsistent state (with pending events, instruction pointers pointing mid-instruction, breakpoints left inside, and perhaps others).  As a result, the tracee would be killed, or could generally change behavior as the result of tracing.

I have ltrace with this fix spinning on my Fedora machine, attaching and detaching to a multi-threaded process, and it seems to be stable and well-behaved, at least on x86_64.  I'll do more thorough cross-arch testing on Monday.

Comment 71 Petr Machata 2011-09-09 19:35:48 UTC
Created attachment 522392 [details]
Additional fix

Additional fix for proper syscall detection, ported from upstream.

Comment 72 Masaki Furuta ( RH ) 2011-09-13 10:47:42 UTC
(In reply to comment #71)
> Created attachment 522392 [details]
> Additional fix
> 
> Additional fix for proper syscall detection, ported from upstream.

Dear NEC,

Would you please verify this as well, if possible?

Best Regards,
Masaki Furuta

Comment 73 Petr Machata 2011-09-14 20:28:40 UTC
Patch series updated for additional of tracing of ppc64 binaries.  Fix in CVS.

Comment 74 Petr Machata 2011-09-15 16:21:12 UTC
Patch series updated for additional fixlets in test suite and error reporting, and overzealous syscall detection on s390, introduced in past fix, removed.  Fix in CVS.

Comment 75 Osamu Samukawa 2011-09-16 03:58:41 UTC
Dear Furuta-san,

>>Dear NEC,
>>
>>Would you please verify this as well, if possible?

I confirmed this problem is fixed on server updated ltrace-0.5-syscall_p.patch.

Comment 77 Masaki Furuta ( RH ) 2011-09-20 05:53:11 UTC
Dear NEC,

Sorry for bothering you ,but As per suggestion from engineering, could you please verify our latest ltrace package again? I'm uploading ltrace-0.5-13.45svn.el5_7.4 at http://people.redhat.com/~mfuruta/.526007/

Best Regards,
Masaki Furuta

Comment 78 Osamu Samukawa 2011-09-21 05:21:38 UTC
Dear Furuta-San,

>> Sorry for bothering you ,but As per suggestion from engineering, could you
>> please verify our latest ltrace package again? I'm uploading
>> ltrace-0.5-13.45svn.el5_7.4 at http://people.redhat.com/~mfuruta/.526007/

I confirmed this problem is fixed with ltrace-0.5-13.45svn.el5_7.4.

Comment 79 Petr Machata 2011-09-21 16:44:46 UTC
Patch series updated to not attempt to run a binary from unsupported architecture, which was a regression.  This is just a nit related to ltrace start-up, and should have no impact to the tracing itself.

Comment 80 Petr Machata 2011-09-27 21:38:34 UTC
Created attachment 525218 [details]
Additional fix for vfork

Now that ltrace recognizes threads, it also triggers when vfork is called, but it mishandles it.  This patch implements the fixes necessary for proper vfork support.  Only verified on x86_64 as of now, I may update the fix if it turns out to be broken on other arches.

Comment 81 Petr Machata 2011-10-03 18:23:07 UTC
The above patch is racy, we need to re-enable the vfork breapoint in _parent_, not _leader_.  The fix is trivial and already in CVS.

Comment 84 Masayoshi Yamazaki 2011-10-12 09:18:27 UTC
Dear Samukawa-san,

Could you verify a new test package? Our engineer believes it is complete.
http://people.redhat.com/myamazak/bz526007/

Best regards,
M Yamazaki

Comment 85 Osamu Samukawa 2011-10-12 11:25:40 UTC
Dear Yamazaki-San,

>>Could you verify a new test package? Our engineer believes it is complete.
>>http://people.redhat.com/myamazak/bz526007/

I confirmed this problem is fixed with ltrace-0.5-13.45svn.el5_7.7.

Comment 86 Masayoshi Yamazaki 2011-10-12 11:55:45 UTC
Dear Samukawa-san,

Thank you for the confirmation and feedback.

Regards,
M Yamazaki

Comment 108 errata-xmlrpc 2013-09-30 22:39:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1317.html


Note You need to log in before you can comment on or make changes to this bug.