Bug 209873

Summary: broken strace/gdb of threaded programs
Product: [Fedora] Fedora
Reporter: David Woodhouse <dwmw2>
Component: kernel
Assignee: Roland McGrath <roland>
Status: CLOSED RAWHIDE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: rawhide
CC: davej, wtogami
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-10-28 05:40:03 UTC
Attachments:
Description    Flags
test program   none
utrace fix     none

Description David Woodhouse 2006-10-07 09:42:56 UTC
When traced threaded programs exit, the tracing program doesn't seem to notice.


20103 pts/18   S+     0:00  |           \_ strace -f /usr/sbin/openpbx -vvvvdnc
20128 pts/18   Zl+    0:00  |               \_ [openpbx] <defunct>

29765 pts/4    S      0:00  |           \_ gdb --args /usr/sbin/openpbx -vvvvdnc
29766 pts/4    Zl+    0:00  |               \_ [openpbx] <defunct>
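
For readers unfamiliar with the "Z ... <defunct>" state in the listings above: a process shows up that way after it has exited but before its parent (here, the tracer) has collected it with a wait call. The following is a minimal sketch of that general mechanism only, assuming Linux with /proc mounted. It does not involve ptrace or utrace and does not reproduce this bug; the function name zombie_state_before_reap is made up for illustration.

```python
import os


def zombie_state_before_reap():
    """Fork a child that exits at once, then inspect it before reaping.

    Returns (state, exit_status), where state is the one-letter process
    state from /proc/<pid>/stat read while the child is still unreaped.
    Linux-only: relies on /proc and os.waitid().
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately with status 1, skipping interpreter cleanup.
        os._exit(1)

    # Parent: block until the child has exited, but leave it unreaped
    # (WNOWAIT) so that it stays in the process table as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # /proc/<pid>/stat looks like "pid (comm) S ..."; the state letter is
    # the first field after the closing parenthesis of the command name.
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    state = stat.rsplit(")", 1)[1].split()[0]

    # Now actually reap the child; after this the zombie entry goes away.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(zombie_state_before_reap())
```

In the listings above the tracer never gets as far as the final reaping step, which is consistent with the report that it "doesn't seem to notice" the exit.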

Comment 1 David Woodhouse 2006-10-11 14:03:56 UTC
[New Thread 844313792 (LWP 9809)]
    -- Executing V110("mISDN/1-u1", "") in new stack
[Thread 844313792 (LWP 9809) exited]
reading register pc (#64): No such process.
(gdb) c
Continuing.
reading register pc (#64): No such process.

WTF?

Comment 2 Roland McGrath 2006-10-12 03:11:40 UTC
Report omits kernel info.

Comment 3 David Woodhouse 2006-10-12 06:12:43 UTC
Current FC6 kernel on the architecture indicated above:

Linux pegasos.infradead.org 2.6.18-1.2741.fc6 #1 Wed Oct 4 20:18:10 EDT 2006 ppc ppc ppc GNU/Linux

The same happens on a ppc64 kernel. Here's what I see when 'strace -f' traces a
threaded program that exits:

[pid  9385] write(1, "Setting timer 268505240 for 5-se"..., 51Setting timer 268505240 for 5-second expiration...
) = 51
[pid  9385] timer_settime(0, 0, {it_interval={5, 0}, it_value={5, 0}}, NULL) = 0
[pid  9385] exit_group(1)               = ?
[pid  9386] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid  9386] SYS_300(0x30033510, 0xc, 0x30033508, 0x3003a940, 0x30033508) = 0
[pid  9386] rt_sigtimedwait([RTMIN], 


 <unfinished ...>
Process 9386 detached

[1]+  Stopped                 strace -f ./sigev_thread
[root@pegasos dwmw2]# kill -9 %1
[1]+  Killed                  strace -f ./sigev_thread




Comment 4 Roland McGrath 2006-10-12 09:22:04 UTC
I can only test ppc64 kernels.  I tried my vanilla 2.6.18+utrace ppc64 kernel on
an otherwise FC5 ppc/ppc64 installation, and strace -f worked fine.  I still don't
have your actual test case, so I used a trivial multithreaded program of my own.
Please attach your test program source so I can try what you tried.

I have a lot of downloading and installing to do before I can test that FC6
kernel, or test any kernel in an FC6 environment.

Comment 5 David Woodhouse 2006-10-12 10:12:46 UTC
Created attachment 138320 [details]
test program

This is sufficient.

Comment 6 Roland McGrath 2006-10-12 10:23:35 UTC
Ok, I did reproduce some weirdness with 2.6.18+utrace on ppc64 using your test.
It looks like the bug is specifically with a group exit that should be killing
many live threads, which did not happen in my trivial test.
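
A minimal sketch of the pattern described here, for anyone who wants to exercise the same code path: several threads are still alive and blocked when the main thread exits the whole process. This is not the attached test program (which is not reproduced in this report). It assumes Linux, where os._exit() ends up in the exit_group system call via the C library, and the names run_group_exit and CHILD_SOURCE are made up for illustration.

```python
import subprocess
import sys

# Source for the child process: start N threads that block forever, then
# exit the whole process from the main thread while they are still alive.
# Assumption: on Linux, os._exit() is carried out by the exit_group
# system call, which terminates every thread in the thread group.
CHILD_SOURCE = """
import os, threading, time

def worker():
    # Block indefinitely so the thread is alive at exit time.
    threading.Event().wait()

for _ in range({n}):
    threading.Thread(target=worker).start()

time.sleep(0.2)   # give the threads time to start and block
os._exit(1)       # group exit with live threads still running
"""


def run_group_exit(n_threads):
    """Run the child with n_threads blocked threads; return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE.format(n=n_threads)],
        timeout=30,
    )
    return result.returncode


if __name__ == "__main__":
    print(run_group_exit(8))
```

Running the script under 'strace -f' should show the exit_group call with the other threads still present, which is the situation the trivial test did not cover.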

Comment 7 Roland McGrath 2006-10-12 10:45:16 UTC
Same problem on i386, so it's not machine-specific; it's triggered by this test case.
I'll figure it out.

Comment 8 Roland McGrath 2006-10-12 21:01:10 UTC
Created attachment 138377 [details]
utrace fix

This fixes the test case.  I am also looking into other utrace interactions
with SIGKILL.