Bug 247907

Summary: [cvs] Lockup on exit_group() by the non-leader of 3 threads
Product: [Fedora] Fedora
Reporter: Jan Kratochvil <jan.kratochvil>
Component: strace
Assignee: Roland McGrath <roland>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: rawhide
Hardware: All
OS: Linux
Fixed In Version: 4.5.16-1.fc7
Doc Type: Bug Fix
Last Closed: 2007-08-06 17:59:36 UTC
Bug Blocks: 222053
Attachments (description: flags):
  Testcase (`leaderkill2.c'): none
  patch does not work: none
  patch does work: none

Description Jan Kratochvil 2007-07-11 22:22:33 UTC
Description of problem:
An artificial testcase locks up strace if:
Thread 0 (thread group leader) stays in pause(2).
Thread 1 stays in pause(2).
Thread 2 calls exit_group(2).

Version-Release number of selected component (if applicable):
CVS snapshot of 2007-07-11 with strace.c revision 1.81.

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o test/leaderkill2 test/leaderkill2.c -Wall -ggdb2 -pthread
2. ./test/leaderkill2 & pid=$!;sleep 1;./strace -f -p $pid

Actual results:
[pid 27502] nanosleep({1, 0}, {1, 0})   = 0
[pid 27502] exit_group(42)              = ?
Process 27502 detached
- hang

Expected results:
strace finishes successfully and its output must include: write(1, "OK\n", ...

Additional info:
Similar problem to the one in the recent GDB Bug 247354, with a patch posted as:
  http://sources.redhat.com/ml/gdb-patches/2007-07/msg00136.html
I did not find a better solution there, as it cannot modify the inferior's
state / event processing much.

For strace I would choose to:
 * Detach all the other threads from TCBTAB before the leader thread.
If the task is already running we cannot safely stop-and-wait for it, because we
cannot tell a still-running task from an already-zombie one: both return
ESRCH.  Checking /proc/PID/status for the zombie state is also racy.


This Bug is a followup on Roland's mail text:
On Wed, 11 Jul 2007 10:40:48 +0200, Roland McGrath wrote:
[snip]
I am still
concerned about other cases where there are more threads.  I think that the
synchronous wait in detach will bite again on the leader because the other
threads still exist.  They should be killed by the group exit, but they
will still stick around as zombies until we see them with wait because they
are ptraced.  I think that is enough to prevent the zombie leader from
being reported to wait.  So it would be good to investigate some more
cases.  If I'm right about that case, then I think the right solution is
simply to punt the detach call on the leader in handle_group_exit.  It
should be seen shortly along with all the other threads.  But I may be
overlooking something, some reason that detach was there other than ancient
kernels.
[snip]

Comment 1 Jan Kratochvil 2007-07-11 22:22:33 UTC
Created attachment 159013
Testcase (`leaderkill2.c').

Comment 2 Roland McGrath 2007-07-24 02:15:36 UTC
Created attachment 159827
patch does not work

Please follow up on the mailing list about this.
I tried the obvious patch and it did not make a happy strace for this test.

Comment 3 Jan Kratochvil 2007-08-02 08:50:05 UTC
Created attachment 160505
patch does work

Compared to Attachment 159827, the last part is missing:
-				leader->flags |= TCB_GROUP_EXITING;
but I do not see much reason to do it there; I expect you were just cleaning up
the code.

Comment 4 Roland McGrath 2007-08-03 10:05:16 UTC
committed upstream

Comment 5 Jan Kratochvil 2007-08-03 12:03:37 UTC
Fixed in Rawhide strace-4.5.16-1.fc8:
* Fri Aug  3 2007 Roland McGrath <roland> - 4.5.16-1
- fix multithread issues ([...], #247907)

and upstream:

2007-08-02  Jan Kratochvil  <jan.kratochvil>

        * strace.c (detach): Moved the resume notification code to ...
        (resume_from_tcp): ... a new function here.
        (handle_group_exit): No longer detach also the thread group leader.
        (trace): Fixed panic on exit of the TCB_GROUP_EXITING leader itself.
        Fixes RH#247907.

        * test/leaderkill.c (start): Renamed to ...
        (start0): ... here.
        (start1): New function.
        (main): Created a new spare thread.


Comment 6 Fedora Update System 2007-08-06 17:59:14 UTC
strace-4.5.16-1.fc7 has been pushed to the Fedora 7 stable repository.  If problems still persist, please make note of it in this bug report.