Bug 247907 - [cvs] Lockup on exit_group() by the non-leader of 3 threads
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: strace
Version: rawhide
Hardware: All, OS: Linux
Priority: low, Severity: medium
Assigned To: Roland McGrath
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: 222053
Reported: 2007-07-11 18:22 EDT by Jan Kratochvil
Modified: 2007-11-30 17:12 EST (History)
Fixed In Version: 4.5.16-1.fc7
Doc Type: Bug Fix
Last Closed: 2007-08-06 13:59:36 EDT

Attachments
Testcase (`leaderkill2.c'). (1.25 KB, text/plain)
2007-07-11 18:22 EDT, Jan Kratochvil
patch does not work (1.58 KB, patch)
2007-07-23 22:15 EDT, Roland McGrath
patch does work (7.35 KB, patch)
2007-08-02 04:50 EDT, Jan Kratochvil

Description Jan Kratochvil 2007-07-11 18:22:33 EDT
Description of problem:
An artificial testcase locks up strace if:
Thread 0 (thread group leader) stays in pause(2).
Thread 1 stays in pause(2).
Thread 2 calls exit_group(2).
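
A minimal sketch of a program with this shape follows. It is not the attached `leaderkill2.c'; the names pauser, exiter and run_group_exit_case, and the delay parameter, are assumptions made for illustration. The scenario runs in a forked child so that the caller can check the exit status of the whole thread group.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical knob: seconds thread 2 waits before ending the group. */
static unsigned int exit_delay;

/* Thread 1: stays in pause(2) until the whole group is killed. */
static void *pauser(void *arg)
{
    (void)arg;
    for (;;)
        pause();
}

/* Thread 2: the non-leader thread that calls exit_group(2). */
static void *exiter(void *arg)
{
    (void)arg;
    sleep(exit_delay);              /* leave time for a tracer to attach */
    syscall(SYS_exit_group, 42);
    return NULL;                    /* not reached */
}

/*
 * Run the three-thread scenario in a child process.
 * Returns the child's exit status, or -1 on any failure.
 */
int run_group_exit_case(unsigned int delay)
{
    exit_delay = delay;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        pthread_t t1, t2;
        if (pthread_create(&t1, NULL, pauser, NULL) != 0 ||
            pthread_create(&t2, NULL, exiter, NULL) != 0)
            _exit(1);
        for (;;)
            pause();                /* thread 0: the thread group leader */
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without a tracer attached, the child should simply exit with status 42. To try it against strace, the delay has to be long enough to attach with -f -p before thread 2 runs.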

Version-Release number of selected component (if applicable):
CVS snapshot of 2007-07-11 with strace.c revision 1.81.

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o test/leaderkill2 test/leaderkill2.c -Wall -ggdb2 -pthread
2. ./test/leaderkill2 & pid=$!;sleep 1;./strace -f -p $pid

Actual results:
[pid 27502] nanosleep({1, 0}, {1, 0})   = 0
[pid 27502] exit_group(42)              = ?
Process 27502 detached
- hang

Expected results:
Successful finish and it must print: write(1, "OK\n", ...

Additional info:
Similar problem to the recent GDB Bug 247354, for which a patch was posted as:
  http://sources.redhat.com/ml/gdb-patches/2007-07/msg00136.html
I did not find a better solution there, as GDB cannot modify the inferior's
state / event processing much.

For strace I would choose to:
 * Detach all the other threads in TCBTAB before the leader.
If a task is already running, we cannot safely stop it and wait for it, because
we cannot tell a still-running task from an already-zombie one: both return
ESRCH.  Checking /proc/PID/status for the zombie state is also racy.
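
The proposed ordering can be sketched as follows. This is an illustration under stated assumptions, not strace code: struct task and detach_order are hypothetical names, and the real TCBTAB entries carry far more state. The function only computes the order, non-leader threads first and the leader last; the detaching itself is left out.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a tracer's per-task record (not strace's tcb). */
struct task {
    pid_t tid;   /* thread ID */
    pid_t tgid;  /* thread group ID; equals tid for the group leader */
};

/*
 * Write into out[] the indices of the tasks belonging to thread group
 * `tgid`, in the order they should be detached: every non-leader thread
 * first, the leader last.  out[] must have room for n entries.
 * Returns the number of indices written.
 */
size_t detach_order(const struct task *tasks, size_t n, pid_t tgid,
                    size_t *out)
{
    size_t count = 0;
    size_t leader = n;              /* n means "leader not in the table" */

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].tgid != tgid)
            continue;               /* belongs to another thread group */
        if (tasks[i].tid == tgid)
            leader = i;             /* remember the leader for later */
        else
            out[count++] = i;       /* non-leaders go first */
    }
    if (leader != n)
        out[count++] = leader;      /* the leader always goes last */
    return count;
}
```

For a table holding tasks 100 (leader), 101 and 102 of group 100 plus an unrelated task 200, the result lists 101 and 102 before 100 and skips 200.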


This Bug is a followup on Roland's mail text:
On Wed, 11 Jul 2007 10:40:48 +0200, Roland McGrath wrote:
[snip]
I am still
concerned about other cases where there are more threads.  I think that the
synchronous wait in detach will bite again on the leader because the other
threads still exist.  They should be killed by the group exit, but they
will still stick around as zombies until we see them with wait because they
are ptraced.  I think that is enough to prevent the zombie leader from
being reported to wait.  So it would be good to investigate some more
cases.  If I'm right about that case, then I think the right solution is
simply to punt the detach call on the leader in handle_group_exit.  It
should be seen shortly along with all the other threads.  But I may be
overlooking something, some reason that detach was there other than ancient
kernels.
[snip]
Comment 1 Jan Kratochvil 2007-07-11 18:22:33 EDT
Created attachment 159013
Testcase (`leaderkill2.c').
Comment 2 Roland McGrath 2007-07-23 22:15:36 EDT
Created attachment 159827
patch does not work

Please follow up on the mailing list about this.
I tried the obvious patch and it did not make a happy strace for this test.
Comment 3 Jan Kratochvil 2007-08-02 04:50:05 EDT
Created attachment 160505
patch does work

Compared to Attachment 159827, the last part is missing:
-				leader->flags |= TCB_GROUP_EXITING;
but I do not see much reason to do it there; I expect you were just cleaning
up the code.
Comment 4 Roland McGrath 2007-08-03 06:05:16 EDT
committed upstream
Comment 5 Jan Kratochvil 2007-08-03 08:03:37 EDT
Fixed in Rawhide strace-4.5.16-1.fc8:
* Fri Aug  3 2007 Roland McGrath <roland@redhat.com> - 4.5.16-1
- fix multithread issues ([...], #247907)

and upstream:

2007-08-02  Jan Kratochvil  <jan.kratochvil@redhat.com>

        * strace.c (detach): Moved the resume notification code to ...
        (resume_from_tcp): ... a new function here.
        (handle_group_exit): No longer detach also the thread group leader.
        (trace): Fixed panic on exit of the TCB_GROUP_EXITING leader itself.
        Fixes RH#247907.

        * test/leaderkill.c (start): Renamed to ...
        (start0): ... here.
        (start1): New function.
        (main): Created a new spare thread.
Comment 6 Fedora Update System 2007-08-06 13:59:14 EDT
strace-4.5.16-1.fc7 has been pushed to the Fedora 7 stable repository.  If problems still persist, please make note of it in this bug report.
