247907 – [cvs] Lockup on exit_group() by the non-leader of 3 threads

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	strace
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Roland McGrath
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	222053

Reported:	2007-07-11 22:22 UTC by Jan Kratochvil
Modified:	2007-11-30 22:12 UTC
CC List:	0 users
Fixed In Version:	4.5.16-1.fc7
Clone Of:
Environment:
Last Closed:	2007-08-06 17:59:36 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Testcase (`leaderkill2.c'). (1.25 KB, text/plain) 2007-07-11 22:22 UTC, Jan Kratochvil
patch does not work (1.58 KB, patch) 2007-07-24 02:15 UTC, Roland McGrath
patch does work (7.35 KB, patch) 2007-08-02 08:50 UTC, Jan Kratochvil

Description Jan Kratochvil 2007-07-11 22:22:33 UTC

Description of problem:
An artificial testcase locks up strace if:
Thread 0 (thread group leader) stays in pause(2).
Thread 1 stays in pause(2).
Thread 2 calls exit_group(2).
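
The three-thread scenario above can be sketched roughly as follows. This is not the attached leaderkill2.c, only a minimal illustration under assumptions: the helper name run_group_exit_scenario() and its delay parameter are hypothetical, and the scenario runs in a forked child so the caller can check the exit status. Build with -pthread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Thread 1: stays in pause(2), just like the group leader. */
static void *pause_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL; /* not reached */
}

/* Thread 2: the non-leader that ends the whole thread group. */
static void *exit_thread(void *arg)
{
    struct timespec delay = { .tv_sec = *(const time_t *)arg, .tv_nsec = 0 };

    nanosleep(&delay, NULL);      /* leave time for a tracer to attach */
    syscall(SYS_exit_group, 42);  /* raw exit_group(2), bypassing atexit handlers */
    return NULL; /* not reached */
}

/*
 * Hypothetical helper (not part of strace or the attached testcase):
 * runs the scenario in a child process and returns the child's exit
 * status, or -1 if the child could not be started or died abnormally.
 */
int run_group_exit_scenario(time_t delay_sec)
{
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        pthread_t t1, t2;
        int rc;

        rc = pthread_create(&t1, NULL, pause_thread, NULL);
        assert(rc == 0);
        rc = pthread_create(&t2, NULL, exit_thread, &delay_sec);
        assert(rc == 0);

        /* Thread 0 (the thread group leader) stays in pause(2). */
        for (;;)
            pause();
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_group_exit_scenario(1) from main() should yield 42 when the child is not traced; the lockup described here only shows up when strace -f is attached to the child before thread 2 exits.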

Version-Release number of selected component (if applicable):
CVS snapshot of 2007-07-11 with strace.c revision 1.81.

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o test/leaderkill2 test/leaderkill2.c -Wall -ggdb2 -pthread
2. ./test/leaderkill2 & pid=$!;sleep 1;./strace -f -p $pid

Actual results:
[pid 27502] nanosleep({1, 0}, {1, 0})   = 0
[pid 27502] exit_group(42)              = ?
Process 27502 detached
- hang

Expected results:
strace finishes successfully and prints: write(1, "OK\n", ...

Additional info:
Similar problem to the recent GDB Bug 247354, for which a patch was posted as:
  http://sources.redhat.com/ml/gdb-patches/2007-07/msg00136.html
I did not find a better solution there, as it cannot do much to modify the
inferior's state / event processing.

For strace I would choose to:
 * Detach all the other threads in TCBTAB before the leader.
If a task is already running, we cannot safely stop it and wait for it, because
we cannot distinguish a still-running task from an already-zombie one: both
return ESRCH.  Checking /proc/PID/status for the zombie state is also racy.
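
The proposed ordering (non-leader threads first, leader last) can be illustrated with a small standalone sketch. It does not use strace's real TCBTAB or struct tcb; the task_entry type and detach_order() function are hypothetical names invented for this example, and the actual detach calls are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a traced-task table entry (not strace's struct tcb). */
struct task_entry {
    pid_t pid;   /* thread ID of this task */
    pid_t tgid;  /* thread group ID; equals pid for the group leader */
};

/*
 * Fill out[] with the pids of thread group `tgid` in the order they
 * would be detached: every non-leader thread first, the leader last.
 * out[] must have room for n entries.  Returns the number of pids written.
 */
size_t detach_order(const struct task_entry *tab, size_t n, pid_t tgid, pid_t *out)
{
    size_t count = 0;
    int have_leader = 0;

    assert(tab != NULL && out != NULL);

    /* First pass: collect the non-leader threads of the group. */
    for (size_t i = 0; i < n; i++) {
        if (tab[i].tgid != tgid)
            continue;            /* belongs to some other thread group */
        if (tab[i].pid == tgid)
            have_leader = 1;     /* remember the leader, emit it last */
        else
            out[count++] = tab[i].pid;
    }

    /* The leader goes last, if it is in the table at all. */
    if (have_leader)
        out[count++] = tgid;

    return count;
}
```

For a table holding tasks 100 (leader), 101, and 102 of group 100, the resulting order is 101, 102, 100 regardless of where the leader sits in the table.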


This bug is a follow-up to Roland's mail:
On Wed, 11 Jul 2007 10:40:48 +0200, Roland McGrath wrote:
[snip]
I am still
concerned about other cases where there are more threads.  I think that the
synchronous wait in detach will bite again on the leader because the other
threads still exist.  They should be killed by the group exit, but they
will still stick around as zombies until we see them with wait because they
are ptraced.  I think that is enough to prevent the zombie leader from
being reported to wait.  So it would be good to investigate some more
cases.  If I'm right about that case, then I think the right solution is
simply to punt the detach call on the leader in handle_group_exit.  It
should be seen shortly along with all the other threads.  But I may be
overlooking something, some reason that detach was there other than ancient
kernels.
[snip]

Comment 1 Jan Kratochvil 2007-07-11 22:22:33 UTC

Created attachment 159013
Testcase (`leaderkill2.c').

Comment 2 Roland McGrath 2007-07-24 02:15:36 UTC

Created attachment 159827
patch does not work

Please follow up on the mailing list about this.
I tried the obvious patch and it did not make a happy strace for this test.

Comment 3 Jan Kratochvil 2007-08-02 08:50:05 UTC

Created attachment 160505
patch does work

Compared to Attachment 159827, this patch is missing the last part:
-				leader->flags |= TCB_GROUP_EXITING;
but I do not see much reason to make that change there; I expect you were just
cleaning up the code.

Comment 4 Roland McGrath 2007-08-03 10:05:16 UTC

committed upstream

Comment 5 Jan Kratochvil 2007-08-03 12:03:37 UTC

Fixed in Rawhide strace-4.5.16-1.fc8:
* Fri Aug  3 2007 Roland McGrath <roland> - 4.5.16-1
- fix multithread issues ([...], #247907)

and upstream:

2007-08-02  Jan Kratochvil  <jan.kratochvil>

        * strace.c (detach): Moved the resume notification code to ...
        (resume_from_tcp): ... a new function here.
        (handle_group_exit): No longer detach also the thread group leader.
        (trace): Fixed panic on exit of the TCB_GROUP_EXITING leader itself.
        Fixes RH#247907.

        * test/leaderkill.c (start): Renamed to ...
        (start0): ... here.
        (start1): New function.
        (main): Created a new spare thread.

Comment 6 Fedora Update System 2007-08-06 17:59:14 UTC

strace-4.5.16-1.fc7 has been pushed to the Fedora 7 stable repository.  If problems still persist, please make note of it in this bug report.
