Description of problem:
An artificial testcase locks up strace if:
  Thread 0 (thread group leader) stays in pause(2).
  Thread 1 stays in pause(2).
  Thread 2 calls exit_group(2).

Version-Release number of selected component (if applicable):
CVS snapshot of 2007-07-11 with strace.c revision 1.81.

How reproducible:
Always.

Steps to Reproduce:
1. gcc -o test/leaderkill2 test/leaderkill2.c -Wall -ggdb2 -pthread
2. ./test/leaderkill2 & pid=$!;sleep 1;./strace -f -p $pid

Actual results:
[pid 27502] nanosleep({1, 0}, {1, 0}) = 0
[pid 27502] exit_group(42) = ?
Process 27502 detached
- hang

Expected results:
Successful finish, and it must print:
write(1, "OK\n", ...

Additional info:
A similar problem to the recent GDB Bug 247354, with a patch posted as:
  http://sources.redhat.com/ml/gdb-patches/2007-07/msg00136.html
I did not find a better solution there, as it cannot modify the inferior's state or event processing much.

For strace I would choose to:
 * Detach all the other threads in TCBTAB before the leader one.
   If the task is already running, we cannot safely stop-and-wait it, because we cannot tell a still-running task from an already-zombie one: both return ESRCH. Reading /proc/PID/status to check whether it is a zombie is also racy.

This bug is a follow-up to Roland's mail:

On Wed, 11 Jul 2007 10:40:48 +0200, Roland McGrath wrote:
[snip]
I am still concerned about other cases where there are more threads. I think that the synchronous wait in detach will bite again on the leader because the other threads still exist. They should be killed by the group exit, but they will still stick around as zombies until we see them with wait because they are ptraced. I think that is enough to prevent the zombie leader from being reported to wait. So it would be good to investigate some more cases.

If I'm right about that case, then I think the right solution is simply to punt the detach call on the leader in handle_group_exit. It should be seen shortly along with all the other threads. But I may be overlooking something, some reason that detach was there other than ancient kernels.
[snip]
Created attachment 159013
Testcase (`leaderkill2.c').
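The attachment itself is not reproduced here. The scenario from the description (leader in pause(2), one thread in pause(2), one thread calling exit_group(2)) can be sketched roughly as below. This is an assumption-laden illustration, not the attached leaderkill2.c: all function names are hypothetical, the one-second sleep and the exit code 42 are taken from the strace output in the report, and the fork() wrapper exists only so that the caller can observe the exit status.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread 1: stays in pause(2) until the whole group is killed. */
static void *pause_forever(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Thread 2: sleeps briefly, then ends the whole thread group. */
static void *group_exit_thread(void *arg)
{
    (void)arg;
    sleep(1);
    syscall(SYS_exit_group, 42);    /* raw exit_group(2), never returns */
    return NULL;
}

/* Thread 0 (thread group leader): starts both threads, then pauses. */
static void leader_body(void)
{
    pthread_t t1, t2;
    int rc;

    rc = pthread_create(&t1, NULL, pause_forever, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, group_exit_thread, NULL);
    assert(rc == 0);
    (void)rc;
    for (;;)
        pause();
}

/* Hypothetical wrapper: run the scenario in a child process and return
 * its exit status, or -1 on failure. */
int run_group_exit_scenario(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        leader_body();              /* never returns */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without a tracer attached, the child simply exits with status 42. The hang in this report only appears when strace -f -p is attached to such a process while the leader and the spare thread are still in pause(2).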
Created attachment 159827
patch does not work

Please follow up on the mailing list about this. I tried the obvious patch and it did not make a happy strace for this test.
Created attachment 160505
patch does work

Compared to attachment 159827, the last part is missing:
    leader->flags |= TCB_GROUP_EXITING;
but I do not see much reason to do it there; I expect you did it just as a cleanup of the code.
committed upstream
Fixed in Rawhide strace-4.5.16-1.fc8:

* Fri Aug 3 2007 Roland McGrath <roland> - 4.5.16-1
- fix multithread issues ([...], #247907)

and upstream:

2007-08-02  Jan Kratochvil  <jan.kratochvil>

    * strace.c (detach): Moved the resume notification code to ...
    (resume_from_tcp): ... a new function here.
    (handle_group_exit): No longer detach also the thread group leader.
    (trace): Fixed panic on exit of the TCB_GROUP_EXITING leader itself.
    Fixes RH#247907.
    * test/leaderkill.c (start): Renamed to ...
    (start0): ... here.
    (start1): New function.
    (main): Created a new spare thread.
strace-4.5.16-1.fc7 has been pushed to the Fedora 7 stable repository. If problems still persist, please make note of it in this bug report.