Bug 289411
Description
Seppo Sahrakorpi
2007-09-13 16:18:08 UTC
Created attachment 194741 [details]
Reproducer code
Confirming that `totalview.8.2.0-1-linux-x86-64' reproduces the problem on RHEL-5.0:

  ERROR: Process 13693 failed to stop when we attached to it. Wait result = -1,

(This is an ESRCH error from a PTRACE_CONT syscall.)

The minimal testcase to be run by TotalView (`manythreads.c') is GCC-compiled code that calls pthread_create() multiple times. (I reduced the testcase from the original ICC-compiled `test_threads.cxx'.)

The testcase that reproduces the problem without TotalView is `attachstopped.c'. It duplicates the testcase from Bug 233540 Comment 0.

The ESRCH error should be fixed in the RHEL-5.1 kernel (Bug 233540). (There is no guarantee of anything, as the RHEL-5.1 product is still not gold.) The problem fixed there is the behavior of wait4() after PTRACE_ATTACH on a process which was already Stopped (T) before the PTRACE_ATTACH.

The workaround for RHEL-5.0 is the attached `fixkill.c', to be LD_PRELOADed:

# gcc -o ./fixkill.so fixkill.c -Wall -ggdb2 -shared -fPIC
# LD_PRELOAD=$PWD/fixkill.so nice -n20 ./totalview-8.2.0-1/toolworks/totalview.8.2.0-1/bin/totalviewcli ./manythreads

While it does not fix the Stopped (T)/PTRACE_ATTACH/wait4() problem, it fixes the more general problem of TotalView using kill(2) instead of tkill(2)/tgkill(2). If tkill(2) is used, the tasks do not get accidentally Stopped (T) and the incompatible RHEL-5.0 kernel behavior is not triggered.

Still, `totalview.8.2.0-1-linux-x86-64' does not behave correctly for me on either RHEL-4 or the upstream kernels. Tested on:

  kernel-2.6.9-55.EL.x86_64                OS RHEL-4.5
  kernel-vanilla-2.6.21-1.3190.fc7.i686    OS RHEL-5.1 beta

On these systems TVD hangs (it stops displaying the new threads, but it still communicates with the user).
Please note that the testcase `manythreads.c' calls pthread_create() _8_ times in total, but only 2 created threads are displayed below:

------------------------------------------------------------------------------
# gcc -o ./manythreads ./manythreads.c -Wall -ggdb2 -pthread
# totalview-8.2.0-1/toolworks/totalview.8.2.0-1/bin/totalviewcli ./manythreads
...
d1.<> drun
Thread 1.1 has appeared
Created process 1 (3774), named "manythreads"
Thread 1.1 has appeared
Thread 1.1 has exited
d1.<> Thread 1.2 has appeared
Thread 1.3 has appeared
d1.<> dstatus
1 (3774) Running [./manythreads]
1.1 (3774/3086657216) Running PC=0x4bee9f80
1.2 (3774/3086654352) Running PC=0xb7fb5410
1.3 (3774/3078261648) Running PC=0x00000000
d1.<> quit
Do you really wish to exit TotalView? y
Thread 1.9 has appeared
Thread 1.8 has appeared
Thread 1.7 has appeared
Thread 1.6 has appeared
Thread 1.5 has appeared
Thread 1.4 has appeared
Process 1 has exited

At the `dstatus' time all the threads are already running:

# ls -l /proc/3774/task/
total 0
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3774/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3775/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3776/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3777/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3778/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3779/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3780/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3781/
dr-xr-xr-x 4 root root 0 Sep 22 12:32 3782/
# _
------------------------------------------------------------------------------

As you can see, the kernel itself is aware of all the threads.

kill(2) vs. tkill(2)/tgkill(2):

I see a problematic use of the kill() syscall there (in the `trace0' attachment). kill() may deliver the signal to an arbitrary task of the thread group, while you use it to signal specific non-leader threads for ptrace() purposes. You should use the tkill() (or better, tgkill()) syscall instead. GDB calls it in its kill_lwp() function.
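
To illustrate the kill() vs. tgkill() point above, here is a minimal sketch of signalling one specific thread on Linux. It is not taken from `fixkill.c' or from GDB; the helper names `current_tid' and `signal_thread' are made up for this example, and it assumes glibc's syscall(2) wrapper with SYS_gettid and SYS_tgkill available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: kernel thread ID (TID) of the calling thread. */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/*
 * Hypothetical helper: send `sig' to exactly one thread.
 * kill(tgid, sig) lets the kernel pick any thread of the group;
 * tgkill(tgid, tid, sig) targets the task `tid' and fails if that
 * TID does not belong to the thread group `tgid'.
 * Returns 0 on success, -1 with errno set on failure.
 */
int signal_thread(pid_t tgid, pid_t tid, int sig)
{
    assert(tgid > 0 && tid > 0);
    return (int)syscall(SYS_tgkill, tgid, tid, sig);
}
```

For example, signal_thread(getpid(), current_tid(), 0) only checks that the target exists, because signal 0 is never delivered. The tgid argument is what makes tgkill() preferable to tkill(): it guards against a TID that has been reused by an unrelated process.
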
The RHEL-5 utrace-based ptrace distributes signals more evenly inside the thread group, so it is more likely to hit this TotalView race. Still, the TotalView behavior is a race on the upstream kernels as well; you try to work around it there by repeating kill()s until all the threads are collected by wait4(-1,...). It still makes the ptrace() handling challenging, and I expect other occasional racy bugs may occur there.

Any use of strace or another hooking tool makes the ESRCH error unreproducible. I was able to trace it while keeping it reproducible only by using the attached SystemTap script `ptrace.stp'. The trace of TotalView is in the attachment `ESRCH-decoded'; please search there for the first `ESRCH' error.

FYI, there is something I do not understand in the TotalView traces: The first task (1679) is caught using PTRACE_TRACEME. The first PTRACE_ATTACHed task is 1682. But the tasks 1680 and 1681 are mysteriously caught using only wait4() with no ptrace():

wait4 (-1, 0x40000003) = (status 0x137f) 1680
wait4 (-1, 0x40000003) = (status 0x137f) 1681

As there is no PTRACE_SETOPTIONS (to set PTRACE_O_TRACECLONE), how could that happen? And why does TotalView treat the tasks 1680..1681 (caught only by wait4()) differently from 1682..1687 (properly PTRACE_ATTACH + wait4()ed)? This is outside the scope of this Bug and it may not cause any problems, though.
Created attachment 203581 [details]
`manythreads.c' testcase for GCC to be fed into TotalView.
gcc -o ./manythreads ./manythreads.c -Wall -ggdb2 -pthread
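
When running this testcase, the number of threads the debugger displays can be cross-checked against what the kernel reports, as was done with `ls -l /proc/3774/task/' in the description. A minimal sketch of the same check in C follows; the function name `count_tasks' is made up for this example, and it assumes a Linux /proc filesystem is mounted.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the entries of /proc/<pid>/task, i.e. the
 * threads the kernel knows about in that thread group.
 * Returns the count, or -1 if the directory cannot be opened.
 */
int count_tasks(pid_t pid)
{
    char path[64];
    struct dirent *entry;
    DIR *dir;
    int count = 0;

    assert(pid > 0);
    snprintf(path, sizeof path, "/proc/%ld/task", (long)pid);
    dir = opendir(path);
    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')    /* skip "." and ".." */
            count++;
    }
    closedir(dir);
    return count;
}
```

For the run shown in the description, the listing had 9 entries (3774 through 3782): the main thread plus the 8 created threads. Calling count_tasks(getpid()) from a single-threaded program should report 1.
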
Created attachment 203591 [details]
`attachstopped.c' testcase to triage the kernel bug without TotalView.
Created attachment 203601 [details]
`fixkill.c' to turn kill(2)->tkill(2); fixes up TotalView && works around the RHEL-5.0 kernel bug.
gcc -o ./fixkill.so fixkill.c -Wall -ggdb2 -shared -fPIC
LD_PRELOAD=$PWD/fixkill.so .../bin/totalviewcli ...
Created attachment 203611 [details]
Trace of the TotalView run on the RHEL-5.0 kernel; search for ESRCH there.
Created attachment 203621 [details]
SystemTap nonintrusive ptrace(2) && wait4(2) tracer.
stap -g -u ./ptrace.stp | tee log
Other tracing methods change the timing etc. making the problem unreproducible.
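
The trace lines quoted in the description show raw hex values for the wait4() options (0x40000003) and status (0x137f). A minimal sketch for decoding such values with the standard <sys/wait.h> macros follows. It is not part of `ptrace.stp'; the names `stop_signal' and `print_wait_options' are made up for this example, and the fallback definition of __WALL assumes the Linux value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

#ifndef __WALL
#define __WALL 0x40000000   /* Linux value; assumed if the header hides it */
#endif

/*
 * Hypothetical helper: return the signal that stopped the task,
 * or -1 if `status' does not describe a stop.
 */
int stop_signal(int status)
{
    if (WIFSTOPPED(status))
        return WSTOPSIG(status);
    return -1;
}

/* Hypothetical helper: print the option bits that are set in `options'. */
void print_wait_options(FILE *out, int options)
{
    assert(out != NULL);
    fprintf(out, "options 0x%x:%s%s%s\n", (unsigned)options,
            (options & __WALL) ? " __WALL" : "",
            (options & WUNTRACED) ? " WUNTRACED" : "",
            (options & WNOHANG) ? " WNOHANG" : "");
}
```

With these, stop_signal(0x137f) gives 19, which is SIGSTOP on x86, and 0x40000003 decomposes into __WALL | WUNTRACED | WNOHANG.
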