Description of problem: When debugging (using e.g. Totalview), one fails to attach to a pthread, causing the pthread to run loose and debugging session to fail. This is very likely a kernel problem (details below). This is a new problem to RedHat ES 5, and does not happen in e.g. Redhat ES4 (Update 4). We think that this is a critical bug in the core functionality of the kernel as more and more users are moving to ES 5. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Version-Release number of selected component (if applicable): less /etc/redhat-release Red Hat Enterprise Linux Server release 5 (Tikanga) less /proc/version Linux version 2.6.18-8.el5 (brewbuilder.redhat.com) (gcc version 4.1.1 2 0070105 (Red Hat 4.1.1-52)) #1 SMP Fri Jan 26 14:15:14 EST 2007 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX How reproducible: Does not happen always, only on roughly every tenth attempt. The odds of failure are increased, if the system is under load. I recommend running e.g. yes > /dev/null & in the background in order to increase the failure rate. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Steps to Reproduce: 1. Compile and link using Intel Compilers: /home/compilers/intel/c++/10.0.026/x86_64-linux/bin/icc -DPTHREADS_TOO -g -lpthread test_threads.cxx 2. launch under Totalview: tv82 ./a.out 3. Just hit 'Go' in TV. And compare the outcome to examples below. The error message in TV is: ERROR: Process 13693 failed to stop when we attached to it. Wait result = -1, timeout flag = 1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Actual results: -- Under TV: Master threads_and_clones process 13685, with parent 13601 Flags to be passed to clone: 'CLONE_VM' 13685: Created clone 1: clone_pid == 13686 13685: Created clone 2: clone_pid == 13687 13685: Created clone 3: clone_pid == 13688 13685: Created clone 4: clone_pid == 13689 Process 13685, ptid = 46912496231200 in make_pthreads 13685: Created 1075841344, me_index=5. 13685: Created 1077942592, me_index=6. 13685: Created 1080043840, me_index=7. ERROR: Process 13693 failed to stop when we attached to it. Wait result = -1, timeout flag = 1 13685: Created 1082145088, me_index=8. Master thread returned from creation barrier OOPS -- tx_threads_and_clones thread 1 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 4 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 2 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 6 timed out waiting for the lock XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Expected results: -- Under TVD: Master threads_and_clones process 13658, with parent 13601 Flags to be passed to clone: 'CLONE_VM' 13658: Created clone 1: clone_pid == 13659 13658: Created clone 2: clone_pid == 13660 13658: Created clone 3: clone_pid == 13661 13658: Created clone 4: clone_pid == 13662 Process 13658, ptid = 46912496231200 in make_pthreads 13658: Created 1075841344, me_index=5. 13658: Created 1077942592, me_index=6. 13658: Created 1080043840, me_index=7. 13658: Created 1082145088, me_index=8. Master thread returned from creation barrier OOPS -- tx_threads_and_clones thread 1 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 2 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 3 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 4 timed out waiting for the lock OOPS -- tx_threads_and_clones thread 5 timed out waiting for the lock XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Additional info: Our internal bug # is 10024. More details below: 1) After looking at this in some detail, I'm convinced it is a kernel failure. What appears to happen is that TV attempts to attach to a thread, the attach is apparently successful, the thread stops, but ptrace for some reason will not let you perform operations on the thread. The error you get is ESRCH, which means: The specified process does not exist, or is not currently being traced by the caller, or is not stopped (for requests that require that). The process does exist, somebody is attached to it (a further attempt to attach will fail with EPERM), and the thread is stopped. The TotalView error we see is: ERROR: Process 26435 failed to stop when we attached to it. Wait result = -1, timeout flag = 1 We expect to see an event, and we never saw one. We tried to perform an operation of the thread and failed, and we assume that's because the thread isn't yet stopped, so we wait for it to stop. Apparently the thread was stopped (so we never get an event), and the operation failure was for some other reason. 2) Also, this is NOT a problem e.g. on RedHat ES4 (testing using the same compilers) less /etc/redhat-release Red Hat Enterprise Linux AS release 4 (Nahant Update 4) less /proc/version Linux version 2.6.9-42.0.2.ELsmp (bhcompile.redhat.com) (gcc version 3.4 .6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Created attachment 194741 [details] Reproducer code
Confirming `totalview.8.2.0-1-linux-x86-64' reproduces the problem on RHEL-5.0: ERROR: Process 13693 failed to stop when we attached to it. Wait result = -1, (it is an ESRCH error from a PTRACE_CONT syscall) The minimal testcase (`manythreads.c') to be run by TotalView is a GCC-compiled multiple pthread_create()s calling code. (Reduced the testcase from the original ICC-compiled `test_threads.cxx'.) Testcase to reproduce the problem of TotalView without TotalView is `attachstopped.c'. It is a duplicate of the Bug 233540 Comment 0. The ESRCH error should be fixed in the RHEL-5.1 kernel (Bug 233540). (There is no guarantee of anything as the RHEL-5.1 product is still not gold.) The problem fixed is the behavior of wait4() after PTRACE_ATTACH on a process which has been already Stopped (T) before PTRACE_ATTACH. Workaround for RHEL-5.0 is the attached `fixkill.c' to be LD_PRELOADed. # gcc -o ./fixkill.so fixkill.c -Wall -ggdb2 -shared -fPIC # LD_PRELOAD=$PWD/fixkill.so nice -n20 ./totalview-8.2.0-1/toolworks/totalview.8.2.0-1/bin/totalviewcli ./manythreads While it does not fix the Stopped(T)/PTRACE_ATTACH/wait4() problem it fixes the more general problem of TotalView using kill(2) instead of tkill(2)/tgkill(2). If tkill(2) is used the tasks do not get accidentally Stopped (T) and the RHEL-5.0 incompatible kernel behavior does not get invoked. Still `totalview.8.2.0-1-linux-x86-64' does not behave right for me either on RHEL-4 or on the upstream kernels, tested on: kernel-2.6.9-55.EL.x86_64 OS RHEL-4.5 kernel-vanilla-2.6.21-1.3190.fc7.i686 OS RHEL-5.1 beta On these systems TVD hangs (suspends displaying the new threads, it still communicates with the user). Please note the testcase `manythreads.c' calls pthread_create() _8_ times in total but there are only 2 created threads displayed below: ------------------------------------------------------------------------------ # gcc -o ./manythreads ./manythreads.c -Wall -ggdb2 -pthread # totalview-8.2.0-1/toolworks/totalview.8.2.0-1/bin/totalviewcli ./manythreads ... d1.<> drun Thread 1.1 has appeared Created process 1 (3774), named "manythreads" Thread 1.1 has appeared Thread 1.1 has exited d1.<> Thread 1.2 has appeared Thread 1.3 has appeared d1.<> dstatus 1 (3774) Running [./manythreads] 1.1 (3774/3086657216) Running PC=0x4bee9f80 1.2 (3774/3086654352) Running PC=0xb7fb5410 1.3 (3774/3078261648) Running PC=0x00000000 d1.<> quit Do you really wish to exit TotalView? y Thread 1.9 has appeared Thread 1.8 has appeared Thread 1.7 has appeared Thread 1.6 has appeared Thread 1.5 has appeared Thread 1.4 has appeared Process 1 has exited At the `dstatus' time all the threads already run: # ls -l /proc/3774/task/ total 0 dr-xr-xr-x 4 root root 0 Sep 22 12:32 3774/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3775/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3776/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3777/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3778/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3779/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3780/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3781/ dr-xr-xr-x 4 root root 0 Sep 22 12:32 3782/ # _ ------------------------------------------------------------------------------ You may see the kernel itself is aware of the threads there. kill(2) vs. tkill(2)/tgkill(2): I see there the problematic use of the kill() syscall (in the `trace0' attachment). kill() may deliver the signal to arbitrary task of the thread group while you use it there to signal specific non-leader threads for the ptrace() purposes. You should use the tkill() (or better tgkill()) syscall instead - GDB is calling it in its kill_lwp() function. The RHEL-5 utrace based ptrace has better distribution of the signals inside the thread group thus more probably hitting this TotalView's race. Still the TotalView's behavior is a race in the upstream kernels, you try to workaround it there by repeating kill()s until all the threads are collected by wait4(-1,...). Still it makes the ptrace() handling challenging and I expect some other occasional racy bugs may occur there. Any use of strace itself or other hooking tool makes the ESRCH error unreproducible, I was able to trace it while keeping reproducibility only using the attached SystemTap script `ptrace.stp'. Trace of TotalView is in the attachment `ESRCH-decoded', please search there for the first `ESRCH' error. FYI I do not understand in the TotalView traces: The first task (1679) is caught using PTRACE_TRACEME. The first PTRACE_ATTACHed task is 1682. But the tasks 1680 and 1681 are mysteriously caught using only wait4() with no ptrace(): wait4 (-1, 0x40000003) = (status 0x137f) 1680 wait4 (-1, 0x40000003) = (status 0x137f) 1681 As there is no PTRACE_SETOPTIONS (to set PTRACE_O_TRACECLONE) how could it happen? And why TotalView makes the difference between the tasks 1680..1681 (caught only by wait4()) and 1682..1687 (properly PTRACE_ATTACH + wait4()ed)? It is not a scope of this Bug and it may not cause any problems, though.
Created attachment 203581 [details] `manythreads.c' testcase for GCC to be fed into TotalView. gcc -o ./manythreads ./manythreads.c -Wall -ggdb2 -pthread
Created attachment 203591 [details] `attachstopped.c' testcase to triage the kernel bug without TotalView.
Created attachment 203601 [details] `fixkill.c' to turn kill(2)->tkill(2); fixes up TotalView && workarounds the RHEL-5.0 kernel bug. gcc -o ./fixkill.so fixkill.c -Wall -ggdb2 -shared -fPIC LD_PRELOAD=$PWD/fixkill.so .../bin/totalviewcli ...
Created attachment 203611 [details] Trace of the TotalView run on the RHEL-5.0 kernel; search for ESRCH there.
Created attachment 203621 [details] SystemTap nonintrusive ptrace(2) && wait4(2) tracer. stap -g -u ./ptrace.stp | tee log Other tracing methods change the timing etc. making the problem unreproducible.
*** This bug has been marked as a duplicate of 233540 ***