Created attachment 320341 [details] Reproduce package Description : Several users have reported severe performance problems on X86_64 systems running RedHat Linux versions which implement ptrace using utrace. The nature of the problems indicates that it's some kind of a race condition in the kernel. The ptrace(PTRACE_CONT, ...) system call is sometimes taking as long a 2 seconds to complete. The applications which have this problem are multi-thread applications, and the problem occurs when TotalView is restarting all of the threads after a breakpoint stop. Mostly, we see this problem with programs in which one of the threads is loading a bunch of shared libraries using dlopen. Typically, ptrace(PTRACE_CONT, ...) completes fairly quickly on the first thread started, but gets slower and slower as more threads are started. The problem is very racy - the total time taken to complete our test program varies from under a minute to 7 minutes or so. Attempts to measure what is going on have been disappointing - when we use a profiling tool to try to see what's going on, the problem appears to disappear. If we turn on some of TotalView's internal debug information (with timestamps) the problem seems to disappear. If we run strace on TotalView, the problem can be seen - that's where the information about the time taken by ptrace(PTRACE_CONT, ...) comes from. Running the same test program under gdb does not appear to see the same problem - program completion time is a couple of seconds. However, if we use strace to measure the performance of gdb, the problem suddenly appears - the completion time jumps from a couple of seconds to a couple of minutes! This might be related to RH Bugzilla bug #455060, see particularly comments #25, and #30. This is a very high priority bug from our point of view. Several mutual customers have reported problems that we suspect are due to the same root cause. This is TotalView internal bug #11556. Environment : Red Hat Enterprise Linux Server release 5.2 (Tikanga) How-To-Repeat: Use the attached program. Type "make", then use the shell file strace_gdb_run.sh to run gdb under strace. The program measures the time taken to load all of the shared libraries, and prints that out at the end. Typically we see numbers around 140 seconds for an strace run and 1-2 seconds for a non-strace run (see run-gdb-multiple.sh) (on a single processor X86_64 machine). If you do not see these results, the number of threads started can be varied by using the -nthreads <n> option to the program. The default value is 3, which means that 3 subsidiary threads (which will run a tight compute-bound loop) are started. The main thread then loads and unloads 50 shared libraries, and exits. > cat /etc/*-rel* cat: /etc/lsb-release.d: Is a directory Red Hat Enterprise Linux Server release 5.2 (Tikanga) > uname -a Linux rhel52-x8664.totalviewtech.com 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux > tar -zxf dlopen_test.tar.gz > make > ./strace_gdb_run.sh > less strace_gdb.res Aggregate Time: 247.150151 sec > ./run-gdb-multiple.sh 1 > less dlout.gdb Aggregate Time: 0.979023 sec => when the same command 'gdb -x run.gdb -q dlopen_test' is run under strace, it takes 2 magnitudes longer. XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
In addition to RH 5, one our mutual customers sees this problem on FC 5. So, it would be very important to have this fixed in FC as well. They are planning on moving to FC 8 soon.
update: just letting you know ... the bug is under attention, I'm still investigating.
I was not able to get the time, mentioned by Seppo, for './strace_gdb_run.sh'. 20 seconds is absolute record for me. I'm seeing that ptrace(PTRACE_CONT, ... ) takes too long from time to time. Obviously, we have race condition, unfortunately I was not able to identify it yet. The problem is much, much more visible with nosmp option passed to kernel.
Created attachment 325373 [details] sysrq-t output for related tasks sysrq-t output for related tasks, at the moment when race condition happen. collected w/ nosmp option.
CLOSED as WONTFIX due to way too much efforts required, incomparable with bug implication. With change of the user-visible behaviour.