From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1

Description of problem:
In the JRockit JVM, we send signals to our threads to suspend them. On Red Hat Enterprise Linux 4.0, we have experienced problems with this signaling. More specifically, pthread_kill() returns a non-zero value when called with a target thread that we *know* exists.

When I attach to the process with gdb after this has happened, some of the thread information looks terribly wrong. I believe something is broken in the kernel (most likely) or in pthreads/glibc. Look at this short extract:

3 Thread 14298032 (LWP 9995) 0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
2 Thread 19377072 (LWP 9984) 0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
1 Thread 1119936 (LWP 9984) 0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2

Notice that *both* thread 1 (the initial thread) and thread 2 share the same LWP number! This is certainly not correct. In this case, I attached to the process right after we failed to send a signal with pthread_kill(). The thread we failed to send a signal to was thread 2. More specifically, the value of the pthread_t variable passed as the first argument to pthread_kill() was exactly 19377072. But thread 2 is definitely running; I can get a backtrace for it in gdb.

What's more, if I produce a thread listing with ps -eLf, I *don't* get any listing of thread 2. More specifically, I only get one thread listed with tid 9984, and the total number of threads listed for my process is one less than the number of threads presented by gdb. The number of threads listed in gdb corresponds to the JVM's knowledge of its started threads.
Since we store both the pthread_t handle and the tid (from gettid()) for all threads we start, we know that the values listed by gdb for thread 1 (pthread_t = 1119936, tid = 9984) are correct; however, the values listed for thread 2 (pthread_t = 19377072, tid = 9984) are not. The pthread_t handle is correct, but the LWP/tid is wrong: it should have been 20539.

This has only been shown to happen on 4-way or 8-way machines. The test that provokes the problem starts a large number of threads. It seems likely to me that this is a race condition in some internal kernel thread table.

We're using glibc 2.3.4-stable.

The same problem has also been spotted on Fedora Core release 3 (Heidelberg), Linux version 2.6.10-1.770_FC3smp (bhcompile.redhat.com) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 SMP Thu Feb 24 14:20:06 EST 2005, but on no other Linux distros or kernels.

I am currently also investigating a similar problem on RHEL40 on IA64; it is possibly, but not certainly, the same problem. It too happens only on computers with at least 4 CPUs. This may indicate that the problem is cross-platform.

Version-Release number of selected component (if applicable):
2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux, on Red Hat Enterprise Linux AS release 4 (Nahant)

How reproducible:
Always

Steps to Reproduce:
I am working on a simpler reproducer, but I have not yet succeeded. My attempts so far have been very simplistic, whereas JRockit does a lot of thread work, some of which seems to be needed to reproduce this problem. I'll continue working on a simpler reproducer after the Easter holiday.

Additional info:
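For illustration only, here is a minimal sketch of the bookkeeping described above: each thread's pthread_t handle is stored next to the tid it reports for itself, and pthread_kill() is checked against a thread that is known to be alive. The names thread_record, worker and probe_live_thread are hypothetical (JRockit's real structures are not shown in this report), and signal 0 is used as a pure existence probe rather than the suspend signal JRockit sends.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-thread record; not JRockit's actual data structure. */
struct thread_record {
    pthread_t handle;  /* value filled in by pthread_create() */
    pid_t tid;         /* kernel thread id (LWP) reported by the thread */
    sem_t started;     /* posted once tid has been stored */
    sem_t release;     /* posted when the thread may return */
};

static void *worker(void *arg)
{
    struct thread_record *rec = arg;

    rec->tid = (pid_t)syscall(SYS_gettid);
    sem_post(&rec->started);
    sem_wait(&rec->release);  /* stay alive until the prober is done */
    return NULL;
}

/* Starts one joinable thread, probes it with signal 0 while it is known to
 * be alive, and returns pthread_kill()'s result (0 means success).
 * Returns -1 if the thread could not be created. */
int probe_live_thread(pid_t *tid_out)
{
    struct thread_record rec;
    int rc;

    sem_init(&rec.started, 0, 0);
    sem_init(&rec.release, 0, 0);
    if (pthread_create(&rec.handle, NULL, worker, &rec) != 0)
        return -1;

    sem_wait(&rec.started);            /* tid is valid from here on */
    rc = pthread_kill(rec.handle, 0);  /* signal 0: error checking only */
    *tid_out = rec.tid;

    sem_post(&rec.release);
    pthread_join(rec.handle, NULL);
    sem_destroy(&rec.started);
    sem_destroy(&rec.release);
    return rc;
}
```

On a healthy system the probe returns 0 and the stored tid differs from the process id, since only the initial thread shares its tid with the pid. The failure reported here corresponds to a non-zero result for a thread that was started successfully.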
*** Bug 152013 has been marked as a duplicate of this bug. ***
Writing a simple reproducer turned out to be really hard. We're obviously doing something special in JRockit that I don't fully understand, and that is needed to reproduce this problem.

I have confirmed that the problem occurs with the latest officially released JRockit. Since this version is a bit sloppy about checking the result of pthread_kill(), it continues past a failed pthread_kill() into a sigwaitinfo(), waiting for a signal to come back from the thread it tried to signal (the suspendee). Since there is no such thread anymore, we effectively just hang in this sigwaitinfo().

To reproduce:
1) Download JRockit 5.0 from http://download2.bea.com/pub/jrockit/50/jrockit-jdk1.5.0-linux-ia32.bin
2) Install JRockit (run the downloaded file).
3) Compile ParHello.java:
$ /path/to/jrockit/bin/javac ParHello.java
4) Start JRockit with the reproducer, on a 4-way (or more) i686 machine:
$ /path/to/jrockit/bin/java ParHello
5) Now you'll get something like:
0 threads started
512 threads started
1024 threads started
1536 threads started
2048 threads started
... etc
6) Wait. After less than 15 minutes, the output should just stop.
7) Problem is reproduced. To check, attach with gdb, run "info threads" and note that the last two threads (typically) have the same LWP.
Created attachment 112455: Java reproducer
I have also found out the following:

* I was previously incorrect when I said that the "disappeared" thread was still running. The backtrace that gdb showed me was that of the main thread.
* I copied the definition of struct pthread from the glibc sources into JRockit. When casting the pthread_t handle to a (struct pthread*), I note some interesting facts:
  - When we set up the newly started thread, I check the "tid" field in struct pthread. It is indeed the same value that gettid() returns.
  - When I later want to suspend the thread, and it has "disappeared", the tid field is 0.
  You should be able to confirm this yourselves: when the problem has been reproduced as described above, the "disappeared" thread's pthread_t handle is still shown in the gdb thread list.
* By instrumenting the JRockit code and adding test code, I also found out the following:
  - If the thread "disappears", it happens just before I want to suspend it for the first time.
  - The time from when the thread was started (or at least, from the point where I was able to store the system time) until I wanted to suspend it, and thus discovered that the "tid" field was 0, is typically 8-11 ms.
  - Just as a test, I added code to make all threads send a signal to themselves at startup time. This didn't change the behaviour.
* To me, it looks like the thread is abruptly killed by the kernel. I base this theory on the following:
  - The problem doesn't seem to have anything to do with signaling.
  - ps says this thread does not exist.
  - Not only JRockit, but pthread too seems to believe the thread should still exist. (I assume that gdb gets its list of threads for "info threads" from pthread.)
  - The tid field is set to 0. As far as I can tell from the glibc source, the only reason for the tid to be set to 0 is that glibc calls set_tid_address on the "tid" field and sets the CLONE_CHILD_CLEARTID flag. As I understand it, this causes the kernel to write 0 to the tid field when the thread dies.
  - The thread has been started correctly (since it's filling in debug data such as startup time).
  - The thread has disappeared after less than about 10 ms. It has not correctly run the shutdown code.

I can find no information about any suspicious kernel activity in the syslog.

I apologize for not being able to boil this down to a simpler reproducer. I really hope you can have a look at it nevertheless, since it's a complete blocker for us running on RH Enterprise Linux 4.0.
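The CLONE_CHILD_CLEARTID behaviour that the theory above relies on can be checked in isolation. The following is a minimal sketch, not glibc's actual thread setup: it uses the glibc clone() wrapper directly, shares the address space with CLONE_VM so that the kernel's write lands in the parent's memory, and uses a plain variable (tid_field, a hypothetical name) in place of the "tid" member of struct pthread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Stands in for the "tid" field of glibc's struct pthread. */
static volatile pid_t tid_field = -1;

static int child_fn(void *arg)
{
    (void)arg;
    return 0;  /* exit immediately; the kernel then clears tid_field */
}

/* Creates a child that shares our address space, waits for it to exit, and
 * returns the value the kernel left in tid_field. Returns -1 on failure. */
pid_t tid_field_after_exit(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid;

    if (stack == NULL)
        return -1;

    tid_field = -1;
    /* CLONE_CHILD_SETTID stores the child's tid in tid_field at startup;
     * CLONE_CHILD_CLEARTID asks the kernel to zero it when the child exits. */
    pid = clone(child_fn, stack + stack_size,
                CLONE_VM | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, NULL, NULL, (pid_t *)&tid_field);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, NULL, 0) != pid)
        return -1;  /* child state unknown; leave its stack alone */

    free(stack);
    return tid_field;
}
```

If the field reads 0 after the child has exited, that matches the reporter's reading of the glibc source: a zero "tid" in struct pthread is what the kernel leaves behind when the thread is gone, whether or not the thread ran its own shutdown code.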
Need a simple C test case here to work with.
Created attachment 112779: Simple C-only reproducer

Compile with:
gcc -g -Wall -lpthread -o rhel_broken_signals rhel_broken_signals.c

Reproduces problem in fractions of a second.
Alright, now I have a working reproducer in plain C; the complete program is only about 160 lines long.

Reproduction steps:
1) Compile reproducer:
host$ gcc -g -Wall -lpthread -o rhel_broken_signals rhel_broken_signals.c
2) Run reproducer:
host$ ./rhel_broken_signals
3) Expected result:
Starting run series 0: ..........CJ
Starting run series 1: ..........CJ
Starting run series 2: ..........CJ
Starting run series 3: ..........CJ
Starting run series 4: ..........CJ
Starting run series 5: ..........CJ
etc.

Actual result, on Red Hat Enterprise Linux 4.0 i386:
Starting run series 0: ..........CJ
Starting run series 1: ....Segmentation fault (core dumped)

or (less often, about 1/4 of the time):
Starting run series 0: ..........CJ
Starting run series 1: ....self getspecific failed 0xab1660 != 0x804a1a8!
Segmentation fault (core dumped)
I discovered that the crucial point was that a thread needs to cause a SIGSEGV, in combination with pthread functions. After that discovery, a C reproducer was quite trivial.

After writing the reproducer, I was struck by the thought that perhaps any signal sent to the thread itself would do. I modified the reproducer, and made it much simpler by just doing pthread_kill on the thread itself. It reproduces just as well. You can try this version by defining USE_PTHREAD_KILL in my reproducer. I kept the original version for two reasons: completeness, and the spectacular way it sometimes makes pthread_getspecific return nonsensical values.

In the beginning I started 100 threads at a time. It turned out that so many were not needed to reproduce the problem. I now define THREAD_COUNT as 10. If you have problems reproducing, try increasing it back to 100. If I decrease THREAD_COUNT down to 1 I *still* get the crash, but it takes a dozen runs, i.e. several seconds.
And oh, btw, I tested it on an SMP Itanium, and it crashes just as fast there. On the other hand, a single-CPU Itanium seems immune. I've not tested on a single-CPU x86 machine; that's left as an exercise for the Red Hat bug fixing team. But my guess is that the bug is platform-independent but only occurs on multi-CPU machines.
It is not kosher to call pthread_join with a detached thread, and that alone could explain the segv crashes. Please try a test program that conforms to POSIX. (pthread_join is not required to give you a nice ESRCH return; passing it a detached thread invokes undefined behavior.) That does not explain the pthread_getspecific results, but I'd like to see a test program using only valid behavior that reproduces the problem.

sigaction is process-wide, so you should not call it inside each thread. Your call to restore the old action could restore it to SIG_DFL and cause the original fault to give you a core dump, given the right race.

The USE_PTHREAD_KILL version of the test looks like it's valid except for the use of pthread_join. Since that scenario is simpler (not involving setcontext et al), please use the pthread_kill version of the corrected test and show how the problem manifests there.
And you're quite happy with this causing a crash in pthread_join, even though it doesn't happen a) on other kernel versions (including older RedHat releases), or b) if the thread in question does not cause segfaults and/or sends a signal to itself, or c) if run on a non-SMP machine? Even if it's not "kosher"? Doesn't that kind of smell "race" to you? At least a faint whiff? We'll look into how to change the reproducer to suit your taste, but I still think the reproducer, as it is now, shows clearly that something fishy is going on.
Created attachment 112801: An even *simpler* C reproducer, without pthread_join

Compile with:
gcc -g -Wall -lpthread -o rhel_broken_signals_with_cancel rhel_broken_signals_with_cancel.c
So, now, there you are. This reproducer just focuses on the simple pthread_kill() version. I've replaced the call to pthread_join() with a call to pthread_cancel(). And pthread_cancel() *is* indeed valid to call, as far as I can tell from the man pages distributed with RHEL40 and opengroup.org. This at least gives you some concrete reproducer. But I'm certain that either a) this bug shows itself in many other occasions, or b) that you have several, similar bugs caused by race conditions.
The test case from comment #12 is still invalid. You are creating detached threads that return quickly, and then trying to use their pthread_t's. If a thread is detached and you are not positive through other means that it is still running (e.g. having it blocked in a mutex or barrier or infinite loop of some kind), then you cannot use the pthread_t *in any way*, certainly not in pthread_cancel. The POSIX standard says the same things about pthread_cancel as it says about pthread_join in this regard: you cannot rely on getting ESRCH for a bad thread handle (though the implementation is free to diagnose it that way). You are invoking undefined behavior by passing such a pthread_t value.

If the bug exists as you describe it, then there is no need to use detached threads in your test. Why don't you use a test that does not detach the threads? Or one where the threads block rather than returning?
Opened 154221 in relation to this problem; it may be the actual problem here.
Suggest we close this one and focus on just 154221?
Sure, fine by us, 154221 deals with the iret fault (do_exit), 154972 deals with what caused the fault with a case repro.
Closing as per comment #17.