152012 – Actually existing threads are "lost", somehow

Summary: Actually existing threads are "lost", somehow

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	152013 (view as bug list)
Depends On:
Blocks:

Reported:	2005-03-24 12:08 UTC by Magnus Ihse Bursie
Modified:	2007-11-30 22:07 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-04-18 19:01:56 UTC
Target Upstream Version:
Embargoed:

Attachments
Java reproducer (943 bytes, text/java) 2005-03-30 15:15 UTC, Magnus Ihse Bursie	no flags	Details
Simple C-only reproducer (3.73 KB, text/plain) 2005-04-06 20:34 UTC, Magnus Ihse Bursie	no flags	Details
An even *simpler* C reproducer, without pthread_join (1.87 KB, text/plain) 2005-04-07 07:58 UTC, Magnus Ihse Bursie	no flags	Details

Description Magnus Ihse Bursie 2005-03-24 12:08:24 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1

Description of problem:
In the JRockit JVM, we send signals to our threads in order to suspend them. On Red Hat Enterprise Linux 4.0, we have experienced problems with this signaling. More specifically, pthread_kill() returns a non-zero value when called with a target thread that we *know* exists. When I attach to the process with gdb after this has happened, some of the thread information looks terribly wrong. I believe something is broken in the kernel (most likely) or in pthreads/glibc.

Look at this short extract:
  3 Thread 14298032 (LWP 9995)  0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  2 Thread 19377072 (LWP 9984)  0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  1 Thread 1119936 (LWP 9984)  0x009757a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2

Notice that *both* thread 1 (the initial thread) and thread 2 share the same LWP number! This is certainly not correct. In this case, I attached to the process right after we failed to send a signal with pthread_kill(). The thread we failed to signal was thread 2. More specifically, the value of the pthread_t variable passed as the first argument to pthread_kill() was exactly 19377072.

But thread 2 is definitely running, I can get a backtrace for it in gdb.

What's more, if I produce a thread listing with ps -eLf, I *don't* get any listing of thread 2. More specifically, I only get one thread listed with tid 9984, and the total number of threads listed for my process is one less than the number of threads presented by gdb. The number of threads listed in gdb corresponds to the JVM's knowledge of its started threads.

Since we store both the pthread_t handle and the tid (from gettid()) for all threads we start, we know that the values listed by gdb for thread 1 (pthread_t = 1119936, tid = 9984) are correct, whereas the values listed for thread 2 (pthread_t = 19377072, tid = 9984) are not. The pthread_t handle is correct, but the LWP/tid is wrong; it should have been 20539.
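For readers who want to picture the bookkeeping described above, here is a minimal sketch of recording both identifiers per thread. The struct and function names are hypothetical and are not taken from JRockit; the sketch only assumes Linux, where the kernel thread id is available through the gettid system call.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping record; not JRockit's actual structure. */
struct thread_record {
    pthread_t handle;  /* library-level handle, as passed to pthread_kill() */
    pid_t     tid;     /* kernel thread id (the LWP shown by gdb and ps -eLf) */
};

/* Must be called by the thread being recorded: only it can ask for its own tid. */
static void record_self(struct thread_record *rec)
{
    rec->handle = pthread_self();
    rec->tid = (pid_t)syscall(SYS_gettid);
}

static void *worker(void *arg)
{
    record_self((struct thread_record *)arg);
    return NULL;
}

/* Returns 1 if a new thread recorded a tid different from the main thread's,
 * 0 if it did not, and -1 if the thread could not be created. */
static int tids_are_distinct(void)
{
    struct thread_record main_rec, child_rec = {0};
    pthread_t t;

    record_self(&main_rec);
    if (pthread_create(&t, NULL, worker, &child_rec) != 0)
        return -1;
    pthread_join(t, NULL);  /* joinable thread: the record is complete after this */
    return child_rec.tid != 0 && child_rec.tid != main_rec.tid;
}
```

On a healthy system every live thread has its own tid, so two gdb entries sharing one LWP, as in the extract above, contradict records kept this way.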

This has only been shown to happen on 4-way or 8-way machines. The test that provokes the problem starts a large number of threads. It seems likely to me that this is some kind of race condition in an internal kernel thread table.

We're using glibc 2.3.4-stable.

The same problem has also been spotted on Fedora Core release 3 (Heidelberg), Linux version 2.6.10-1.770_FC3smp (bhcompile.redhat.com) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 SMP Thu Feb 24 14:20:06 EST 2005, but on no other Linux distros or kernels.

I am currently also investigating a similar problem on RHEL40 on IA64; it is possibly, but not certainly, the same problem. It also happens only on computers with at least 4 CPUs. This may indicate that the problem is cross-platform.

Version-Release number of selected component (if applicable):
2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux, on Red Hat Enterprise Linux AS release 4 (Nahant)

How reproducible:
Always

Steps to Reproduce:
I am working on trying to write a simpler reproducer, but I have not yet succeeded. My attempts at reproducers have been very simplistic, but in the JRockit case we're doing a lot of thread stuff, some of which seems to be needed to reproduce this problem. I'll continue working on getting a simpler reproducer after the Easter holiday.


Additional info:

Comment 1 Magnus Ihse Bursie 2005-03-24 12:13:17 UTC

*** Bug 152013 has been marked as a duplicate of this bug. ***

Comment 2 Magnus Ihse Bursie 2005-03-30 15:14:04 UTC

Writing a simple reproducer turned out to be really hard. We're obviously doing
something special in JRockit that I don't fully understand, and that's needed to
reproduce this problem.

I have confirmed that the problem occurs with the latest officially released
JRockit. Since this version is a bit sloppy on the error checking on
pthread_kill(), it'll continue past a failed pthread_kill() into a
sigwaitinfo(), waiting for a signal coming back from the thread we wanted to
send pthread_kill() to first (the suspendee). Since there is no such thread
anymore, we'll effectively just hang on this sigwaitinfo().

To reproduce:
1) Download JRockit 5.0 from
http://download2.bea.com/pub/jrockit/50/jrockit-jdk1.5.0-linux-ia32.bin
2) Install JRockit (run the downloaded file).
3) Compile ParHello.java:
$ /path/to/jrockit/bin/javac ParHello.java
4) Start JRockit with the reproducer, on a 4-way (or more) i686 machine:
$ /path/to/jrockit/bin/java ParHello
5) Now you'll get something like:
0 threads started
512 threads started
1024 threads started
1536 threads started
2048 threads started
... etc
6) Wait. After less than 15 minutes, the output should just stop.
7) Problem is reproduced. To check, attach with gdb, run "info threads" and
watch how the last two threads (typically) have the same LWP.

Comment 3 Magnus Ihse Bursie 2005-03-30 15:15:40 UTC

Created attachment 112455 [details]
Java reproducer

Comment 4 Magnus Ihse Bursie 2005-03-30 15:34:47 UTC

I have also found out the following:

* I was previously incorrect when I said that the "disappeared" thread was still
running - the backtrace that gdb showed me was that of the main thread. 

* I copied the definition of struct pthread from the glibc sources into JRockit.
When casting the pthread_t handle to a (struct pthread*), I note some
interesting facts:

  - When we set up the newly started thread, I check the "tid" field in struct
pthread. It is indeed the same value as gettid() returns.

  - When I later on want to suspend the thread, and it has "disappeared", then
the tid field is 0.

You should be able to confirm this yourselves: when the problem has been
reproduced as described above, the "disappeared" thread's pthread_t handle is
still shown in the gdb thread list.

* By instrumenting the JRockit code and adding test code, I also found out the
following:

  - If the thread "disappears", it happens just before I want to suspend it the
first time.

  - The time that has passed from when the thread was started (or at least, from
where I was able to store the system time) until I wanted to suspend it, and
thus discovered that the "tid" field was 0, is typically 8-11 ms.

  - Just as a test, I added code to make all threads send a signal to themselves
at startup time. This didn't change the behaviour. 

* To me, it looks like the thread is abruptly killed by the kernel. I base this
theory on the following:

  - The problem doesn't seem to have anything to do with signaling.

  - ps says this thread does not exist.

  - Not only JRockit, but pthread too seems to believe the thread should still
exist. (I assume that gdb gets its list of threads for "info threads" from
pthread).

  - The tid field is set to 0. As far as I can tell from the glibc source, the
only reason for the tid to be set to 0 is that glibc calls set_tid_address on
the "tid" field and sets the CLONE_CHILD_CLEARTID flag. As I understand, this
will cause the kernel to write 0 to the tid field when the thread dies.

  - The thread has been started correctly (since it's filling in debug data such
as startup time).

  - The thread has disappeared after less than ca 10 ms. It has not correctly
run the shutdown code.
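The CLONE_CHILD_CLEARTID behaviour mentioned above can be observed in isolation. The sketch below does not use pthreads or glibc internals; it assumes Linux and calls the clone() wrapper directly with a shared address space, then checks that the kernel has zeroed the tid slot once the child has exited. The function and variable names are made up for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The child does nothing and exits immediately. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns the content of the tid slot after the child has exited
 * (expected: 0), or -1 if the child could not be created or reaped. */
static int tid_slot_after_exit(void)
{
    enum { STACK_SIZE = 64 * 1024 };
    static volatile pid_t tid_slot;   /* visible to the child through CLONE_VM */
    char *stack = malloc(STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return -1;
    tid_slot = -1;

    /* CLONE_CHILD_SETTID stores the child's tid in the slot when it starts;
     * CLONE_CHILD_CLEARTID makes the kernel write 0 there when it exits. */
    pid = clone(child_fn, stack + STACK_SIZE,
                CLONE_VM | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, NULL, NULL, (pid_t *)&tid_slot);
    if (pid == -1 || waitpid(pid, NULL, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);
    return (int)tid_slot;
}
```

A zero in the slot therefore says only that the kernel saw the thread exit; it does not say why it exited.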

I can find no information about any suspicious kernel activity in the syslog.

I apologize for not being able to boil this down to a simpler reproducer. I
really hope you can have a look at it nevertheless, since it's a complete
blocker for us running on RH Enterprise Linux 4.0.

Comment 5 Bob Johnson 2005-04-01 13:43:06 UTC

Need a simple C test case here to work with.

Comment 6 Magnus Ihse Bursie 2005-04-06 20:34:11 UTC

Created attachment 112779 [details]
Simple C-only reproducer

Compile with:
gcc -g -Wall -lpthread -o rhel_broken_signals rhel_broken_signals.c

Reproduces problem in fractions of a second.

Comment 7 Magnus Ihse Bursie 2005-04-06 20:39:06 UTC

Alright, now I have a working reproducer in plain C, the complete program being
about 160 lines long only.

Reproduction steps:
1) Compile reproducer:
host$ gcc -g -Wall -lpthread -o rhel_broken_signals rhel_broken_signals.c
2) Run reproducer:
host$ ./rhel_broken_signals
3) Expected result:

Starting run series 0: ..........CJ

Starting run series 1: ..........CJ

Starting run series 2: ..........CJ

Starting run series 3: ..........CJ

Starting run series 4: ..........CJ

Starting run series 5: ..........CJ
etc.

Actual result, on Red Hat Enterprise Linux 4.0 i386:

Starting run series 0: ..........CJ

Starting run series 1: ....Segmentation fault (core dumped)

or (less often, about 1/4 of the time)

Starting run series 0: ..........CJ

Starting run series 1: ....self getspecific failed 0xab1660 != 0x804a1a8!
Segmentation fault (core dumped)

Comment 8 Magnus Ihse Bursie 2005-04-06 20:49:18 UTC

I discovered that the crucial point was that a thread needs to cause a SIGSEGV,
in combination with pthread functions. After that discovery, a C reproducer was
quite trivial.

After writing the reproducer, I was struck by the thought that perhaps any
signal sent to the thread itself would do. I modified the reproducer, and made
it much simpler by just doing pthread_kill on the thread itself. It reproduces
just as well. You can try this version by defining USE_PTHREAD_KILL in my
reproducer.

I kept the original version for two reasons: completeness, and the spectacular
way it sometimes makes pthread_getspecific return nonsensical values. 

In the beginning I started 100 threads at a time. It turned out that so many were
not needed to reproduce the problem. I now define THREAD_COUNT as 10. If you
have problems reproducing, try increasing it back to 100. If I decrease THREAD_COUNT
down to 1 I *still* get the crash, but it takes a dozen runs, i.e. several
seconds.

Comment 9 Magnus Ihse Bursie 2005-04-06 21:17:20 UTC

And oh, btw, I tested it on an SMP Itanium, and it crashes just as fast there. On
the other hand, a single-CPU Itanium seems immune. I've not tested on a
single-CPU x86 machine; that's left as an exercise for the Red Hat bug fixing team.

But my guess is that the bug is platform-independent but only occurs on
multi-CPU machines.

Comment 10 Roland McGrath 2005-04-06 21:37:41 UTC

It is not kosher to call pthread_join with a detached thread, and that alone
could explain the segv crashes.  Please try a test program that conforms to POSIX.
(pthread_join is not required to give you a nice ESRCH return, it is invoking
undefined behavior to pass it a detached thread.)  That does not explain the
pthread_getspecific results, but I'd like to see a test program using only
valid behavior that reproduces the problem.

sigaction is process-wide, so you should not call it inside each thread.
Your call to restore the old action could restore it to SIG_DFL and cause the
original fault to give you a core dump, given the right race.

The USE_PTHREAD_KILL version of the test looks like it's valid except for the
use of pthread_join.  Since that scenario is simpler (not involving setcontext
et al), please use the pthread_kill version of the corrected test and show how
the problem manifests there.

Comment 11 Magnus Ihse Bursie 2005-04-06 22:07:58 UTC

And you're quite happy with this causing a crash in pthread_join, even though it
doesn't happen a) on other kernel versions (including older RedHat releases), or
b) if the thread in question does not cause segfaults and/or sends a signal to
itself, or c) if run on a non-SMP machine? Even if it's not "kosher"? Doesn't
that kind of smell "race" to you? At least a faint whiff?

We'll look into how to change the reproducer to suit your taste, but I still
think  the reproducer, as it is now, shows clearly that something fishy is going on.

Comment 12 Magnus Ihse Bursie 2005-04-07 07:58:51 UTC

Created attachment 112801 [details]
An even *simpler* C reproducer, without pthread_join

Compile with:
gcc -g -Wall -lpthread -o rhel_broken_signals_with_cancel rhel_broken_signals_with_cancel.c

Comment 13 Magnus Ihse Bursie 2005-04-07 08:06:05 UTC

So, now, there you are.

This reproducer just focuses on the simple pthread_kill() version. I've replaced
the call to pthread_join() with a call to pthread_cancel(). And pthread_cancel()
*is* indeed valid to call, as far as I can tell from the man pages distributed
with RHEL40 and opengroup.org.

This at least gives you some concrete reproducer. But I'm certain that either a)
this bug shows itself in many other occasions, or b) that you have several,
similar bugs caused by race conditions.

Comment 14 Roland McGrath 2005-04-08 16:52:44 UTC

The test case from comment #12 is still invalid.  You are creating detached
threads that return quickly, and then trying to use their pthread_t's.
If a thread is detached and you are not positive through other means that it is
still running (e.g. having it blocked in a mutex or barrier or infinite loop of
some kind), then you cannot use the pthread_t *in any way*, certainly not in
pthread_cancel.  The POSIX standard says the same things about pthread_cancel
as it says about pthread_join in this regard: you cannot rely on getting ESRCH
for a bad thread handle (though the implementation is free to diagnose it that
way)--you are invoking undefined behavior by passing such a pthread_t value.

If the bug exists as you describe it, then there is no need to use detached
threads in your test.  Why don't you use a test that does not detach the
threads?  Or, one where the threads block rather than returning?

Comment 15 David Simms 2005-04-08 17:08:16 UTC

Opened 154221 in relation to this problem, which may be the actual problem here.

Comment 16 Bob Johnson 2005-04-14 15:45:08 UTC

suggest we close this one and focus on just 154221 ?

Comment 17 David Simms 2005-04-15 12:25:11 UTC

Sure, fine by us, 154221 deals with the iret fault (do_exit), 154972 deals with
what caused the fault with a case repro.

Comment 18 Suzanne Hillman 2005-04-18 19:01:56 UTC

Closing as per comment #17.
