Bug 152013

Summary: Actually existing threads are "lost", somehow
Product: Red Hat Enterprise Linux 4 Reporter: Magnus Ihse Bursie <redhat.10.ihsebea>
Component: kernelAssignee: Dave Jones <davej>
Status: CLOSED DUPLICATE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0CC: jbaron, pfrields, riel
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-03-24 12:12:56 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Magnus Ihse Bursie 2005-03-24 12:11:02 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1

Description of problem:
In the JRockit JVM, we send signals to our threads, to suspend them. In RedHat Enterprise Linux 4.0, we have experienced problems with this signaling. More specifically, pthread_kill() returns a non-zero value when called with a target thread that we *know* exists. When I attach to the process with gdb when this has happened, some thread stuff look terribly wrong. I believe something is broken in the kernel (most likely) or in pthreads/glibc.

Look at this short extract:
  3 Thread 14298032 (LWP 9995)  0x009757a2 in _dl_sysinfo_int80 () from 
/lib/ld-linux.so.2
  2 Thread 19377072 (LWP 9984)  0x009757a2 in _dl_sysinfo_int80 () from 
/lib/ld-linux.so.2
  1 Thread 1119936 (LWP 9984)  0x009757a2 in _dl_sysinfo_int80 () from 
/lib/ld-linux.so.2

Notice that *both* thread 1 (the initial thread) and thread 2 share the same LWP number! This is certainly not correct. In this case, I've just attached to the process since we failed to send a signal with pthread_kill(). The thread we failed to send a signal to was -- thread 2. More specifically, the value ot the pthread_t variable sent as first argument to pthread_kill() was exactly  19377072.

But thread 2 is definitely running, I can get a backtrace for it in gdb.

What's more, is that if I produce a thread listing with ps -eLf, I *don't* get any listing of thread 2. Or more specifically, I only get one thread listed with tid 9984, and the total number of threads listed for my process is one less than the number of threads presented by gdb. The number of threads listed in gdb is in correspondance with the JVM's knowledge of it's started threads.

Since we store both the pthread_t handle and the tid (by gettid()) for all threads we start, we know that the values listed by gdb for thread 1 (pthread_t = 1119936 , tid = 9984)is correct, however the values listed for thread 2 (pthread_t = 19377072, tid = 9984) is incorrect. The pthread_t handle is correct, but the LWP/tid is incorrect, it should really have been 20539.

This has only been shown to happen on 4-way or 8-way machines. The test that provokes the problem starts numerous amounts of threads. It seems likely to me to be some kind of race condition of some internal kernel thread table. 

We're using glibc 2.3.4-stable.

The same problem have also been spotted on Fedora Core release 3 (Heidelberg), Linux version 2.6.10-1.770_FC3smp (bhcompile.redhat.com) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 SMP Thu Feb 24 14:20:06 EST 2005, but on no other Linux distros or kernels.

I am currently also investigating a similar problem on RHEL40 on IA64; it seems possible but not certain to be the same problem. This too also happens only on computers with at least 4 CPU's. This can indicate that the problem is cross-platform.

Version-Release number of selected component (if applicable):
2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux, on Red Hat Enterprise Linux AS release 4 (Nahant)

How reproducible:
Always

Steps to Reproduce:
I am working on trying to write a simpler reproducer, but I have not yet succeeded. My attempts at reproducers have been very simplistic, but in the JRockit case we're doing a lot of thread stuff, some of which seems to be needed to reproduce this problem. I'll continue working on getting a simpler reproducer after the Easter holiday.


Additional info:

Comment 1 Magnus Ihse Bursie 2005-03-24 12:12:56 UTC
Sorry, made a duplicate posting.

(Weird bugzilla behaviour, didn't update my window when pressing submit, but
opened a new tab in Firefox. I didn't notice that... Please consider changing
this behaviour.)


*** This bug has been marked as a duplicate of 152012 ***