133638 – thread related lockups when running kaffe

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-09-25 14:22 UTC by Noa Resare
Modified:	2007-11-30 22:10 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-25 21:19:32 UTC
Type:	---
Embargoed:
Dependent Products:

Description Noa Resare 2004-09-25 14:22:06 UTC

Description of problem:

Since installing fc3test2 I have begun to experience lockups when
running Java programs in kaffe (the free JVM). It seems to be related
to running a bunch of Java threads.

The reason that I'm blaming glibc is that downgrading to
glibc-2.3.3-36 solves the problems.

Version-Release number of selected component (if applicable):
glibc-2.3.3-53 and -55

How reproducible:
not totally deterministic, but a simple test program from the kaffe
regression test suite locks up in 9 cases out of 10.

Steps to Reproduce:
1. fresh fc3test2 install
2. fetch kaffe from cvs (http://kaffe.org/anoncvs.shtml)
3. run ./configure; make check
  
Actual results:
When the actual tests start it locks with the last line printed being
PASS: ThreadInterrupt.java
(that means that the next test, ThreadState.java has locked up)

Expected results:
0-2 failures and no lockups.

Additional info:
Is there any repository with old (between fc3test1 and present)
development glibc rpms somewhere? If so I could do some binary
searching trying to find the change that caused this problem

Comment 1 Jakub Jelinek 2004-09-25 16:03:12 UTC

Try ftp://sunsite.mff.cuni.cz/private/oldglibcs/
Only glibc-2*.i686.rpm packages are there; you can test them e.g. by unpacking
them with rpm2cpio | cpio -id into separate directory trees and pointing
LD_LIBRARY_PATH there during the tests.

I have tried to build CVS kaffe on my x86-64, but it didn't work at all.

Comment 2 Noa Resare 2004-09-25 16:27:32 UTC

The change is between 2.3.3-47 and 2.3.3-53. Testing with cpio was
really simple and nice :)

The build problems on x86-64 are probably a sign that you need to switch
hardware with me :P

Comment 3 Jakub Jelinek 2004-09-25 16:52:09 UTC

Added -48, -50 and -52 to the above URL.

Comment 4 Noa Resare 2004-09-25 17:57:45 UTC

Good. The problem occurs between -47 and -48. Anything more I can do?

If you have .src.rpms lying around somewhere I should be able to
isolate the specific change and perhaps understand it :)

Comment 5 Jakub Jelinek 2004-09-25 18:21:51 UTC

Uploaded.  If it is really the pthread_cond_* fix that breaks kaffe,
it might very well be a kaffe bug too.
If it is a glibc bug and you could create a small self-contained testcase
reproducing the problem, it would make things far easier to understand and fix.

Comment 6 Jakub Jelinek 2004-09-25 21:00:32 UTC

Ok, I have been able to reproduce this on an i686 box.
So far it looks far more likely to be a kaffe bug than a glibc bug,
but that's just a feeling.

The thing is, if I LD_PRELOAD a small library which does
int pthread_cond_destroy (void *x) { return 0; }
(that's what pthread_cond_destroy was doing up to -47), it works even
with current glibc.
But even the current pthread_cond_destroy does nothing if there are no
pending waiters at the time it is called and none show up afterwards
(well, it acquires/releases the internal lock, but that's it).
The POSIX standard requires that pthread_cond_destroy be called only
when there are no pending waiters.
What the patch changed is the case where there were condvar waiters and
some thread signalled or broadcast to them, but pthread_cond_destroy was
called before they were scheduled.
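
For readers who want to repeat the LD_PRELOAD experiment, the one-line stub above can be written out as a small source file. This is only a sketch: the file name, library name and build/run commands in the comment are assumptions and are not taken from this report.

```c
/* noop_destroy.c (hypothetical file name)
 *
 * Assumed build and use, not quoted from the report:
 *   gcc -shared -fPIC -o libnoopdestroy.so noop_destroy.c
 *   LD_PRELOAD=./libnoopdestroy.so <command under test>
 */
#include <pthread.h>

/* Replaces the C library's pthread_cond_destroy with a no-op, which is
 * how the function behaved up to glibc-2.3.3-47 according to comment 6.
 * The condvar is deliberately left untouched. */
int pthread_cond_destroy(pthread_cond_t *cond)
{
    (void)cond;
    return 0;
}
```

If the lockup disappears with the stub preloaded, that points at the destroy call rather than at wait/signal themselves. It does not show which side is at fault.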

Comment 7 Jakub Jelinek 2004-09-25 21:19:32 UTC

Ok, I have proof that it's kaffe's fault:
#0  0xb75eb782 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb751015e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0xb750dab3 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#3  0xb75ccd4c in jcondvar_wait (cv=0x8319a98, mux=0x8319a80, timeout=-1) at lock-impl.c:67
x/12w 0x8319a98
0x8319a98:      0x00000002      0x0000002c      0xffffffff      0xffffffff
0x8319aa8:      0x00000016      0x00000000      0x00000016      0x00000000
0x8319ab8:      0x08319a80      0x00000000      0x00000000      0x00000000

pthread_cond_t's __data.__total_seq field is only set to -1LL in pthread_cond_destroy,
while holding the condvar's internal lock.  The above thread is waiting
for the internal lock on entry to the pthread_cond_wait function,
which means pthread_cond_destroy has been called on this condvar
before this pthread_cond_wait call.  Waiting on a condvar that has already
been destroyed is a bug in the caller.
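
The reading of the memory dump can be sketched in code. Only __total_seq is named in this report; the struct below, its other field names and their order are an assumed view of the i386 condvar internals of that era, used purely for illustration. Under that assumption the ninth word (0x08319a80) lines up with the mux argument shown in frame #3, which is at least consistent with the layout.

```c
#include <stdint.h>

/* ASSUMED field order; only total_seq corresponds to a name used in the
 * report (__data.__total_seq). Everything else is illustrative. */
struct cond_view {
    int32_t  lock;        /* internal lock word */
    uint32_t futex;
    uint64_t total_seq;
    uint64_t wakeup_seq;
    uint64_t woken_seq;
    uint32_t mutex;       /* 32-bit address of the associated mutex */
};

/* First nine words printed by "x/12w 0x8319a98" in the report. */
const uint32_t dump_words[9] = {
    0x00000002, 0x0000002c, 0xffffffff, 0xffffffff,
    0x00000016, 0x00000000, 0x00000016, 0x00000000,
    0x08319a80
};

/* Combine two little-endian 32-bit words into one 64-bit value. */
static uint64_t join64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

struct cond_view decode_cond(const uint32_t w[9])
{
    struct cond_view v;
    v.lock       = (int32_t)w[0];
    v.futex      = w[1];
    v.total_seq  = join64(w[2], w[3]);
    v.wakeup_seq = join64(w[4], w[5]);
    v.woken_seq  = join64(w[6], w[7]);
    v.mutex      = w[8];
    return v;
}

/* total_seq == -1LL is the marker that, per the report, only
 * pthread_cond_destroy writes. */
int looks_destroyed(const struct cond_view *v)
{
    return v->total_seq == UINT64_MAX;
}
```

Calling decode_cond(dump_words) and passing the result to looks_destroyed reports the condvar as destroyed, matching the conclusion above.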

Kaffe should be fixed to call pthread_cond_destroy only on condvars which
are no longer actively used.
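
One way to satisfy that rule is to make sure every thread that might wait on the condvar has finished before it is destroyed. The sketch below is not kaffe's actual fix; struct gate, waiter, run_gate_demo and NUM_WAITERS are hypothetical names, and error handling is kept minimal.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WAITERS 4

struct gate {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             open;    /* predicate, guarded by mu */
    int             passed;  /* waiters that got through, guarded by mu */
};

static void *waiter(void *arg)
{
    struct gate *g = arg;

    pthread_mutex_lock(&g->mu);
    while (!g->open)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&g->cv, &g->mu);
    g->passed++;
    pthread_mutex_unlock(&g->mu);
    return NULL;
}

/* Returns the number of waiters that passed the gate. */
int run_gate_demo(void)
{
    struct gate g;
    pthread_t tids[NUM_WAITERS];
    int i, started = 0;

    pthread_mutex_init(&g.mu, NULL);
    pthread_cond_init(&g.cv, NULL);
    g.open = 0;
    g.passed = 0;

    for (i = 0; i < NUM_WAITERS; i++) {
        if (pthread_create(&tids[i], NULL, waiter, &g) != 0)
            break;
        started++;
    }

    pthread_mutex_lock(&g.mu);
    g.open = 1;
    pthread_cond_broadcast(&g.cv);
    pthread_mutex_unlock(&g.mu);

    /* Join first: after this loop no thread can still be inside, or
     * about to enter, pthread_cond_wait on g.cv. */
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    /* Only now is the condvar no longer actively used. */
    pthread_cond_destroy(&g.cv);
    pthread_mutex_destroy(&g.mu);
    return g.passed;
}
```

Moving the pthread_cond_destroy call above the join loop would recreate the situation in the backtrace: a thread that has not yet been scheduled could enter pthread_cond_wait on a condvar that was already destroyed.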

Comment 8 Noa Resare 2004-10-04 19:17:27 UTC

This problem is fixed in kaffe CVS as of 2004-10-01. Big thanks to Jakub
for finding this problem.
