Description of problem:
Since installing fc3test2 I have begun to experience lockups when running Java programs in kaffe (the free JVM). It seems to be related to running many Java threads. I am blaming glibc because downgrading to glibc-2.3.3-36 solves the problem.

Version-Release number of selected component (if applicable):
glibc-2.3.3-53 and -55

How reproducible:
Not fully deterministic, but a simple test program from the kaffe regression test suite locks up in 9 cases out of 10.

Steps to Reproduce:
1. fresh fc2test1 install
2. fetch kaffe from cvs (http://kaffe.org/anoncvs.shtml)
3. run ./configure; make check

Actual results:
When the actual tests start, the run locks up with the last line printed being
PASS: ThreadInterrupt.java
(which means that the next test, ThreadState.java, has locked up)

Expected results:
0-2 failures and no lockups.

Additional info:
Is there a repository with old development glibc rpms (between fc3test1 and the present) somewhere? If so, I could do a binary search to find the change that caused this problem.
Try ftp://sunsite.mff.cuni.cz/private/oldglibcs/ There are only glibc-2*.i686.rpms there. You can test them e.g. by unpacking each one with rpm2cpio | cpio -id into a separate directory tree and pointing LD_LIBRARY_PATH there during the tests. I have tried to build CVS kaffe on my x86-64, but it didn't work at all.
The change is between 2.3.3-47 and 2.3.3-53. Testing with cpio was really simple and nice :) The build problems on x86-64 are probably a sign that you need to switch hardware with me :P
Added -48, -50 and -52 to the above URL.
Good. The problem occurs between -47 and -48. Anything more I can do? If you have .src.rpm's lying around somewhere, I should be able to isolate the specific change and perhaps understand it :)
Uploaded. If it is really the pthread_cond_* fix that breaks kaffe, it might very well be a kaffe bug too. If it is a glibc bug and you could create a small self-contained testcase reproducing the problem, it would make things far easier to understand and fix.
Ok, I have been able to reproduce this on an i686 box. So far it looks far more likely to be a kaffe bug than a glibc bug, but that's just a feeling. The thing is, if I LD_PRELOAD a small library which does int pthread_cond_destroy (void *x) { return 0; } (that's what pthread_cond_destroy was doing up to -47), it works even with current glibc. But even the current pthread_cond_destroy does nothing if there are no pending waiters at the time it is called (well, it acquires/releases the internal lock but that's it) and none show up after it has been called. The POSIX standard requires that pthread_cond_destroy only be called when there are no pending waiters. What the patch changed is the case where there were condvar waiters and some thread signalled or broadcast to them, but pthread_cond_destroy was called before they were scheduled.
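For anyone who wants to repeat the LD_PRELOAD experiment, the shim could look roughly like the sketch below. It is an illustration only: the file name nodestroy.c, the library name libnodestroy.so and the build/run commands in the comment are assumptions, and the prototype uses pthread_cond_t * so that it agrees with <pthread.h> (the one-liner above used void *).

```c
/*
 * nodestroy.c (hypothetical file name)
 *
 * Assumed build and use:
 *   gcc -shared -fPIC -o libnodestroy.so nodestroy.c
 *   LD_PRELOAD=./libnodestroy.so make check
 */
#include <pthread.h>

/*
 * Replacement that does nothing and reports success, mimicking the
 * behaviour described for glibc builds up to 2.3.3-47.
 * This only hides the problem for diagnosis; it is not a fix.
 */
int pthread_cond_destroy(pthread_cond_t *cond)
{
    (void)cond;   /* deliberately ignored */
    return 0;
}
```

If the lockups disappear with the shim preloaded, that points at the pthread_cond_destroy change, but it does not say which side (glibc or the caller) is at fault.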
Ok, have proof that it's kaffe's fault:

#0 0xb75eb782 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0xb751015e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0xb750dab3 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#3 0xb75ccd4c in jcondvar_wait (cv=0x8319a98, mux=0x8319a80, timeout=-1) at lock-impl.c:67

x/12w 0x8319a98
0x8319a98: 0x00000002 0x0000002c 0xffffffff 0xffffffff
0x8319aa8: 0x00000016 0x00000000 0x00000016 0x00000000
0x8319ab8: 0x08319a80 0x00000000 0x00000000 0x00000000

pthread_cond_t's __data.__total_seq field is only set to -1LL in pthread_cond_destroy, while holding condvar's internal lock. The above thread is waiting for the internal lock on entry to the pthread_cond_wait function, which means pthread_cond_destroy has been called on this condvar before this pthread_cond_wait call. But that is a bug. Kaffe should be fixed to only pthread_cond_destroy on condvars which are no longer actively used.
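As a rough illustration of "only destroy condvars which are no longer actively used", here is a minimal sketch of one safe ordering. It is not kaffe's code: struct demo_cv, waiter() and demo_safe_destroy() are made-up names, and joining the waiter thread is just one simple way to guarantee that nobody can still be inside, or about to enter, pthread_cond_wait.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wrapper, not kaffe's jcondvar type. */
struct demo_cv {
    pthread_mutex_t mux;
    pthread_cond_t  cv;
    bool            ready;   /* predicate protected by mux */
};

static void *waiter(void *arg)
{
    struct demo_cv *d = arg;

    pthread_mutex_lock(&d->mux);
    while (!d->ready)                      /* guard against spurious wakeups */
        pthread_cond_wait(&d->cv, &d->mux);
    pthread_mutex_unlock(&d->mux);
    return NULL;
}

/* Returns 0 if the condvar was destroyed cleanly, non-zero otherwise. */
int demo_safe_destroy(void)
{
    struct demo_cv d;
    pthread_t t;
    int rc;

    if (pthread_mutex_init(&d.mux, NULL) != 0)
        return -1;
    if (pthread_cond_init(&d.cv, NULL) != 0)
        return -1;
    d.ready = false;

    if (pthread_create(&t, NULL, waiter, &d) != 0)
        return -1;

    /* Wake the waiter. */
    pthread_mutex_lock(&d.mux);
    d.ready = true;
    pthread_cond_signal(&d.cv);
    pthread_mutex_unlock(&d.mux);

    /*
     * Join before destroying: once the waiter has exited, no thread can
     * be blocked in, returning from, or about to call pthread_cond_wait
     * on this condvar.
     */
    pthread_join(t, NULL);

    rc = pthread_cond_destroy(&d.cv);
    pthread_mutex_destroy(&d.mux);
    return rc;
}
```

Calling pthread_cond_destroy right after pthread_cond_signal, without the join, would recreate the kind of race described in this report: the waiter may not have been scheduled yet, or may call pthread_cond_wait only afterwards.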
This problem is fixed in kaffe cvs as of 04-10-01. Big thanks to Jakub for finding this problem.