From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529
Description of problem:
When using a multi-threaded application, it is fairly easy to get the
NPTL-based code to deadlock, whereas the same test case will run
indefinitely on linuxthreads. As a reference to verify it is not the
application, I ran the same application on an SMP Solaris machine without
problems.
I have a simple test case that can be used to demonstrate the failure.
Basically, the test case looks like this:
1 thread pushing requests onto a queue
15 threads pulling requests off the queue.
Note that neither of the thread pools does anything other than allocate
and free a pointer to the work request. It takes around 30 seconds to a minute
to freeze on an IBM x255 machine (4 Xeon 1.6 GHz processors with 8 GB memory and a
RAID 5 array of disks).
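For readers without the attachment, the flow described above can be sketched roughly as follows. This is only an illustration, not the attached test case: `WorkItem`, `WorkQueue`, `worker_main` and `run_queue_test` are hypothetical names, the calling thread plays the role of the single producer, and the run is bounded by a request count so that it terminates, whereas the real test runs until it hangs.

```cpp
#include <pthread.h>
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for the work request; it is only allocated and freed.
struct WorkItem { int id; };

// Queue shared by all threads, guarded by one mutex and one condition variable.
struct WorkQueue {
    pthread_mutex_t lock;
    pthread_cond_t nonEmpty;
    std::queue<WorkItem*> items;
    long consumed;  // number of real requests pulled off the queue

    WorkQueue() : consumed(0) {
        pthread_mutex_init(&lock, nullptr);
        pthread_cond_init(&nonEmpty, nullptr);
    }
    ~WorkQueue() {
        pthread_cond_destroy(&nonEmpty);
        pthread_mutex_destroy(&lock);
    }
};

// Consumer thread: pull requests until a null sentinel arrives.
extern "C" void* worker_main(void* arg) {
    WorkQueue* q = static_cast<WorkQueue*>(arg);
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->items.empty())
            pthread_cond_wait(&q->nonEmpty, &q->lock);
        WorkItem* item = q->items.front();
        q->items.pop();
        if (item != nullptr) ++q->consumed;
        pthread_mutex_unlock(&q->lock);
        if (item == nullptr) return nullptr;  // sentinel: stop this worker
        delete item;                          // the only "work" done
    }
}

// Start `consumers` workers, push `requests` items from the calling thread,
// then push one sentinel per worker and wait for all of them to exit.
// Returns the number of requests the workers consumed.
long run_queue_test(int consumers, int requests) {
    WorkQueue q;
    std::vector<pthread_t> threads(consumers);
    for (int i = 0; i < consumers; ++i) {
        int rc = pthread_create(&threads[i], nullptr, worker_main, &q);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < requests + consumers; ++i) {
        WorkItem* item = nullptr;             // sentinel by default
        if (i < requests) {
            item = new WorkItem;
            item->id = i;
        }
        pthread_mutex_lock(&q.lock);
        q.items.push(item);
        pthread_mutex_unlock(&q.lock);
        pthread_cond_signal(&q.nonEmpty);
    }
    for (int i = 0; i < consumers; ++i)
        pthread_join(threads[i], nullptr);
    return q.consumed;
}
```

Calling `run_queue_test(15, N)` from `main()` with a large `N` approximates the 1-producer/15-consumer setup; on a healthy thread library it should return `N`, and a hang inside `pthread_mutex_lock` would correspond to the failure reported here. Build with `-pthread`.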
I already ran up2date to bring up my glibc and kernel levels. uname -a gives:
Linux ldapdut009 2.4.21-1.1931.2.389.entbigmem #1 SMP Mon Aug 11 10:12:45 EDT
2003 i686 i686 i386 GNU/Linux
And /lib/libc.so.6 gives:
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.2.3 20030502 (Red Hat Linux 3.2.3-13).
Compiled on a Linux 2.4.20 system on 2003-08-12.
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
linuxthreads-0.10 by Xavier Leroy
The C stubs add-on version 2.1.2.
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Glibc-2.0 compatibility add-on by Cristian Gafton
libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
Report bugs using the `glibcbug' script to <firstname.lastname@example.org>.
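Since the banner above comes from /lib/libc.so.6 while the stuck threads are in /lib/tls/libpthread.so.0, it can help to ask the running process which thread library it actually loaded. The following is a small sketch, assuming a glibc system (both calls are glibc extensions); the helper names `libc_version` and `threading_library` are made up for illustration.

```cpp
#include <gnu/libc-version.h>  // gnu_get_libc_version (glibc extension)
#include <unistd.h>            // confstr, _CS_GNU_LIBPTHREAD_VERSION
#include <cstddef>
#include <string>

// glibc release string as reported by the C library in use at run time.
std::string libc_version() {
    return std::string(gnu_get_libc_version());
}

// Name and version of the thread library in use at run time.
// Returns an empty string if the value is unavailable or does not fit.
std::string threading_library() {
    char buf[128];
    std::size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof buf);
    if (n == 0 || n > sizeof buf)
        return std::string();
    return std::string(buf);
}
```

Printing both strings at the start of the test case would show whether a given run was using NPTL or linuxthreads.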
Originally I came across this problem while benchmarking our LDAP server (IBM, not
OpenLDAP) on RHEL 3: under full CPU utilization the process would deadlock
after about 30 seconds. When I looked at the mutex that was the culprit, the
__m_owner field was set to a thread that wasn't in my process. Being very
confused, I tried to isolate the problem to thread contention for a lock by
writing the test case above (the test models our work dispatcher).
I haven't had a chance/hardware resource to test this on ppc or s390 platforms.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start test case that models above flow
2. Watch it deadlock.
Actual Results: Process becomes deadlocked on a mutex.
Expected Results: Process should run indefinitely.
Here's a sample stack trace of one of the stuck threads:
Thread 7 (Thread 164772784 (LWP 10682)):
#0 0x007585ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0035313b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0xfef56e78 in ?? ()
#3 0x0804b800 in __JCR_LIST__ ()
#4 0x000029ba in ?? ()
#5 0x0035014a in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#6 0x080497a5 in LDAP::Queue<WorkItem*>::deQueue(WorkItem**) ()
#7 0x08049602 in Worker::run() ()
#8 0x080493c1 in __run ()
#9 0x0034e9ea in start_thread () from /lib/tls/libpthread.so.0
#10 0x00f57247 in clone () from /lib/tls/libc.so.6
Created attachment 93766
The source and compiled binaries for the test case described above
Note that when running this test case under NPTL the output messages are out
of sync. I was too lazy to put a lock around all the printf's/cout's. It doesn't
matter though, just run the case and wait until the program stops spewing text.
I looked at it. I can provide a work-around when necessary but would like to
avoid this. I'll continue to look at it and if necessary get the work-around
committed and somewhat tested. This bug is definitely at the top of my list.
Thanks for the test case. With the code in the new release I haven't been able
to reproduce any lockups.
Is the 'production' level available in glibc-2.3.2-89 via RHN? I looked
through the changelog but didn't see anything pertaining to this problem. I
might not be looking in the right place, as I don't know the scope of the
fix. But I would like to verify this problem is fixed.
If you read the bugzilla page carefully you'll see that I explicitly noted that
the first version with the fix is 2.3.2-90 (see the "Fixed In" field). I don't
know if and when this version is available via Sushi. I hope it is already. If
it's not, that's outside my realm and you'll have to talk to your RH contact to
try to get it.