Red Hat Bugzilla – Bug 102682
NPTL deadlocks from C/C++ application
Last modified: 2016-11-24 10:18:35 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529

Description of problem:
When using a multi-threaded application, it is fairly easy to get the NPTL-based code to deadlock, whereas the same test case will run indefinitely on linuxthreads. As a reference to verify it is not the application, I ran the same application on an SMP Solaris machine without problems.

I have a simple test case that can be used to demonstrate the failure. Basically, the test case looks like this:

1 thread pushing requests onto a queue
15 threads pulling requests off the queue

Note that neither of the thread pools does anything other than allocate and free a pointer to the work request.

It takes around 30 seconds to a minute to freeze on an IBM x255 machine (4 Xeon 1.6 GHz processors with 8 GB memory and a RAID 5 array of disks).

I already ran up2date to bring up my glibc and kernel levels.

uname -a gives:

    Linux ldapdut009 2.4.21-1.1931.2.389.entbigmem #1 SMP Mon Aug 11 10:12:45 EDT 2003 i686 i686 i386 GNU/Linux

And /lib/libc.so.6 gives:

    /lib/libc.so.6
    GNU C Library stable release version 2.3.2, by Roland McGrath et al.
    Copyright (C) 2003 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.
    There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
    PARTICULAR PURPOSE.
    Compiled by GNU CC version 3.2.3 20030502 (Red Hat Linux 3.2.3-13).
    Compiled on a Linux 2.4.20 system on 2003-08-12.
    Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Glibc-2.0 compatibility add-on by Cristian Gafton
        libthread_db work sponsored by Alpha Processor Inc
    Thread-local storage support included.
    Report bugs using the `glibcbug' script to <bugs@gnu.org>.
Originally I came across this problem by benchmarking our LDAP server (IBM, not openldap) on RHEL 3, but under full CPU utilization the process would deadlock after about 30 seconds. When I looked at the mutex that was the culprit, the __m_owner field was set to a thread that wasn't in my process. So, being very confused I tried to isolate the problem to thread contention for a lock by writing the test case above (the test models our work dispatcher).

I haven't had a chance/hardware resource to test this on ppc or s390 platforms.

Version-Release number of selected component (if applicable):
glibc-2.3.2-70

How reproducible:
Always

Steps to Reproduce:
1. Start test case that models above flow
2. Watch it deadlock.

Actual Results:
Process becomes deadlocked on a mutex.

Expected Results:
Process should run indefinitely.

Additional info:
Here's a sample stack trace of one of the stuck threads:

    Thread 7 (Thread 164772784 (LWP 10682)):
    #0  0x007585ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1  0x0035313b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
    #2  0xfef56e78 in ?? ()
    #3  0x0804b800 in __JCR_LIST__ ()
    #4  0x000029ba in ?? ()
    #5  0x0035014a in pthread_mutex_lock () from /lib/tls/libpthread.so.0
    #6  0x080497a5 in LDAP::Queue<WorkItem*>::deQueue(WorkItem**) ()
    #7  0x08049602 in Worker::run() ()
    #8  0x080493c1 in __run ()
    #9  0x0034e9ea in start_thread () from /lib/tls/libpthread.so.0
    #10 0x00f57247 in clone () from /lib/tls/libc.so.6
Created attachment 93766
The source and compiled binaries for the test case described above

Note that when running this test case under NPTL the output messages are out of sync, because I did not put a lock around the printf/cout calls. It doesn't matter though: just run the case and wait until the program stops printing text.
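For readers without access to the attachment, the pattern the report describes (one thread enqueuing work requests, a pool of threads dequeuing and freeing them) can be sketched roughly as follows. This is not the attached test case. The names `Queue`, `WorkItem` and `deQueue` echo the stack trace, but their bodies here are guesses, and `worker_run`, `WorkerArgs`, `run_dispatch_model`, the condition variable and the `close()` shutdown step are all assumptions added so the sketch terminates. The original test runs indefinitely.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for the work request; the real type is in the attachment.
struct WorkItem { int id; };

// Assumed design: one mutex plus one condition variable guarding a FIFO.
class Queue {
public:
    Queue() : closed_(false) {
        pthread_mutex_init(&lock_, NULL);
        pthread_cond_init(&nonEmpty_, NULL);
    }
    ~Queue() {
        pthread_cond_destroy(&nonEmpty_);
        pthread_mutex_destroy(&lock_);
    }
    void enQueue(WorkItem* item) {
        pthread_mutex_lock(&lock_);
        items_.push(item);
        pthread_cond_signal(&nonEmpty_);
        pthread_mutex_unlock(&lock_);
    }
    // Blocks until an item is available; returns false once closed and drained.
    bool deQueue(WorkItem** out) {
        pthread_mutex_lock(&lock_);
        while (items_.empty() && !closed_)
            pthread_cond_wait(&nonEmpty_, &lock_);
        bool got = !items_.empty();
        if (got) { *out = items_.front(); items_.pop(); }
        pthread_mutex_unlock(&lock_);
        return got;
    }
    // Assumption for this sketch only: lets the workers exit cleanly.
    void close() {
        pthread_mutex_lock(&lock_);
        closed_ = true;
        pthread_cond_broadcast(&nonEmpty_);
        pthread_mutex_unlock(&lock_);
    }
private:
    pthread_mutex_t lock_;
    pthread_cond_t nonEmpty_;
    std::queue<WorkItem*> items_;
    bool closed_;
};

struct WorkerArgs { Queue* queue; long processed; };

// Consumer: pull a request off the queue and free it, nothing else.
extern "C" void* worker_run(void* p) {
    WorkerArgs* args = static_cast<WorkerArgs*>(p);
    WorkItem* item = NULL;
    while (args->queue->deQueue(&item)) {
        delete item;
        ++args->processed;
    }
    return NULL;
}

// Producer runs on the calling thread; returns how many items the workers freed.
long run_dispatch_model(int workers, int items) {
    Queue queue;
    std::vector<pthread_t> threads(workers);
    std::vector<WorkerArgs> args(workers);
    for (int i = 0; i < workers; ++i) {
        args[i].queue = &queue;
        args[i].processed = 0;
        int rc = pthread_create(&threads[i], NULL, worker_run, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < items; ++i) {
        WorkItem* w = new WorkItem;
        w->id = i;
        queue.enQueue(w);
    }
    queue.close();
    long total = 0;
    for (int i = 0; i < workers; ++i) {
        pthread_join(threads[i], NULL);
        total += args[i].processed;
    }
    return total;
}
```

Calling `run_dispatch_model(15, N)` with a large `N` from a `main` function approximates the 1-producer, 15-consumer load in the report. If the threading library is healthy the call returns `N`. A hang inside `pthread_mutex_lock` would correspond to the symptom reported here, although this sketch has not been shown to reproduce the bug.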
I looked at it. I can provide a work-around when necessary but would like to avoid this. I'll continue to look at it and if necessary get the work-around committed and somewhat tested. This bug is definitely at the top of my list.
Thanks for the test case. With the code in the new release I haven't been able to produce any lockups anymore.
Is the 'production' level available in glibc-2.3.2-89 via RHN? I looked through the changelog but didn't see anything pertaining to this problem. I might not be looking in the right place, as I don't know the scope of the fix. But I would like to verify that this problem is fixed.
If you read the bugzilla page carefully you'll see that I explicitly noted that the first version with the fix is 2.3.2-90 (see the "Fixed In" field). I don't know if and when this version is available via Sushi. I hope it is already. If it's not, that's outside my realm and you'll have to talk to your RH contact to get it.