Bug 102682

Summary: NPTL deadlocks from C/C++ application
Product: Red Hat Enterprise Linux 3
Reporter: Mark Cavage <mcavage>
Component: glibc
Assignee: Ulrich Drepper <drepper>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 3.0
CC: bennet, drepper, flanagan, fweimer, roland
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: 2.3.2-90
Doc Type: Bug Fix
Last Closed: 2003-09-22 22:32:16 UTC
Bug Blocks: 97942    
Attachments:
The source and compiled binaries for the test case described above (flags: none)

Description Mark Cavage 2003-08-19 20:12:50 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529

Description of problem:
When using a multi-threaded application, it is fairly easy to get the
NPTL-based code to deadlock, whereas the same test case runs
indefinitely on linuxthreads. To verify that the application is not at
fault, I ran the same application on an SMP Solaris machine without
problems.

I have a simple test case that can be used to demonstrate the failure.

Basically, the test case looks like this:

1 thread pushing requests onto a queue
15 threads pulling requests off the queue.

Note that neither thread pool does anything other than allocate
and free a pointer to the work request. It takes around 30 seconds to a minute
to freeze on an IBM x255 machine (4 Xeon 1.6 GHz processors with 8 GB memory and a
RAID 5 array of disks).
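
For readers without the attachment, the flow described above can be sketched roughly as follows. This is not the attached test case: the names WorkItem, Queue, enQueue/deQueue, worker_main and run_model are illustrative (loosely following the symbols in the stack trace further down), and the item count is bounded so the sketch terminates, whereas the real test runs until it hangs.

```cpp
#include <pthread.h>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical work request; the real WorkItem is not shown in this report.
struct WorkItem { int id; };

// Minimal blocking queue guarded by one mutex and one condition variable.
template <typename T>
class Queue {
public:
    Queue() {
        pthread_mutex_init(&mutex_, NULL);
        pthread_cond_init(&nonEmpty_, NULL);
    }
    ~Queue() {
        pthread_cond_destroy(&nonEmpty_);
        pthread_mutex_destroy(&mutex_);
    }
    void enQueue(T item) {
        pthread_mutex_lock(&mutex_);
        items_.push(item);
        pthread_cond_signal(&nonEmpty_);
        pthread_mutex_unlock(&mutex_);
    }
    void deQueue(T* out) {
        pthread_mutex_lock(&mutex_);
        while (items_.empty())              // guard against spurious wakeups
            pthread_cond_wait(&nonEmpty_, &mutex_);
        *out = items_.front();
        items_.pop();
        pthread_mutex_unlock(&mutex_);
    }
private:
    std::queue<T> items_;
    pthread_mutex_t mutex_;
    pthread_cond_t nonEmpty_;
};

struct Shared {
    Queue<WorkItem*> queue;
    pthread_mutex_t countLock;
    long processed;
};

// Worker: pull a request, free it, repeat. A null pointer is the stop signal.
extern "C" void* worker_main(void* arg) {
    Shared* shared = static_cast<Shared*>(arg);
    for (;;) {
        WorkItem* item = NULL;
        shared->queue.deQueue(&item);
        if (item == NULL) break;
        delete item;
        pthread_mutex_lock(&shared->countLock);
        ++shared->processed;
        pthread_mutex_unlock(&shared->countLock);
    }
    return NULL;
}

// The calling thread is the single producer; n_workers threads consume.
// Returns the number of requests the workers freed.
long run_model(int n_items, int n_workers) {
    Shared shared;
    pthread_mutex_init(&shared.countLock, NULL);
    shared.processed = 0;

    std::vector<pthread_t> workers(n_workers);
    for (int i = 0; i < n_workers; ++i)
        pthread_create(&workers[i], NULL, worker_main, &shared);

    for (int i = 0; i < n_items; ++i) {
        WorkItem* item = new WorkItem;
        item->id = i;
        shared.queue.enQueue(item);
    }
    for (int i = 0; i < n_workers; ++i)
        shared.queue.enQueue(NULL);         // one stop signal per worker

    for (int i = 0; i < n_workers; ++i)
        pthread_join(workers[i], NULL);
    pthread_mutex_destroy(&shared.countLock);
    return shared.processed;
}
```

Built with -pthread, a call such as run_model(1000000, 15) mirrors the 1 producer / 15 consumer layout. Whether it reproduces the hang depends on the glibc and kernel in use.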

I already ran up2date to bring up my glibc and kernel levels.  uname -a gives:

Linux ldapdut009 2.4.21-1.1931.2.389.entbigmem #1 SMP Mon Aug 11 10:12:45 EDT
2003 i686 i686 i386 GNU/Linux

And /lib/libc.so.6 gives:
/lib/libc.so.6
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.2.3 20030502 (Red Hat Linux 3.2.3-13).
Compiled on a Linux 2.4.20 system on 2003-08-12.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Glibc-2.0 compatibility add-on by Cristian Gafton
        libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
Report bugs using the `glibcbug' script to <bugs>.
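
Since the banner above lists linuxthreads as an extension while the stack trace below goes through /lib/tls/libpthread.so.0, it can help to ask the running process which thread library it actually uses. A minimal sketch, assuming glibc headers are available (gnu_get_libc_version and the _CS_GNU_LIBPTHREAD_VERSION confstr name are glibc extensions; the helper names libc_version and thread_library_version are made up here):

```cpp
#include <gnu/libc-version.h>
#include <unistd.h>
#include <cstddef>
#include <string>
#include <vector>

// glibc release string as reported by the loaded C library.
std::string libc_version() {
    return gnu_get_libc_version();
}

// Name and version of the thread library in use, or "unknown" if the
// C library does not support this confstr name.
std::string thread_library_version() {
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    size_t len = confstr(_CS_GNU_LIBPTHREAD_VERSION, NULL, 0);
    if (len == 0) return "unknown";
    std::vector<char> buf(len);             // len includes the trailing NUL
    confstr(_CS_GNU_LIBPTHREAD_VERSION, &buf[0], len);
    return std::string(&buf[0]);
#else
    return "unknown";
#endif
}
```

Printing both strings at startup of the test program records which implementation a given run exercised. The thread library string typically names either NPTL or linuxthreads.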

I originally came across this problem while benchmarking our LDAP server (IBM, not
OpenLDAP) on RHEL 3: under full CPU utilization the process would deadlock
after about 30 seconds.  When I looked at the mutex that was the culprit, the
__m_owner field was set to a thread that wasn't in my process.  Being very
confused, I tried to isolate the problem to thread contention for a lock by
writing the test case above (the test models our work dispatcher).

I haven't had a chance/hardware resource to test this on ppc or s390 platforms.


Version-Release number of selected component (if applicable):
glibc-2.3.2-70

How reproducible:
Always

Steps to Reproduce:
1. Start test case that models above flow
2. Watch it deadlock.
    

Actual Results:  Process becomes deadlocked on a mutex.

Expected Results:  Process should run indefinitely.

Additional info:

Here's a sample stack trace of one of the stuck threads:

Thread 7 (Thread 164772784 (LWP 10682)):
#0  0x007585ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0035313b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0xfef56e78 in ?? ()
#3  0x0804b800 in __JCR_LIST__ ()
#4  0x000029ba in ?? ()
#5  0x0035014a in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#6  0x080497a5 in LDAP::Queue<WorkItem*>::deQueue(WorkItem**) ()
#7  0x08049602 in Worker::run() ()
#8  0x080493c1 in __run ()
#9  0x0034e9ea in start_thread () from /lib/tls/libpthread.so.0
#10 0x00f57247 in clone () from /lib/tls/libc.so.6

Comment 1 Mark Cavage 2003-08-19 20:16:27 UTC
Created attachment 93766 [details]
The source and compiled binaries for the test case described above

Note that when running this test case under NPTL the output messages are out
of sync.  I was too lazy to put a lock around all the printf's/cout's.  It
doesn't matter though, just run the case and wait until the program stops
spewing text.
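
For anyone who wants readable output, the missing lock could look roughly like the sketch below. It is not part of the attachment; g_printLock and locked_print are hypothetical names. Each stdio call is already atomic on its own in glibc, so the lock matters mainly when one message is built from several calls.

```cpp
#include <pthread.h>
#include <cstdio>

// One process-wide lock serializing all diagnostic output.
static pthread_mutex_t g_printLock = PTHREAD_MUTEX_INITIALIZER;

// Print a tagged message as one uninterrupted line.
// Returns the total number of characters written by both printf calls.
int locked_print(int worker_id, const char* msg) {
    pthread_mutex_lock(&g_printLock);
    int written = std::printf("[worker %d] ", worker_id);
    written += std::printf("%s\n", msg);
    pthread_mutex_unlock(&g_printLock);
    return written;
}
```

Note that the extra lock adds contention of its own and may change the timing of the test.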

Comment 4 Ulrich Drepper 2003-09-08 21:47:23 UTC
I looked at it.  I can provide a work-around when necessary but would like to
avoid this.  I'll continue to look at it and if necessary get the work-around
committed and somewhat tested.  This bug is definitely at the top of my list.

Comment 11 Ulrich Drepper 2003-09-22 22:32:16 UTC
Thanks for the test case.  With the code in the new release I haven't been able
to produce any lockups anymore.

Comment 12 Mark Cavage 2003-09-23 18:16:35 UTC
Is the 'production' level available in the glibc-2.3.2-89 via RHN?  I looked 
through the changelog but didn't see anything pertaining to this problem.  I 
might not be looking in the right place, as I don't know the scope of the 
fix.  But I would like to verify this problem is fixed.

Comment 13 Ulrich Drepper 2003-09-23 18:23:56 UTC
If you read the bugzilla page carefully you'll see that I explicitly noted that
the first version with the fix is 2.3.2-90 (see the "Fixed In" field).  I don't
know if and when this version is available via Sushi.  I hope it is already.  If
it's not, that's not in my realm and you'll have to talk to your RH contact to
try getting it.