Bug 102682 - NPTL deadlocks from C/C++ application
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: glibc
Version: 3.0
Platform: i386 Linux
Priority: high  Severity: high
Assigned To: Ulrich Drepper
QA Contact: Brian Brock
Blocks: 97942
Reported: 2003-08-19 16:12 EDT by Mark Cavage
Modified: 2007-11-30 17:06 EST
CC: 4 users

Fixed In Version: 2.3.2-90
Doc Type: Bug Fix
Last Closed: 2003-09-22 18:32:16 EDT

Attachments
The source and compiled binaries for the test case described above (18.88 KB, application/octet-stream)
2003-08-19 16:16 EDT, Mark Cavage

Description Mark Cavage 2003-08-19 16:12:50 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529

Description of problem:
When using a multi-threaded application, it is fairly easy to get the
NPTL-based code to deadlock, whereas the same test case runs
indefinitely on linuxthreads. To verify that the problem is not in the
application itself, I ran the same application on an SMP Solaris machine
without problems.

I have a simple test case that can be used to demonstrate the failure.

Basically, the test case looks like this:

1 thread pushing requests onto a queue
15 threads pulling requests off the queue.

Note that neither of the thread pools does anything other than allocate
and free a pointer to the work request. It takes around 30 seconds to a minute
to freeze on an IBM x255 machine (four 1.6 GHz Xeon processors with 8 GB of
memory and a RAID 5 disk array).
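
For reference, a minimal sketch of that producer/consumer pattern using a
pthread mutex and condition variable follows. It is only an illustration of
the shape of the test, assuming a plain mutex/condvar-protected queue matches
the description above; the names (WorkItem, producer, worker, NUM_WORKERS)
are made up and are not taken from the attached source.

// A minimal sketch of the producer/consumer pattern described above.
// One producer pushes WorkItem pointers onto a mutex/condvar-protected
// queue and 15 workers pop and free them; the "work" is nothing more
// than the allocation and the free.  All names here are illustrative.
#include <pthread.h>
#include <unistd.h>
#include <cstdio>
#include <queue>

struct WorkItem { long id; };

static std::queue<WorkItem*> work_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

static void* producer(void*)
{
    for (long i = 0; ; ++i) {
        WorkItem* item = new WorkItem();
        item->id = i;

        pthread_mutex_lock(&queue_lock);
        work_queue.push(item);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);

        if (i % 100000 == 0)
            std::printf("enqueued %ld items\n", i);
    }
    return 0;
}

static void* worker(void*)
{
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (work_queue.empty())
            pthread_cond_wait(&queue_cond, &queue_lock);
        WorkItem* item = work_queue.front();
        work_queue.pop();
        pthread_mutex_unlock(&queue_lock);

        delete item;   // no real work, just allocate and free
    }
    return 0;
}

int main()
{
    const int NUM_WORKERS = 15;
    pthread_t tid;

    pthread_create(&tid, 0, producer, 0);
    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_create(&tid, 0, worker, 0);

    pause();   // run until the process deadlocks or is interrupted
    return 0;
}

Building something like this with g++ -pthread and letting it run should mimic
the workload; the attached archive contains the actual source and binaries.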

I have already run up2date to bring my glibc and kernel up to date.  uname -a gives:

Linux ldapdut009 2.4.21-1.1931.2.389.entbigmem #1 SMP Mon Aug 11 10:12:45 EDT
2003 i686 i686 i386 GNU/Linux

And /lib/libc.so.6 gives:
/lib/libc.so.6
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.2.3 20030502 (Red Hat Linux 3.2.3-13).
Compiled on a Linux 2.4.20 system on 2003-08-12.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Glibc-2.0 compatibility add-on by Cristian Gafton
        libthread_db work sponsored by Alpha Processor Inc
Thread-local storage support included.
Report bugs using the `glibcbug' script to <bugs@gnu.org>.

I originally came across this problem while benchmarking our LDAP server (IBM,
not OpenLDAP) on RHEL 3: under full CPU utilization the process would deadlock
after about 30 seconds.  When I looked at the mutex that was the culprit, its
__m_owner field was set to a thread that wasn't in my process.  Confused by
that, I tried to isolate the problem to thread contention on a lock by writing
the test case above (the test models our work dispatcher).
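
(Purely as an illustration of that check, and not something from the original
report: one way to confirm whether an owner id belongs to a process is to look
for it under that process's /proc/<pid>/task directory.  This assumes the owner
field holds a kernel LWP id, as NPTL records it, and that the kernel exposes
per-thread /proc entries; both are assumptions here.)

// Hypothetical helper: does LWP <lwp> belong to process <pid>?
// Assumes the kernel exposes per-thread entries under /proc/<pid>/task;
// the LWP id would be the owner value read out of the stuck mutex in gdb.
#include <sys/stat.h>
#include <cstdio>

int main(int argc, char** argv)
{
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <pid> <lwp>\n", argv[0]);
        return 2;
    }
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%s/task/%s", argv[1], argv[2]);
    struct stat st;
    if (::stat(path, &st) == 0)
        std::printf("LWP %s is a thread of process %s\n", argv[2], argv[1]);
    else
        std::printf("LWP %s is NOT a thread of process %s\n", argv[2], argv[1]);
    return 0;
}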

I haven't had the chance (or the hardware) to test this on the ppc or s390 platforms.


Version-Release number of selected component (if applicable):
glibc-2.3.2-70

How reproducible:
Always

Steps to Reproduce:
1. Start the test case that models the flow described above.
2. Watch it deadlock.
    

Actual Results:  Process becomes deadlocked on a mutex.

Expected Results:  Process should run indefinitely.

Additional info:

Here's a sample stack trace of one of the stuck threads:

Thread 7 (Thread 164772784 (LWP 10682)):
#0  0x007585ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0035313b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0xfef56e78 in ?? ()
#3  0x0804b800 in __JCR_LIST__ ()
#4  0x000029ba in ?? ()
#5  0x0035014a in pthread_mutex_lock () from /lib/tls/libpthread.so.0
#6  0x080497a5 in LDAP::Queue<WorkItem*>::deQueue(WorkItem**) ()
#7  0x08049602 in Worker::run() ()
#8  0x080493c1 in __run ()
#9  0x0034e9ea in start_thread () from /lib/tls/libpthread.so.0
#10 0x00f57247 in clone () from /lib/tls/libc.so.6
Comment 1 Mark Cavage 2003-08-19 16:16:27 EDT
Created attachment 93766
The source and compiled binaries for the test case described above

Note that when running this test case under NPTL the output messages are out
of sync; I was too lazy to put a lock around all the printf/cout calls.  It
doesn't matter, though: just run the test case and wait until the program
stops spewing text.
Comment 4 Ulrich Drepper 2003-09-08 17:47:23 EDT
I looked at it.  I can provide a work-around when necessary but would like to
avoid this.  I'll continue to look at it and if necessary get the work-around
committed and somewhat tested.  This bug is definitely at the top of my list.
Comment 11 Ulrich Drepper 2003-09-22 18:32:16 EDT
Thanks for the test case.  With the code in the new release I have not been
able to reproduce any lockups.
Comment 12 Mark Cavage 2003-09-23 14:16:35 EDT
Is the 'production'-level fix available in glibc-2.3.2-89 via RHN?  I looked
through the changelog but didn't see anything pertaining to this problem.  I
might not be looking in the right place, as I don't know the scope of the
fix, but I would like to verify that this problem is fixed.
Comment 13 Ulrich Drepper 2003-09-23 14:23:56 EDT
If you read the bugzilla page carefully you'll see that I explicitly noted that
the first version with the fix is 2.3.2-90 (see the "Fixed In" field).  I don't
know if or when this version will be available via Sushi; I hope it already is.
If it's not, that's outside my realm and you'll have to talk to your RH contact
about getting it.
