Bug 102709

Summary: NPTL pthread_cond_broadcast hangs.
Product: [Retired] Red Hat Linux
Reporter: Dennis <dennis>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED RAWHIDE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 9
CC: drepper, fweimer, riel
Hardware: i386   
OS: Linux   
Fixed In Version: 2.3.2-81
Doc Type: Bug Fix
Last Closed: 2003-09-08 07:55:06 UTC
Attachments (description, flags):
Test case for the NPTL issue. (flags: none)
The source code for the class that stalls. (flags: none)
Source code for pthread_cond_broadcast stall. (flags: none)

Description Dennis 2003-08-20 02:35:45 UTC
Description of problem:

On a uniprocessor kernel our application stalls in a
pthread_cond_broadcast call when using the NPTL threads
library. The same application works with LinuxThreads.

Version-Release number of selected component (if applicable):

Red Hat Linux 9 and the "severn" beta both exhibit this problem. Likewise,
kernel 2.6.0-test3 on a Red Hat Linux 9 base also fails.

How reproducible:

The stall occurs quite often on a number of different
machines. The machines need to be single-CPU i386 machines.


Steps to Reproduce:

I will attach a tar file to this bug containing our
application and a README file explaining the problem.

Thanks

Dennis.

Comment 1 Dennis 2003-08-20 02:37:40 UTC
Created attachment 93772 [details]
Test case for the NPTL issue.

Comment 2 Dennis 2003-08-21 06:32:22 UTC
Some extra information. We have just installed 'taroon' on a quad-CPU
Itanium 2 HP machine (900MHz RX5760). I get the same 'stall'
in the pthread_cond_broadcast function call when using
NPTL. So it looks like the problem is more a timing issue than an SMP
one. Apologies for any misleading information.

'taroon' uses NPTL version 0.52, kernel 2.4.21 and glibc 2.3.2-63.

Using gdb it should be noted that thread 3 is the one that is
stuck on the pthread_cond_broadcast function.

Hopefully the extra information is handy.

Dennis.


Comment 3 Jakub Jelinek 2003-08-26 08:19:58 UTC
Binaries in the testcase are not very helpful, because we cannot check whether
the locking is sane.
Can you create a small testcase (with complete source) which shows the same behaviour?

Comment 4 Dennis 2003-08-27 03:46:36 UTC
Jakub, I can understand your need for source code.

My original intent was to provide a simple test
case. However, when I started to strip away unrelated
functionality the problem went away. So the
problem appears to be quite timing-sensitive.

I will redouble my efforts to see whether
I can produce a more minimal case with source code.
This will be difficult and may take some time, if it
is achievable at all.

Until then I will attach the source code for the
actual class that gets stuck. The BtsProcessor
class is instantiated, its destructor gets called,
it tries to broadcast to a condition variable that
is being waited on in a timed wait, and it never comes
back. The BtsProcessor class is the only class compiled
into the libWSMBoots.so library. It is not dependent on
any other threads or related mutexes/condition variables.
Rather the architecture is the other way around. The
interaction of the object and the application causes
the problem. In a simple application the BtsProcessor
destructor always works.

I know that this will not be satisfactory. At the moment
my minimal test case is 50,000 lines of C++. Not very
small. The architecture for our application is complex,
but I believe it is sound: it has worked for many years
on Solaris, standard Linux and Windows.

However you need source code. I will be in touch.

Dennis.






Comment 5 Dennis 2003-08-27 03:51:03 UTC
Created attachment 93969 [details]
The source code for the class that stalls.

I will be trying to cut down our application so
that a minimal test case (including source)
can be provided.

However this may take a while. So I have provided
the source for the class that stalls.

Dennis.

Comment 6 Dennis 2003-09-01 05:24:19 UTC
A minimal test case with source, highlighting the issue, has
been successfully produced. The new attachment should
showcase the issue.

Hopefully this helps you guys ascertain what is going on
with NPTL.

Dennis.

Comment 7 Dennis 2003-09-01 05:27:10 UTC
Created attachment 94110 [details]
Source code for pthread_cond_broadcast stall.

Hopefully the supplied test case hangs in a similar way
at your place as it does here.

Dennis.

Comment 8 Jakub Jelinek 2003-09-01 09:55:28 UTC
Primarily there is a bug in your testcase. When libsupport.so uses pthread_create,
pthread_cond_timedwait etc., it must be linked with -lpthread (so that the right
symbol versions are assigned to it, among other things).
Plus there is a glibc problem, in that glibc doesn't handle this too well, see
http://sources.redhat.com/ml/libc-hacker/2003-09/msg00002.html
No matter what, please fix your application. E.g. pass -Wl,-z,defs to gcc during
every link and it will give you hard errors any time you miss needed dependencies.
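
As an illustration of the linking advice above, here is a minimal sketch of a library source file with the suggested link command in its header comment. The file name support.c and the functions support_worker and support_run are hypothetical stand-ins, not the contents of the attached libsupport.so.

```c
/* support.c: hypothetical stand-in for a library like libsupport.so.
 *
 * Link the shared object with its pthread dependency recorded and
 * with unresolved symbols treated as errors:
 *
 *   gcc -shared -fPIC -o libsupport.so support.c -Wl,-z,defs -lpthread
 *
 * On a glibc where the thread functions live in libpthread (as in
 * this report), omitting -lpthread from that command should then
 * fail at link time instead of producing a library that binds to
 * the wrong symbols at run time. The recorded dependencies can be
 * inspected with: readelf -d libsupport.so
 */
#include <pthread.h>

/* Thread body: increments the integer it is handed. */
static void *support_worker(void *arg)
{
    int *value = arg;

    *value += 1;
    return NULL;
}

/* Runs one worker thread to completion. Returns 0 on success,
 * otherwise the error code from pthread_create or pthread_join. */
int support_run(int *value)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, support_worker, value);

    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}
```

The point of -z defs is that a shared object is otherwise allowed to contain undefined references, so a missing -lpthread goes unnoticed until run time.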

Comment 9 Dennis 2003-09-02 02:18:03 UTC
Your suggestion does indeed work for our application, very very good.

Apologies for taking up your time, I do feel a little sheepish. However 
the changes to glibc are a good end result for all (others in future 
will not be caught out like we were).

Thanks to all the Red Hat engineers. As usual excellent support has 
been provided.

I leave you with one little tidbit. I carried out a database load of XML
data into our TeraText Content Server (it's a database server). With
LinuxThreads the load took 28 minutes; with NPTL the load took 21 minutes.
That's a nice 25% improvement for a real-world application!

This bug can be closed. Thanks again (good gcc tip as well).

Dennis.



Comment 10 Jakub Jelinek 2003-09-08 07:55:06 UTC
pthread_cond_timedwait stubs in libc.so are in glibc-2.3.2-{81,82}.

Comment 11 Dennis 2003-09-09 00:37:06 UTC
Thanks Jakub (and fellow Red Hat Engineers), we really do appreciate
the help you have provided. 

We keenly await the new Red Hat Enterprise Linux for Itanium (due in the 
next few months), that is going to be a new platform for our software.
Exciting times.

Thanks.

Dennis.