From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

Description of problem:
The attached test program hangs when run on a dual Xeon 2.4 GHz box. The main thread (and some of the worker threads) blocks in futex_wait, waiting to acquire the mutex "mtx", which is unlocked. Attaching and detaching a debugger causes the program to continue, as does sending the process a STOP and then a CONT signal.

Version-Release number of selected component (if applicable):
glibc-2.3.2-95.6

How reproducible:
Always

Steps to Reproduce:
1. Compile the attached program with cc -o cvtest cvtest.c -lpthread
2. In one window, run a server process: ./cvtest -s
3. In the other window, run the test client: ./cvtest -b

Actual Results:
The test client will hang within minutes. Attach a debugger and examine the main thread--it will be in the futex syscall, inside __lll_mutex_lock_wait. The futex for the associated mutex will have value 0.

Expected Results:
The test client continues to print '.' characters. If you examine the worker threads, you will find some also hanging in the futex syscall for the same unlocked futex.

Additional info:
If you instead run cvtest with no arguments, so that it never uses pthread_cond_broadcast(), it will not hang.
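The STOP/CONT workaround described above can be scripted while waiting for a fix. The following is a minimal sketch, not part of cvtest.c: the name nudge_process is hypothetical, and the explanation that the signals help by interrupting the blocked futex syscall is an assumption that this report does not confirm.

```c
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send SIGSTOP followed by SIGCONT to the given
 * process, mirroring the manual workaround from the report.
 * Returns 0 if both signals were delivered, -1 if kill() failed
 * (errno is set by kill()).
 */
int nudge_process(pid_t pid)
{
    if (kill(pid, SIGSTOP) != 0)
        return -1;
    return kill(pid, SIGCONT);
}
```

Calling nudge_process() with the PID of the hung ./cvtest -b client should have the same effect as sending the two signals by hand. It only unsticks the process and does nothing about the underlying bug.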
Created attachment 97574: Test program
Kernel is 2.4.21-9.ELsmp
Could you please try the packages at ftp://people.redhat.com/jakub/glibc/errata/2.3.2-95.10/ ? These packages have FUTEX_REQUEUE temporarily disabled.
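For readers unfamiliar with the operation being disabled, here is a rough sketch of the difference between a requeue-based broadcast and a plain wake-all. It is an illustration only, not the glibc implementation: the helper names are hypothetical, the code is Linux-specific, and it assumes that a broadcast without FUTEX_REQUEUE simply wakes every waiter on the condition variable's futex.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: wake up to nr_wake waiters blocked on `from`
 * and move up to nr_move of the remaining waiters onto `to`.
 * Returns the kernel's result (number of waiters woken) or -1.
 */
long futex_requeue(int *from, int nr_wake, int nr_move, int *to)
{
    return syscall(SYS_futex, from, FUTEX_REQUEUE, (long) nr_wake,
                   (long) nr_move, to, 0L);
}

/* Hypothetical wrapper: wake up to nr_wake waiters blocked on addr. */
long futex_wake(int *addr, int nr_wake)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, (long) nr_wake,
                   NULL, NULL, 0L);
}

/*
 * Requeue-style broadcast: wake one waiter and move the rest onto the
 * mutex futex, so they do not all wake up only to contend for the mutex.
 */
long broadcast_with_requeue(int *cond_futex, int *mutex_futex)
{
    return futex_requeue(cond_futex, 1, INT_MAX, mutex_futex);
}

/* Assumed fallback when requeue is disabled: wake every waiter. */
long broadcast_with_wake_all(int *cond_futex)
{
    return futex_wake(cond_futex, INT_MAX);
}
```

The requeue variant avoids waking every waiter at once, but it moves waiters between two wait queues, which appears to be where this hang originates. Waking everyone is slower under contention and does not involve that step.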
The bug does not reproduce with 2.3.2-95.10.
I've seen this same bug with the Boehm-Demers-Weiser conservative garbage collector (aka libgc): http://www.hpl.hp.com/personal/Hans_Boehm/gc/ It was fixed by the updated glibc I got from here: ftp://people.redhat.com/jakub/glibc/errata/2.3.2-95.20/
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-212.html
Here is a simplified reproducer (it hangs with -b on a glibc that does not have FUTEX_REQUEUE (or FUTEX_CMP_REQUEUE) commented out):

#define _XOPEN_SOURCE 500
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

pthread_mutex_t mtx;
pthread_cond_t cv;
int broadcast;
int nn;

/* Worker: wait until a job is available, then consume it.  */
void *
tf (void *arg)
{
  for (;;)
    {
      pthread_mutex_lock (&mtx);
      while (!nn)
        pthread_cond_wait (&cv, &mtx);
      --nn;
      pthread_mutex_unlock (&mtx);
    }
}

int
main (int argc, char **argv)
{
  int i, spins = 0;
  pthread_mutexattr_t mtxa;

  pthread_mutexattr_init (&mtxa);
  pthread_mutexattr_settype (&mtxa, PTHREAD_MUTEX_ERRORCHECK_NP);
  pthread_mutex_init (&mtx, &mtxa);
  pthread_cond_init (&cv, NULL);

  /* -b: mix broadcast and signal; -B: always broadcast.  */
  if (argc > 1)
    {
      if (!strcmp (argv[1], "-b"))
        broadcast = 1;
      else if (!strcmp (argv[1], "-B"))
        broadcast = 2;
    }

  for (i = 0; i < 40; i++)
    {
      pthread_t th;
      pthread_create (&th, NULL, tf, NULL);
    }

  pthread_mutex_lock (&mtx);
  for (;;)
    {
      if ((spins++ % 1000) == 0)
        write (1, ".", 1);
      pthread_mutex_unlock (&mtx);

      pthread_mutex_lock (&mtx);
      int njobs = rand () % 41;
      nn = njobs;
      if (broadcast && (broadcast > 1 || (rand () % 30) == 0))
        pthread_cond_broadcast (&cv);
      else
        while (njobs--)
          pthread_cond_signal (&cv);
    }
}

The hang occurs even if cond->__data.__lock is held during the futex (FUTEX_REQUEUE) syscall. It happens only with the -b option, not without any options or with -B, so mixing pthread_cond_broadcast with pthread_cond_signal calls is essential.
*** Bug 121283 has been marked as a duplicate of this bug. ***
Was this bug accidentally linked to the wrong errata? I fail to see how an updated shadow-utils rpm resolves a problem with glibc/pthreads...
No, the reference is correct. shadow-utils has to be updated in addition to glibc.