Bug 115349

Summary: mutex hang when using pthread_cond_broadcast() under high contention
Product: Red Hat Enterprise Linux 3
Component: glibc
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: John G. Myers <jgmyers>
Assignee: Jakub Jelinek <jakub>
CC: drepper, jbs, roland, szabka, tao, van.okamura
Doc Type: Bug Fix
Last Closed: 2004-05-12 01:28:24 UTC
Attachments: Test program (flags: none)

Description John G. Myers 2004-02-11 01:57:21 UTC

Description of problem:
The attached test program hangs when run on a dual Xeon 2.4 GHz box.

The main thread (and some of the worker threads) blocks in futex_wait,
waiting to acquire the mutex "mtx", which is unlocked.  Attaching and
detaching a debugger causes the program to continue, as does sending
the process a STOP and CONT signal.
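
The STOP/CONT workaround mentioned above can be sketched in C. This is only an illustration: nudge_process is a hypothetical helper name, not part of the attached test program, and it does nothing more than "kill -STOP" followed by "kill -CONT" would do from a shell.

```c
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper (not part of the attached test program): stop
   and then resume process PID, as "kill -STOP" followed by
   "kill -CONT" would.  Returns 0 on success, or -1 with errno set if
   either kill() call fails.  */
int
nudge_process (pid_t pid)
{
  if (kill (pid, SIGSTOP) != 0)
    return -1;
  return kill (pid, SIGCONT);
}
```

Calling nudge_process with the pid of the hung cvtest client should have the same effect as sending the two signals by hand. It is a way to get a stuck run moving again, not a fix.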



Version-Release number of selected component (if applicable):
glibc-2.3.2-95.6

How reproducible:
Always

Steps to Reproduce:
1. Compile the attached program with
cc -o cvtest cvtest.c -lpthread

2. In one window, run a server process
./cvtest -s

3. In the other window, run the test client
./cvtest -b


Actual Results:  The test client will hang within minutes.  Attach a
debugger and examine the main thread--it will be in the futex syscall,
inside __lll_mutex_lock_wait.  The futex for the associated mutex will
have value 0.



Expected Results:  The test client continues to print '.' characters.

If you examine the worker threads, you will find some also hanging in
the futex syscall for the same unlocked futex.
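
Since the only visible symptom is that the '.' characters stop, a progress counter can make the hang easier to detect automatically. The following is a minimal sketch under assumptions: note_progress and stalled_since are hypothetical names, and the test program would have to call note_progress wherever it currently prints '.'.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical progress counter; the main loop would bump it at the
   point where it prints '.'.  */
static atomic_ulong progress;

void
note_progress (void)
{
  atomic_fetch_add (&progress, 1);
}

/* Returns true if the counter has not moved since the value stored in
   *LAST_SEEN, then records the current value there.  A watchdog thread
   could call this once per polling interval.  */
bool
stalled_since (unsigned long *last_seen)
{
  unsigned long now = atomic_load (&progress);
  bool stalled = (now == *last_seen);
  *last_seen = now;
  return stalled;
}
```

A watchdog thread that sleeps for a few seconds between calls to stalled_since could report the hang without anyone watching the terminal. The counter is atomic so that the watchdog never needs to take mtx itself.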


Additional info:

If you instead run cvtest with no arguments, causing it to never use
pthread_cond_broadcast(), it will not hang.

Comment 1 John G. Myers 2004-02-11 01:58:21 UTC
Created attachment 97574 [details]
Test program

Comment 2 John G. Myers 2004-02-11 19:04:19 UTC
Kernel is 2.4.21-9.ELsmp


Comment 3 Jakub Jelinek 2004-02-13 07:18:43 UTC
Could you please try ftp://people.redhat.com/jakub/glibc/errata/2.3.2-95.10/
These packages have temporarily disabled FUTEX_REQUEUE.

Comment 4 John G. Myers 2004-02-13 20:56:38 UTC
The bug does not reproduce with 2.3.2-95.10.


Comment 5 Kenneth C. Schalk 2004-04-29 21:03:59 UTC
I've seen this same bug with the Boehm-Demers-Weiser conservative
garbage collector (aka libgc):

http://www.hpl.hp.com/personal/Hans_Boehm/gc/

It was fixed by the updated glibc I got from here:

ftp://people.redhat.com/jakub/glibc/errata/2.3.2-95.20/

Comment 6 John Flanagan 2004-05-12 01:28:25 UTC
An erratum has been issued that should address the problem described in this
bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and where to find the updated files,
please follow the link below. You may reopen this bug report if the solution
does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-212.html


Comment 7 Jakub Jelinek 2004-05-12 13:42:54 UTC
Here is a simplified reproducer. It hangs with -b when glibc has
FUTEX_REQUEUE (or FUTEX_CMP_REQUEUE) enabled, i.e. not commented out:

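#define _XOPEN_SOURCE 500
#include <unistd.h>
#include <stdlib.h>
#include <string.h>     /* strcmp */
#include <pthread.h>

pthread_mutex_t mtx;    /* protects nn */
pthread_cond_t cv;      /* signalled when jobs are posted in nn */
int broadcast;          /* 0: signal only, 1 (-b): mixed, 2 (-B): broadcast only */
int nn;                 /* number of outstanding jobs */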

/* Worker thread: wait until a job is available, take one, repeat.  */
void *
tf (void *arg)
{
  for (;;)
    {
      pthread_mutex_lock (&mtx);
      while (!nn)
        pthread_cond_wait (&cv, &mtx);
      --nn;
      pthread_mutex_unlock (&mtx);
    }
}

int
main (int argc, char **argv)
{
  int i, spins = 0;
  pthread_mutexattr_t mtxa;

  pthread_mutexattr_init (&mtxa);
  pthread_mutexattr_settype (&mtxa, PTHREAD_MUTEX_ERRORCHECK_NP);
  pthread_mutex_init (&mtx, &mtxa);
  pthread_cond_init (&cv, NULL);

  if (argc > 1)
    {
      if (!strcmp (argv[1], "-b"))
        broadcast = 1;
      else if (!strcmp (argv[1], "-B"))
        broadcast = 2;
    }

  for (i = 0; i < 40; i++)
    {
      pthread_t th;
      pthread_create (&th, NULL, tf, NULL);
    }

  pthread_mutex_lock (&mtx);
  for (;;)
    {
      if ((spins++ % 1000) == 0)
        write (1, ".", 1);


      pthread_mutex_unlock (&mtx);

      pthread_mutex_lock (&mtx);

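      /* Post a random number of jobs, then either broadcast once or
         signal once per job.  With -b a broadcast is chosen about one
         time in 30; with -B always.  */
      int njobs = rand () % 41;
      nn = njobs;
      if (broadcast && (broadcast > 1 || (rand () % 30) == 0))
        pthread_cond_broadcast (&cv);
      else
        while (njobs--)
          pthread_cond_signal (&cv);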
    }
}

The hang happens even if cond->__data.__lock is held during the futex
(FUTEX_REQUEUE) syscall. It only hangs with the -b option, not without
any options or with -B, so mixing pthread_cond_broadcast calls with
pthread_cond_signal calls is essential.
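
For readers unfamiliar with the requeue operation, here is a minimal sketch of calling it directly. It is not glibc's internal code: futex_cmp_requeue is a hypothetical wrapper name, and it uses the compare variant (FUTEX_CMP_REQUEUE), which assumes a kernel and headers that provide that operation. The idea is that a broadcast can wake a few waiters on one futex and move the rest onto another (for example the mutex's futex) instead of waking them all.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative wrapper, not glibc's internal code: wake up to NWAKE
   waiters blocked on FROM and move up to NMOVE of the remaining ones
   onto TO, but only if *FROM still equals EXPECTED.  Returns the raw
   syscall result: a count on success, -1 with errno set on failure
   (EAGAIN when *FROM no longer equals EXPECTED).  */
long
futex_cmp_requeue (int *from, int *to, int nwake, int nmove, int expected)
{
  /* The fourth syscall argument, normally a timeout pointer, carries
     the requeue limit for this operation.  */
  return syscall (SYS_futex, from, FUTEX_CMP_REQUEUE, nwake,
                  (long) nmove, to, expected);
}
```

With no threads waiting on the first futex, a call whose expected value matches returns 0, and a call whose expected value does not match fails with EAGAIN. The plain FUTEX_REQUEUE operation takes no expected value and so performs no such check.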

Comment 8 Van Okamura 2004-05-29 03:49:05 UTC
*** Bug 121283 has been marked as a duplicate of this bug. ***

Comment 9 Paul Waterman 2004-06-02 19:56:02 UTC
Was this bug accidentally linked to the wrong errata?

I fail to see how an updated shadow-utils rpm resolves a problem with
glibc/pthreads...

Comment 10 Ulrich Drepper 2004-06-02 23:31:13 UTC
No, the reference is correct.  shadow-utils has to be updated in
addition to glibc.