The following has been reported by IBM LTC:

NPTL: pthread_cond_timedwait hang or mutex_lock hang

Hardware Environment: xSeries (IA-32 machine), Architecture: i686
Software Environment: RHEL 3.0 GOLD. Kernel: 2.4.21-4.ELsmp, glibc-2.3.2-95.3, nptl version: /lib/tls/libpthread-0.60.so

This problem happens on all 3 RHEL 3.0 variants (AS/WS/ES), but most frequently on AS.

Steps to Reproduce:
1. The main thread creates 2 threads (waitThread and sleepThread), cancels them, and calls pthread_cond_timedwait() to be woken up.
2. waitThread also uses a pthread condition, but with a different condition variable. waitThread calls pthread_cond_wait() until it gets cancelled. Once it is cancelled, the waitThread_cleanup routine is invoked. In the cleanup routine, it calls pthread_cond_broadcast() to wake up the main thread.
3. sleepThread sleeps until it gets cancelled. Once it is cancelled, the sleepThread_cleanup routine is invoked. In the cleanup routine, it also calls pthread_cond_broadcast() to wake up the main thread.

Actual Results: Two different symptoms
1) The main thread hangs because pthread_cond_timedwait is not woken up. This is the gdb stack trace:
(gdb) thread 2
[Switching to thread 2 (Thread -1220404304 (LWP 3538))]#0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0xb75d067b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00000dd2 in ?? ()
#3 0x0804b6e0 in ?? ()
#4 0x0804b3fc in sleepcnt ()
#5 0xb75d23a8 in _L_mutex_cond_lock_28 () from /lib/tls/libpthread.so.0
#6 0x0804b6e0 in ?? ()
2) Or, all threads hang trying to get a lock.

Expected Results: The main thread is woken up and ends normally. It works on all other Linux distros, including RH9, for more than 100 loops. It also works on RHEL 3 in the old LinuxThreads mode (with LD_ASSUME_KERNEL set).

Additional Information: This is a test program (mtcond). It is compiled with gcc 2.95.3. There are 3 command line options to run mtcond.
usage: mtcond -l<loopcnt> [-c] [-s sleepcnt]
  loopcnt >=1 : Since the problem sometimes doesn't happen in the 1st loop, the test needs to run in a loop.
  -c : causes waitThread to wait in pthread_cond_wait(). Without the -c option, waitThread just sleeps and there is no problem.
  -s : you may ignore this option.

The typical usage of mtcond to reproduce this problem is:
> mtcond -l100 -c

/***************************************************************************
 * FILENAME mtcond.c
 ***************************************************************************/
#include <pthread.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>   /* gettimeofday() */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <signal.h>
#include <assert.h>

#define THREAD2_INIT 0
#define THREAD2_CREATED 1
#define THREAD2_CANCELED 2
#define THREAD2_ENDED 3

static int condmode = 0;
static int loopcnt = 10;
static int sleepcnt = 0;
static pthread_mutex_t thread2_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t thread_end_cv;

typedef struct thread_status {
    int status;
    pthread_t tid;
    pthread_cond_t cond;
    int flag;
    int locked;
}thread_status_t;

/* Cancellation cleanup for waitThread: drop the lock if the thread still
 * believes it holds it, then retake it to signal the main thread. */
void waitThread_cleanup(void *ptr)
{
    int rc;
    thread_status_t *pStatus = (thread_status_t *)ptr;

    fprintf(stderr, " [WAIT_CLEANUP][%d]: Clean up is called\n", pStatus->tid);
    if(pStatus->locked) {
        fprintf(stderr, " [WAIT_CLEANUP][%d] releasing previous lock\n", pStatus->tid);
        pStatus->locked = 0;
        rc = pthread_mutex_unlock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, " [WAIT_CLEANUP][%d] released previous lock\n", pStatus->tid);
    }
    fprintf(stderr, " [WAIT_CLEANUP][%d] waiting lock\n", pStatus->tid);
    rc = pthread_mutex_lock(&thread2_mutex);
    assert(rc==0);
    pStatus->locked = 1;
    fprintf(stderr, " [WAIT_CLEANUP][%d] got_lock\n", pStatus->tid);
    pStatus->status = THREAD2_ENDED;
    fprintf(stderr, " [WAIT_CLEANUP][%d] sending cond_broadcast\n", pStatus->tid);
    rc = pthread_cond_broadcast(&thread_end_cv);
    assert(rc==0);
    fprintf(stderr, " [WAIT_CLEANUP][%d] releasing lock\n", pStatus->tid);
    rc = pthread_mutex_unlock(&thread2_mutex);
    assert(rc==0);
    pStatus->locked = 0;
    fprintf(stderr, " [WAIT_CLEANUP][%d] released lock\n", pStatus->tid);
    fprintf(stderr, " [WAIT_CLEANUP][%d] Clean up is done\n", pStatus->tid);
}

/* Cancellation cleanup for sleepThread: take the lock and signal the main thread. */
void sleepThread_cleanup(void *ptr)
{
    int rc;
    thread_status_t *pStatus = (thread_status_t *)ptr;

    fprintf(stderr, " [SLEEP_CLEANUP][%d]: Clean up is called\n", pStatus->tid);
    if(!pStatus->locked) {
        fprintf(stderr, " [SLEEP_CLEANUP][%d] waiting lock\n", pStatus->tid);
        rc = pthread_mutex_lock(&thread2_mutex);
        assert(rc==0);
        pStatus->locked = 1;
        fprintf(stderr, " [SLEEP_CLEANUP][%d] got_lock\n", pStatus->tid);
    }
    pStatus->status = THREAD2_ENDED;
    fprintf(stderr, " [SLEEP_CLEANUP][%d] sending cond_broadcast\n", pStatus->tid);
    rc = pthread_cond_broadcast(&thread_end_cv);
    assert(rc==0);
    fprintf(stderr, " [SLEEP_CLEANUP][%d] releasing lock\n", pStatus->tid);
    rc = pthread_mutex_unlock(&thread2_mutex);
    assert(rc==0);
    pStatus->locked = 0;
    fprintf(stderr, " [SLEEP_CLEANUP][%d] released lock\n", pStatus->tid);
    fprintf(stderr, " [SLEEP_CLEANUP][%d]: Clean up is done\n", pStatus->tid);
}

void *
sleepThread (void *status)
{
    int i, rc;
    thread_status_t *tstatus = (thread_status_t *)status;

    tstatus->tid = pthread_self ();
    fprintf (stderr, " [SLEEP][%d]: sleepThread startup \n", tstatus->tid);
    tstatus->status = THREAD2_CREATED;
    pthread_cleanup_push(sleepThread_cleanup, tstatus);
    while(1) {
        sleep(1);
        pthread_testcancel();
    }
    pthread_cleanup_pop(0);
    return status;
}

void *
waitThread (void *status)
{
    int i, rc;
    thread_status_t *tstatus = (thread_status_t *)status;

    tstatus->tid = pthread_self ();
    fprintf (stderr, " [WAIT][%d]: waitThread startup \n", tstatus->tid);
    tstatus->status = THREAD2_CREATED;
    pthread_cleanup_push(waitThread_cleanup, tstatus);
    while(1) {
        if(condmode) {
            fprintf(stderr, " [WAIT][%d] waiting lock\n", tstatus->tid);
            rc = pthread_mutex_lock(&thread2_mutex);
            assert(rc==0);
            tstatus->locked = 1;
            fprintf(stderr, " [WAIT][%d] got_lock\n", tstatus->tid);
            fprintf(stderr, " [WAIT][%d]:call cond_wait\n", tstatus->tid);
            rc = pthread_cond_wait(&tstatus->cond, &thread2_mutex);
            assert(rc==0);
            fprintf(stderr, " [WAIT][%d]:cond wake up\n", tstatus->tid);
            tstatus->locked = 0;
            rc = pthread_mutex_unlock(&thread2_mutex);
            assert(rc==0);
            fprintf(stderr, " WAIT[%d] release_lock\n", tstatus->tid);
        } else {
            sleep(1);
            pthread_testcancel();
        }
    }
    pthread_cleanup_pop(0);
    return status;
}

#define WAIT_TIME_SECONDS 1

static void
loop ()
{
    int i,j, rc;
    pthread_t tid;
    pthread_t wait_tid, sleep_tid;
    void *str[2];
    int old_state; /* Former thread cancellation state */
    thread_status_t wait_status, sleep_status;
    struct timespec ts;
    struct timeval tp;
    pthread_attr_t thread_attr;

    tid = pthread_self ();
    fprintf (stderr, "[%d]: loopThread startup \n", tid);
    sleep(1);
    rc = pthread_cond_init(&thread_end_cv, NULL);
    assert(rc==0);
    for (i = 0; i < loopcnt; i++) {
        fprintf(stderr, "###################################\n");
        fprintf(stderr, "# %d-th loop start\n", i);
        fprintf(stderr, "###################################\n");
        rc = pthread_attr_init (&thread_attr);
        assert(rc==0);
        wait_status.status = THREAD2_INIT;
        wait_status.flag = 1;
        wait_status.locked = 0;
        rc = pthread_cond_init( &wait_status.cond, NULL);
        assert(rc==0);
        rc = pthread_create (&wait_tid, &thread_attr, waitThread, &wait_status);
        if (rc) {
            fprintf (stderr, "[%d]: pthread_create fail (errno=%d)\n", tid, errno);
            break;
        }
        sleep(1);
        sleep_status.status = THREAD2_INIT;
        sleep_status.flag = 0;
        sleep_status.locked = 0;
        rc = pthread_create (&sleep_tid, &thread_attr, sleepThread, &sleep_status);
        if (rc) {
            fprintf (stderr, "[%d]: pthread_create fail (errno=%d)\n", tid, errno);
            break;
        }
        (void) pthread_attr_destroy(&thread_attr);
        sleep (2);
        if(condmode) {
            fprintf(stderr, "[LOOP][%d] calling cond_broadcast to WAIT_THREAD\n", tid);
            rc = pthread_cond_broadcast(&wait_status.cond);
            assert(rc==0);
            fprintf(stderr, "[LOOP][%d] called cond_broadcast to WAIT_THREAD\n", tid);
            if(sleepcnt)
                sleep(sleepcnt);
        }
        while(sleep_status.status != THREAD2_CREATED)
            sleep(1);
        fprintf(stderr, "[LOOP][%d] waiting lock before cancen sleepThread\n", tid);
        rc = pthread_mutex_lock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] got_lock\n", tid);
        fprintf (stderr, "[LOOP]canceling SLEEP_THREAD [%d] \n", sleep_tid);
        rc = pthread_cancel (sleep_tid);
        assert(rc == 0);
        sleep_status.status = THREAD2_CANCELED;
        fprintf(stderr, "[LOOP][%d] releasing lock after cancel sleepThread\n", tid);
        rc = pthread_mutex_unlock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] released lock\n", tid);
        fprintf(stderr, "[LOOP][%d] waiting lock before cancen waitThread\n", tid);
        rc = pthread_mutex_lock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] got_lock\n", tid);
        while(wait_status.status != THREAD2_CREATED)
            sleep(1);
        fprintf (stderr, "[LOOP]canceling WAIT_THREAD [%d] \n", wait_tid);
        rc = pthread_cancel (wait_tid);
        assert(rc == 0);
        wait_status.status = THREAD2_CANCELED;
        fprintf(stderr, "[LOOP][%d] releasing lock after cancel waitThread\n", tid);
        rc = pthread_mutex_unlock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] released lock\n", tid);
        fprintf(stderr, "[LOOP][%d] waiting lock before go into timedwait loop\n", tid);
        rc = pthread_mutex_lock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] got_lock\n", tid);
        while(wait_status.status == THREAD2_CANCELED ||
              sleep_status.status == THREAD2_CANCELED) {
            /* Usually worker threads will loop on these operations */
            rc = gettimeofday(&tp, NULL);
            /* Convert from timeval to timespec */
            ts.tv_sec = tp.tv_sec;
            ts.tv_nsec = tp.tv_usec * 1000;
            ts.tv_sec += WAIT_TIME_SECONDS;
            fprintf(stderr, "[LOOP]waiting in cond_timewait\n");
            rc = pthread_cond_timedwait(&thread_end_cv, &thread2_mutex, &ts);
            fprintf(stderr, "[LOOP]cond_timewait return (rc=%d)\n", rc);
        }
        fprintf(stderr, "[LOOP][%d] releasing lock\n", tid);
        rc = pthread_mutex_unlock(&thread2_mutex);
        assert(rc==0);
        fprintf(stderr, "[LOOP][%d] released lock\n", tid);
        sleep (1);
        rc = pthread_join (sleep_tid, &str[1]);
        if (rc) {
            fprintf (stderr, "[%d]: pthread_join fail (errno=%d)\n", tid, errno);
            break;
        }
        fprintf(stderr, "SLEEP_THREAD is joined\n");
        rc = pthread_join (wait_tid, &str[0]);
        if (rc) {
            fprintf (stderr, "[%d]: pthread_join fail (errno=%d)\n", tid, errno);
            break;
        }
        fprintf(stderr, "WAIT_THREAD is joined\n");
        rc = pthread_cond_destroy( &wait_status.cond);
        assert(rc == 0);
        sleep(1);
    }
    rc = pthread_cond_destroy( &thread_end_cv);
    return;
}

int
main (int argc, char *argv[])
{
    int c;

    while ((c = getopt (argc, argv, "l:cs:")) != EOF) {
        switch (c) {
        case 's':
            sleepcnt = atoi(optarg);
            break;
        case 'c':
            condmode = 1;
            break;
        case 'l':
            loopcnt = atoi (optarg);
            break;
        }
    }
    loop();
    exit (0);
}

I put customer severity to 1 to get more attention.

Sachin / Sreelatha - as I mentioned in our telecon this morning, please have your team look into this. This should be your highest priority. The bug submitter also indicated that she got the same result even if she compiled the testcase on RHEL3 using GCC 3.2 and its toolchain. She also noted that the problem did not happen on RH9 with NPTL. It also did not hang using the old LinuxThreads on RHEL3. The problem only shows up on RHEL3 with NPTL. Thanks.

I think the severity of this bug should be BLOCK. As a result of the new severity level, I'd like Glen/Greg to go ahead and submit this bug to Red Hat immediately. We need this to be resolved by RHEL3 Update1. Thanks.

Glen/Greg - since this now has become a Blocking bug, please submit this to Red Hat as soon as you can. The India team is also working on this bug in parallel. Thanks.
------ Additional Comments From srikrishnan.com 2003-31-10 11:37 ------- Attempting to install RHEL3 and recreate on a IA-32(xSeries) system. Shall also verify by using the latest plain vanilla kernel (2.6.0-test9) on the same system during the weekend.
------ Additional Comments From srikrishnan.com 2003-03-11 09:39 ------- Recreated the problem on RHEL3. It also occurs with the 2.6.0-test9 kernel, during the 1st loop itself:
[LOOP]waiting in cond_timewait
 [WAIT_CLEANUP][-1219912784] released lock
 [WAIT_CLEANUP][-1219912784] Clean up is done
 [SLEEP_CLEANUP][-1230402640]: Clean up is called
 [SLEEP_CLEANUP][-1230402640] waiting lock
------ Additional Comments From agar.com 2003-03-11 22:35 ------- The first test program provided in this defect suggests a problem in the interaction of thread cancellation, condition waits, and mutexes. The following program is a simpler illustration of the problem. It fails and succeeds in the same environments as the first program.

------------------------- Beginning of source code ---------------------------
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <assert.h>

static pthread_mutex_t gbl_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gbl_condv = PTHREAD_COND_INITIALIZER;

void waitThread_cleanup(void *arg)
{
    int rc;

    rc = pthread_mutex_unlock(&gbl_mutex);
    assert(rc == 0);
    return;
}

void *
waitThread(void *arg)
{
    int rc;

    pthread_cleanup_push(waitThread_cleanup, NULL);
    rc = pthread_mutex_lock(&gbl_mutex);
    assert(rc == 0);
    /* wait until this thread is canceled */
    while (1 == 1) {
        rc = pthread_cond_wait(&gbl_condv, &gbl_mutex);
        assert(rc == 0);
    }
    /* this routine never reaches this point */
    rc = pthread_mutex_unlock(&gbl_mutex);
    assert(rc == 0);
    pthread_cleanup_pop(0);
    return NULL;
}

int
main (int argc, char *argv[])
{
    int i, rc;
    pthread_t wait_tid;

    for (i = 0; i < 1000000; i++) {
        fprintf(stderr, "loop %d ", i);
        rc = pthread_create(&wait_tid, NULL, waitThread, NULL);
        assert(rc == 0);
        rc = pthread_cancel(wait_tid);
        assert(rc == 0);
        rc = pthread_join(wait_tid, NULL);
        assert(rc == 0);
    }
    return 0;
}
----------------------------- End of source code -----------------------------

The initial thread simply loops doing the following: create another thread, cancel the other thread, and join with the other thread. The created thread simply loops on calls to pthread_cond_wait() until it is canceled. The thread cancellation cleanup routine simply unlocks the mutex associated with the condition variable.
When the program fails, the second time a thread is created, the created thread hangs forever attempting to lock the mutex. Here is what the threads look like in gdb when the hang occurs:
(gdb) where
#0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0xb75ccc1d in pthread_join () from /lib/tls/libpthread.so.0
#2 0x0804875b in main (argc=1, argv=0xbfffe2e4) at mtcond4.c:62
(gdb) thread 2
[Switching to thread 2 (Thread -1219953744 (LWP 8736))]#0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) where
#0 0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0xb75d067b in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00002220 in ?? ()
#3 0xb75d4b3c in __JCR_LIST__ () from /lib/tls/libpthread.so.0
#4 0x080498e0 in p.0 ()
#5 0xb75cd787 in _L_mutex_lock_28 () from /lib/tls/libpthread.so.0
#6 0xb75d4b3c in __JCR_LIST__ () from /lib/tls/libpthread.so.0
#7 0xb748fbb0 in ?? ()
#8 0xb748fa98 in ?? ()
#9 0x08048658 in waitThread (arg=0x80498e0) at mtcond4.c:27
------ Additional Comments From srikrishnan.com 2003-04-11 13:19 ------- I concur with the previous post. When a thread waiting on a condition variable is cancelled, it seems to take the lock away with it, even though we call pthread_mutex_unlock in the cleanup function. A print of gbl_mutex shows that the mutex is "held" by the LWP which was cancelled, and gbl_condv is "waiting".
(gdb) print gbl_mutex
$1 = {__m_reserved = 2, __m_count = 0, __m_owner = 0x750c, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}
(gdb) print gbl_condv
$2 = {__c_lock = {__status = 0, __spinlock = 0}, __c_waiting = 0x1, __padding =
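The observation above could also be checked programmatically rather than in gdb. The following is only a minimal sketch, assuming POSIX semantics (the mutex is reacquired before the first cancellation cleanup handler runs, so a handler that unlocks it should leave it free after the join). The names demo_mutex, demo_cond, demo_waiter and mutex_released_after_cancel are hypothetical and are not part of either test program in this report.

```c
#include <pthread.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  demo_cond  = PTHREAD_COND_INITIALIZER;

/* Cleanup handler: release the mutex that pthread_cond_wait() is
 * expected to have reacquired before the handler is called. */
static void demo_unlock_cleanup(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

/* Waits on the condition variable until the thread is cancelled. */
static void *demo_waiter(void *arg)
{
    (void)arg;
    pthread_cleanup_push(demo_unlock_cleanup, &demo_mutex);
    pthread_mutex_lock(&demo_mutex);
    for (;;)
        pthread_cond_wait(&demo_cond, &demo_mutex);
    pthread_cleanup_pop(0);   /* never reached; pairs with the push */
    return NULL;
}

/* Returns 1 if the mutex is free after cancel + join, 0 if it is still
 * held (the symptom described in this report), -1 on any other error. */
int mutex_released_after_cancel(void)
{
    pthread_t tid;
    int rc;

    if (pthread_create(&tid, NULL, demo_waiter, NULL) != 0)
        return -1;
    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    /* trylock never blocks, so a leaked lock is reported instead of hanging */
    rc = pthread_mutex_trylock(&demo_mutex);
    if (rc == 0) {
        pthread_mutex_unlock(&demo_mutex);
        return 1;
    }
    return 0;
}
```

Calling mutex_released_after_cancel() in a loop, in the same way the second test program loops, should return 1 every time on a correct implementation. Using pthread_mutex_trylock() is a deliberate choice here so that a failing system reports the leaked lock rather than hanging.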
Can you try ftp://people.redhat.com/jakub/glibc/errata/2.3.2-95.5/ ?
------ Additional Comments From agar.com 2003-06-11 11:30 ------- Could I please get some clarification on the last posting? That post said:

"The reason for the lockup is that the cleanup handlers pthread_cond_wait and the cleanup handler your code register are not executed in the correct order. And this happens because you are using the LinuxThreads headers. If you'd use the NPTL headers everything would work fine."

How is it that our code is using LinuxThreads headers? Is there some rpm that our system should have? Was there some compiler option we should have specified? As an implementer of code that heavily uses threads and runs on a variety of Unix platforms, I expect to be able to code to the POSIX standard and have my program work. I certainly expect to be able to compile the program on a system and have it run correctly on that same system. It seems astonishing that I could do a simple compile on the system and get LinuxThreads headers, then run the program on that same system, where it runs with the NPTL library, and the system cannot handle it. What is going on?

Of course the real issue is binary compatibility. I should be able to compile my code on a prior level of Linux and have the binary run on the newer levels of Linux. Is the patch alluded to in the following paragraph a patch to allow such binary compatibility? Will systems with NPTL run threaded programs compiled on systems that did not have NPTL?

"Since we unfortunately cannot pull the plug on the LT headers yet Jakub came up with a patch which should handle this case. The current CVS has it, as will future releases and updates."
------ Additional Comments From srikrishnan.com 2003-06-11 07:38 ------- Response from Ulrich Drepper in the nptl mailing list(phil-list) http://www.redhat.com/archives/phil-list/2003-November/msg00003.html <quote> The reason for the lockup is that the cleanup handlers pthread_cond_wait and the cleanup handler your code register are not executed in the correct order. And this happens because you are using the LinuxThreads headers. If you'd use the NPTL headers everything would work fine. Since we unfortunately cannot pull the plug on the LT headers yet Jakub came up with a patch which should handle this case. The current CVS has it, as will future releases and updates. </quote>
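To illustrate what "correct order" means in the quoted explanation: POSIX cancellation cleanup handlers run in last-in, first-out order, so a handler registered later (as the one inside pthread_cond_wait would be) should run before the application's handler. The sketch below only demonstrates that ordering with two application handlers. The names record_handler, nested_worker and cleanup_order_is_lifo are hypothetical, and this does not reproduce the LinuxThreads-header interaction itself.

```c
#include <pthread.h>
#include <unistd.h>

static int run_order[2];   /* ids of the handlers, in the order they ran */
static int run_count = 0;

/* Records which handler ran; arg points to the handler's id. */
static void record_handler(void *arg)
{
    if (run_count < 2)
        run_order[run_count++] = *(int *)arg;
}

/* Registers an outer and an inner handler, then blocks until cancelled. */
static void *nested_worker(void *arg)
{
    static int outer_id = 1, inner_id = 2;

    (void)arg;
    pthread_cleanup_push(record_handler, &outer_id);
    pthread_cleanup_push(record_handler, &inner_id);
    for (;;)
        pause();              /* pause() is a cancellation point */
    pthread_cleanup_pop(0);   /* never reached; pairs with the pushes */
    pthread_cleanup_pop(0);
    return NULL;
}

/* Returns 1 if the inner handler ran before the outer one, 0 if not,
 * -1 if the thread could not be created. */
int cleanup_order_is_lifo(void)
{
    pthread_t tid;

    run_count = 0;
    if (pthread_create(&tid, NULL, nested_worker, NULL) != 0)
        return -1;
    pthread_cancel(tid);
    pthread_join(tid, NULL);
    return run_count == 2 && run_order[0] == 2 && run_order[1] == 1;
}
```

If the library's internal handler and the application's handler were run in the opposite order, the application's unlock would happen before the mutex was reacquired, which would be consistent with the stuck mutex seen in the gdb output earlier in this report.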
------ Additional Comments From srikrishnan.com 2003-07-11 09:31 ------- It was a bug in the glibc RPMs. Using the glibc 2.3.2-95.5 (errata) RPMs from Jakub's site fixes the problem. Kindly verify by upgrading the RPMs.

Problem description: "try to maintain correct order of cleanups between those registered with __attribute__((cleanup)) and with LinuxThreads style pthread_cleanup_push/pop(#108631)" (The # refers to the bug number in Red Hat Bugzilla opened for tracking this bug.)

I tried running both test programs. The 1st program (original test case) ran fine for both -l100 and -l1000 (loop = 1000). The 2nd program also ran without any hitch.

I tried to run the program compiled on RHEL3 (with glibc upgraded to 2.3.2-95.5) on a SuSE machine with glibc 2.2.5. It gives an error that "version GLIBC_2.3.2 not found" ..'needed by ./mtcond'.

I compiled the testcase on this SuSE machine (glibc 2.2.5; gcc 3.2) and ran that program on RHEL3 (with glibc upgraded to 2.3.2-95.5). It ran fine. It made use of /lib/tls/libpthread.so.0, as verified using ldd. So, a program compiled on a lower level of glibc (which does not even support NPTL and uses LinuxThreads) runs well on RHEL3, which comes with NPTL as the default threading model.
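As a complement to checking with ldd, a process can ask glibc at run time which thread library it is using. This is a sketch under the assumption that the C library is a glibc version that defines _CS_GNU_LIBPTHREAD_VERSION for confstr(); on other C libraries the helper simply reports failure. The function name thread_library_name is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Copies the thread library description (for example a string starting
 * with "NPTL" or "linuxthreads") into buf.
 * Returns 1 on success, 0 if the information is unavailable or the
 * buffer is too small. */
int thread_library_name(char *buf, size_t len)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    /* confstr() returns the size needed including the terminating NUL,
     * or 0 if the name is not supported. */
    size_t needed = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    return needed > 0 && needed <= len;
#else
    (void)buf;
    (void)len;
    return 0;   /* not a glibc that exposes this confstr() name */
#endif
}
```

A test program could print this string at startup so that each run records whether it executed under NPTL or LinuxThreads, which matters here because the hang only appeared under NPTL.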
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2003-334.html
------ Additional Comments From khoa.com 2003-14-11 11:32 ------- Glen/Greg - the errata from Jakub's website fixes this bug....do you know when it will be officially released by RH ?
------ Additional Comments From keshav.com 2003-14-11 11:46 ------- Will this fix be released on the Web first, before getting on the update CD? Can we have the dates for both web availability (with location) and update CD availability?
------ Additional Comments From dichung.com 2003-12-17 13:57 ------- The glibc patch files at the URL are not downloadable. The files are only listed, not linked to the real files. Should we re-open this defect or just ask them to fix this URL? http://rhn.redhat.com/errata/RHSA-2003-334.html
------ Additional Comments From dichung.com 2003-12-17 18:49 ------- Please ignore my previous comment. I neglected to read the page carefully. It states that the actual download is available only from Red Hat Network.
------ Additional Comments From khoa.com 2004-03-28 17:23 ------- Red Hat reported that this fix should be in RHEL3 U2. Please re-open this bug report if this is not the case. Thanks.