From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
When using pthread_kill and signal handling to perform thread suspension, we get an unexplained deadlock. It happens with both ia32 and x86_64 compiled code. For example:

Suspender thread runs:
    // ... ensure suspendee running ...
    pthread_kill(suspendee, suspendSignal)

While suspendee threads run:
    setupSigAltStack();
    notifyRunning();
    while (notSuspendedEnough())
        sched_yield();

Version-Release number of selected component (if applicable):
kernel-2.6.9-1.906_EL

How reproducible:
Always

Steps to Reproduce:
Will attach a repro in which the main thread signals a number of threads, which acknowledge and then wait until signalled again.
1. gcc -g -Wall -lpthread -o susphello susphello.c
2. ./susphello
3. Wait less than a minute for it to lock up

Actual Results:
Deadlocks after a random amount of time, normally less than 10 seconds. No doubt hardware dependent; I was using a two-way machine with hyperthreading. Upon deadlock, the thread we are waiting for shows the signal as pending (via procfs/ps), and both procfs and gdb show the thread is in a system call (or at least at a system call boundary). WCHAN shows "-" and the "sys-rq trace" shows RUNNING (user code).

Expected Results:
The suspendee should receive the suspend signal and acknowledge it with either sem_post or pthread_kill (as defined in the test case).

Additional info:
uname: 2.6.9-1.906_ELsmp #1 SMP Sun Dec 12 23:05:02 EST 2004 x86_64 x86_64 x86_64 GNU/Linux
rpm -q --queryformat '\n%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' glibc:
glibc-2.3.4-2.i686
glibc-2.3.4-2.x86_64
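For readers without access to the attachment, the handshake described in this report can be sketched roughly as follows. This is not the attached susphello.c: the names run_suspend_rounds, suspend_handler and suspendee_main, the choice of SIGUSR1 as the suspend signal, and the use of a single suspendee are all assumptions made for illustration, and the alternate signal stack setup is omitted. On a kernel with the bug, the wait for the acknowledgement is where the suspender would be expected to hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define SUSPEND_SIGNAL SIGUSR1   /* assumption: the report does not name the signal */

static sem_t ack_sem;            /* posted by the suspendee's signal handler */
static sem_t running_sem;        /* posted once the suspendee is running */
static atomic_int stop_flag;     /* tells the suspendee to exit its loop */

/* Wait on a semaphore, retrying if interrupted by a signal. */
static int sem_wait_retry(sem_t *sem)
{
    int rc;
    while ((rc = sem_wait(sem)) == -1 && errno == EINTR)
        continue;
    return rc;
}

/* Acknowledge the suspend request; sem_post is async-signal-safe. */
static void suspend_handler(int signo)
{
    int saved_errno = errno;
    (void)signo;
    sem_post(&ack_sem);
    errno = saved_errno;
}

/* Suspendee: announce that it is running, then spin in sched_yield(). */
static void *suspendee_main(void *arg)
{
    (void)arg;
    sem_post(&running_sem);
    while (!atomic_load(&stop_flag))
        sched_yield();
    return NULL;
}

/* Suspender: signal the suspendee `rounds` times, waiting for each ack.
 * Returns the number of acknowledgements received, or -1 on setup failure. */
int run_suspend_rounds(int rounds)
{
    struct sigaction sa;
    pthread_t suspendee;
    int acks = 0;

    assert(rounds >= 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = suspend_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SUSPEND_SIGNAL, &sa, NULL) != 0)
        return -1;
    if (sem_init(&ack_sem, 0, 0) != 0)
        return -1;
    if (sem_init(&running_sem, 0, 0) != 0) {
        sem_destroy(&ack_sem);
        return -1;
    }

    atomic_store(&stop_flag, 0);
    if (pthread_create(&suspendee, NULL, suspendee_main, NULL) != 0) {
        sem_destroy(&running_sem);
        sem_destroy(&ack_sem);
        return -1;
    }

    /* Ensure the suspendee is running before the first signal is sent. */
    if (sem_wait_retry(&running_sem) == 0) {
        for (int i = 0; i < rounds; i++) {
            if (pthread_kill(suspendee, SUSPEND_SIGNAL) != 0)
                break;
            if (sem_wait_retry(&ack_sem) != 0)
                break;
            acks++;
        }
    }

    atomic_store(&stop_flag, 1);
    pthread_join(suspendee, NULL);
    sem_destroy(&running_sem);
    sem_destroy(&ack_sem);
    return acks;
}
```

Calling run_suspend_rounds(1000) from a small main() and comparing the return value with the requested round count gives a quick check. Build it the same way as the repro, linking against the pthread library.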
Created attachment 110564
Repro

gcc -g -Wall -lpthread -o susphello susphello.c && ./susphello
I'm not able to reproduce this on my HT IA32 box, but I can reproduce it readily on a 4-way x86_64 (EM64T) box. Both boxes are running kernel-2.6.9-5.EL and glibc-2.3.4-2. Another data point: copying the 32-bit susphello binary to the x86_64 machine and running it there results in the lockup as well.
Created attachment 110755
Patch adding the missing "lock" prefix

The attached patch seems to fix the issue. I will post the patch to the upstream kernel as well.
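As background on why a missing "lock" prefix matters on SMP x86: a read-modify-write instruction without it is not atomic with respect to other CPUs, so concurrent updates to the same word can be lost. The patch itself is in the attachment and is not reproduced here. The following is only a user-space sketch of the general idea using C11 atomics, for which x86 compilers typically emit lock-prefixed instructions. The names run_atomic_counter, add_worker and the constants are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define INCREMENTS_PER_THREAD 100000
#define MAX_THREADS 16

static_assert(INCREMENTS_PER_THREAD > 0, "each worker must do some work");

/* Counter shared by all workers; updated only through atomic operations. */
static atomic_long shared_counter;

/* Each worker performs a fixed number of atomic read-modify-write updates. */
static void *add_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++)
        atomic_fetch_add(&shared_counter, 1);
    return NULL;
}

/* Run `nthreads` workers concurrently and return the final counter value,
 * or -1 if the arguments are out of range or a thread could not be created. */
long run_atomic_counter(int nthreads)
{
    pthread_t threads[MAX_THREADS];
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    atomic_store(&shared_counter, 0);
    for (; started < nthreads; started++) {
        if (pthread_create(&threads[started], NULL, add_worker, NULL) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    if (started != nthreads)
        return -1;
    return atomic_load(&shared_counter);
}
```

Because every update is atomic, the result equals nthreads * INCREMENTS_PER_THREAD. If the counter were a plain long incremented without atomics, updates could be lost on a multiprocessor machine, which is the class of failure a missing "lock" prefix allows.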
I have verified that this patch resolves the problem demonstrated by the repro case. Thanks, Suresh.
Folks at BEA, this is slated for inclusion in U1 Beta. Please reply with your testing of this particular item once we make the Beta available to you, thanks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-420.html