Bug 146911 - Thread suspension via async signal fails on rhel4-rc2
Summary: Thread suspension via async signal fails on rhel4-rc2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-02-02 18:10 UTC by David Simms
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-06-08 15:13:42 UTC
Target Upstream Version:
Embargoed:


Attachments
Repro (8.27 KB, text/plain)
2005-02-02 18:12 UTC, David Simms
Patch adding the missing "lock" prefix (566 bytes, patch)
2005-02-07 21:11 UTC, Suresh Siddha


Links
System: Red Hat Product Errata
ID: RHSA-2005:420
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 1
Last Updated: 2005-06-08 04:00:00 UTC

Description David Simms 2005-02-02 18:10:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
When using pthread_kill and signal handling to perform thread
suspension, we get an unexplained deadlock. It happens with both ia32
and x86_64 compiled code.

E.g.:

Suspender thread runs...

// ... ensure suspendee running...
pthread_kill(suspendee, suspendSignal);

While suspendee threads run...

   setupSigAltStack();
   notifyRunning();
   while (notSuspendedEnough())
     sched_yield();


Version-Release number of selected component (if applicable):
kernel-2.6.9-1.906_EL

How reproducible:
Always

Steps to Reproduce:
I will attach a repro in which the main thread signals a number of
threads, which acknowledge and then wait until they are signalled again.

1. gcc -g -Wall -lpthread -o susphello susphello.c
2. ./susphello
3. Wait less than a minute for it to lock up
    

Actual Results:  Deadlocks after a random amount of time, normally less
than 10 seconds. This is no doubt hardware dependent; I was using a
two-way machine with hyperthreading.

Upon deadlock, the thread we are waiting for shows the signal as
pending (via procfs/ps), and both procfs and gdb show the thread is in
a system call (or at least at a system call boundary). WCHAN shows "-"
and the sys-rq trace shows RUNNING (user code).



Expected Results:  The suspendee should receive the suspend signal and
acknowledge it, with either sem_post or pthread_kill (as defined in the
test case).

Additional info:


uname: 2.6.9-1.906_ELsmp #1 SMP Sun Dec 12 23:05:02 EST 2004 x86_64
x86_64 x86_64 GNU/Linux

rpm -q --queryformat '\n%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' glibc:

glibc-2.3.4-2.i686 
glibc-2.3.4-2.x86_64

Comment 1 David Simms 2005-02-02 18:12:17 UTC
Created attachment 110564 [details]
Repro

gcc -g -Wall -lpthread -o susphello susphello.c && ./susphello

Comment 2 Jay Turner 2005-02-03 08:12:31 UTC
I'm not able to reproduce on my HT IA32 box, but am able to reproduce readily on
4-way x86_64 (EM64T) box.  Both boxes are running kernel-2.6.9-5.EL and
glibc-2.3.4-2.

Another bit of data is that transferring the 32-bit susphello to the x86_64
machine and running that results in the lock as well.

Comment 5 Suresh Siddha 2005-02-07 21:11:27 UTC
Created attachment 110755 [details]
Patch adding the missing "lock" prefix

The attached patch seems to fix the issue. I will post the patch to the
upstream kernel as well.
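
As background on why a missing "lock" prefix matters: on x86, a read-modify-write instruction is only atomic with respect to other processors when it carries the lock prefix, so an unlocked update of a word shared between CPUs can silently lose a concurrent change. The patch itself is not reproduced in this report, so the following is only a user-space sketch of the general principle using C11 atomics. The names count_with_atomic_rmw, adder and shared_counter are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_WORKERS 16

static atomic_long shared_counter;
static long iterations_per_thread;

static void *adder(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < iterations_per_thread; i++) {
        /* Atomic read-modify-write; x86 compilers typically emit a
         * lock-prefixed instruction here. A plain "counter++" on a
         * non-atomic variable would be a data race and could lose
         * updates on SMP. */
        atomic_fetch_add(&shared_counter, 1);
    }
    return NULL;
}

/* Runs `nthreads` threads that each add 1 to a shared counter `iters`
 * times. Returns the final counter value, or -1 on setup failure. */
long count_with_atomic_rmw(int nthreads, long iters)
{
    pthread_t tids[MAX_WORKERS];
    int created = 0, i;

    if (nthreads < 1 || nthreads > MAX_WORKERS || iters < 0)
        return -1;
    atomic_store(&shared_counter, 0);
    iterations_per_thread = iters;

    while (created < nthreads &&
           pthread_create(&tids[created], NULL, adder, NULL) == 0)
        created++;
    for (i = 0; i < created; i++)
        pthread_join(tids[i], NULL);

    return created == nthreads ? atomic_load(&shared_counter) : -1;
}
```

With atomic updates the result is always nthreads * iters. The same loop written with an unlocked increment may come up short on a multiprocessor machine, which is the kind of lost update a missing lock prefix allows.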

Comment 8 Johan Walles 2005-02-11 17:18:34 UTC
I have verified that this patch resolves the problem demonstrated by
the repro case.  Thanks, Suresh.


Comment 13 Bob Johnson 2005-03-01 17:39:02 UTC
Folks at BEA, this is slated for inclusion in U1 Beta.
Please reply with your testing of this particular item once we make
the Beta available to you, thanks.

Comment 15 Tim Powers 2005-06-08 15:13:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-420.html


