Bug 146911 - Thread suspension via async signal fails on rhel4-rc2
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64 Linux
Priority: medium Severity: high
Assigned To: Ingo Molnar
QA Contact: Brian Brock
Reported: 2005-02-02 13:10 EST by David Simms
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2005-06-08 11:13:42 EDT


Attachments
Repro (8.27 KB, text/plain)
2005-02-02 13:12 EST, David Simms
Patch adding the missing "lock" prefix (566 bytes, patch)
2005-02-07 16:11 EST, Suresh Siddha

Description David Simms 2005-02-02 13:10:57 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
When using pthread_kill and signal handling to perform thread
suspension, we get an unexplained deadlock. It happens for both ia32
and x86_64 compiled code.

E.g.:

The suspender thread runs:

// ... ensure the suspendee is running ...
pthread_kill(suspendee, suspendSignal);

While the suspendee threads run:

   setupSigAltStack();
   notifyRunning();
   while (notSuspendedEnough())
     sched_yield();


Version-Release number of selected component (if applicable):
kernel-2.6.9-1.906_EL

How reproducible:
Always

Steps to Reproduce:
I will attach a repro in which the main thread signals a number of
threads, which acknowledge and then wait until signalled again.

1. gcc -g -Wall -lpthread -o susphello susphello.c
2. ./susphello
3. Wait less than a minute for it to lock up
    

Actual Results:  Deadlocks after a random amount of time, normally
less than 10 seconds. No doubt hardware dependent; I was using a
two-way machine with hyperthreading.

Upon deadlock, the thread we are waiting for shows the signal as
pending (via procfs/ps), and both procfs and gdb show the thread is in
a system call (or at least at a system call boundary). WCHAN shows "-"
and the "sys-rq trace" shows RUNNING (user code).



Expected Results:  The suspendee should receive the suspend signal and
acknowledge it with either sem_post or pthread_kill (as defined in the
test case).

Additional info:


uname: 2.6.9-1.906_ELsmp #1 SMP Sun Dec 12 23:05:02 EST 2004 x86_64
x86_64 x86_64 GNU/Linux

rpm -q --queryformat '\n%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' glibc:

glibc-2.3.4-2.i686 
glibc-2.3.4-2.x86_64
Comment 1 David Simms 2005-02-02 13:12:17 EST
Created attachment 110564
Repro

gcc -g -Wall -lpthread -o susphello susphello.c && ./susphello
Comment 2 Jay Turner 2005-02-03 03:12:31 EST
I'm not able to reproduce on my HT IA32 box, but am able to reproduce readily on
a 4-way x86_64 (EM64T) box.  Both boxes are running kernel-2.6.9-5.EL and
glibc-2.3.4-2.

Another bit of data is that transferring the 32-bit susphello to the x86_64
machine and running it there results in the lockup as well.
Comment 5 Suresh Siddha 2005-02-07 16:11:27 EST
Created attachment 110755
Patch adding the missing "lock" prefix

The attached patch seems to fix the issue. I will post the patch to the
upstream kernel as well.
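
The patch itself is not reproduced in this report, so the following is general background only and not a reconstruction of the fix. On x86 SMP, a read-modify-write instruction on memory is atomic with respect to other CPUs only when it carries the "lock" prefix; without it, two CPUs updating the same word can lose one of the updates. In user-space C, C11 atomic read-modify-write operations give that guarantee (on x86 they compile to lock-prefixed instructions). The flag names and the function run_flag_race below are hypothetical and do not correspond to any kernel flags.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define FLAG_A 0x1u   /* hypothetical flag bits, not the kernel's */
#define FLAG_B 0x2u

static atomic_uint flags;

/* Repeatedly sets and clears FLAG_A with atomic read-modify-write ops. */
static void *toggler(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        atomic_fetch_or(&flags, FLAG_A);
        atomic_fetch_and(&flags, ~FLAG_A);
    }
    return NULL;
}

/* Sets FLAG_B once while the toggler is hammering the same word. */
static void *setter(void *arg)
{
    (void)arg;
    atomic_fetch_or(&flags, FLAG_B);
    return NULL;
}

/* Runs both threads and returns the final value of the flags word. */
static unsigned run_flag_race(long iterations)
{
    pthread_t a, b;

    assert(iterations >= 0);
    atomic_store(&flags, 0u);
    if (pthread_create(&a, NULL, toggler, &iterations) != 0)
        abort();
    if (pthread_create(&b, NULL, setter, NULL) != 0)
        abort();
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&flags);
}
```

Because every update is an atomic read-modify-write, FLAG_B always survives the concurrent toggling of FLAG_A, so run_flag_race returns FLAG_B. If the same updates were done with plain, non-atomic loads and stores, the set of FLAG_B could be overwritten by a stale value, which is the kind of lost update a missing "lock" prefix permits.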
Comment 8 Johan Walles 2005-02-11 12:18:34 EST
I have verified that this patch resolves the problem demonstrated by
the repro case.  Thanks, Suresh.
Comment 13 Bob Johnson 2005-03-01 12:39:02 EST
Folks at BEA, this is slated for inclusion in U1 Beta.
Please reply with your testing of this particular item once we make
the Beta available to you, thanks.
Comment 15 Tim Powers 2005-06-08 11:13:43 EDT
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-420.html
