Bug 138905 - Unkillable processes under 64bit Linux which use Kernel Asynchronous I/O
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64 Linux
Priority: medium  Severity: high
Assigned To: Jeffrey Moyer
QA Contact: Brian Brock
Reported: 2004-11-11 16:27 EST by Wim ten Have
Modified: 2007-11-30 17:07 EST (History)

Doc Type: Bug Fix
Last Closed: 2005-05-18 09:28:30 EDT


Attachments
fix race condition in add_wait_queue_cond (444 bytes, patch)
2004-12-10 11:07 EST, Jeffrey Moyer

Description Wim ten Have 2004-11-11 16:27:10 EST
Description of problem:
Running our database server on 64-bit Linux distributions leads to a
rather nasty problem when Kernel Asynchronous I/O is enabled.  It
occurs when we run our 32-bit product on 64-bit distributions, as we
do for the AMD Opteron or XEON2, and it also reproduces easily when
running our 64-bit product ported to the IBM Pseries PowerPC architecture.

Our database server queues its I/O with aio_read() and aio_write();
/lib/tls/librtkaio.so translates these calls to the Linux KAIO io_*() API.

Every so often, when our database server reaches a specific idle state
or when it knows that multiple asynchronous I/Os are queued, we poll
our queued aiocbs with aio_error() to obtain the result of each I/O
call.

Soon after we queue many I/Os into the OS, aio_error() returns -1
with errno indicating EINPROGRESS.  Our polling loop never exits,
i.e. the I/Os never seem to complete and EINPROGRESS is returned
forever.
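
The polling pattern described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the
database server's actual code: submit_read() and poll_aio_bounded() are
hypothetical helper names, and the upper bound on the number of polls
is an assumption added so that a request which never completes cannot
trap the caller in the loop.  POSIX specifies that aio_error() returns
EINPROGRESS as its return value while a request is pending, so the
sketch tests the return value directly.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: fill in a caller-owned aiocb and queue one read.
 * Returns 0 if the request was queued, -1 (with errno set) otherwise. */
int submit_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t offset)
{
    assert(cb != NULL && buf != NULL);
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    return aio_read(cb);
}

/* Hypothetical helper: poll one request with aio_error(), sleeping
 * delay_ns nanoseconds (must be below one second) between polls and
 * giving up after max_polls sleeps.
 * Returns 0 when the request completed successfully, an errno-style
 * code when it failed, or EINPROGRESS when it is still pending. */
int poll_aio_bounded(const struct aiocb *cb, int max_polls, long delay_ns)
{
    struct timespec delay;
    int status;

    assert(cb != NULL && delay_ns >= 0 && delay_ns < 1000000000L);
    delay.tv_sec = 0;
    delay.tv_nsec = delay_ns;

    status = aio_error(cb);
    while (status == EINPROGRESS && max_polls-- > 0) {
        nanosleep(&delay, NULL);
        status = aio_error(cb);
    }
    return status;
}
```

When poll_aio_bounded() returns 0, aio_return() on the same aiocb
yields the byte count.  When it still returns EINPROGRESS, the aiocb
and its buffer have to stay allocated, because the request may still
be outstanding.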

If you try to SIGKILL (kill -9) the responsible process, you will
find that it turns into a zombie that is stuck in the kernel and
unkillable, i.e. a system reboot is required to remove the process
from the platform.
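
One way to confirm the state of such a process after kill -9 is to
read the state field of /proc/<pid>/stat, where Z means zombie and D
means uninterruptible sleep.  The sketch below is only an illustration:
state_from_stat_line() and process_state() are hypothetical helper
names, and it assumes the usual "pid (comm) state ..." layout of that
file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the one-letter process state from the text of /proc/<pid>/stat.
 * The command name sits in parentheses and may itself contain spaces or
 * ')', so the state is located after the LAST ')' on the line.
 * Returns '?' when the line does not have the expected shape. */
char state_from_stat_line(const char *line)
{
    const char *close_paren;

    assert(line != NULL);
    close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}

/* Read /proc/<pid>/stat and return the process state, or '?' when the
 * file cannot be opened or read (for example, the process is gone). */
char process_state(long pid)
{
    char path[64];
    char line[1024];
    char state = '?';
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '?';
    if (fgets(line, sizeof line, fp) != NULL)
        state = state_from_stat_line(line);
    fclose(fp);
    return state;
}
```

Calling process_state() with the server's pid a few seconds after the
kill -9 shows whether the process is still present and in which state.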

Version-Release number of selected component (if applicable):
Linux RHEL 3.0
Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
Kernel 2.4.21-20.ELsmp on an x86_64
Linux aseamd2 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:34:58 EDT 2004
x86_64 x86_64 x86_64 GNU/Linux
rpm -qf /lib/tls/librtkaio-2.3.2.so
glibc-2.3.2-95.27

How reproducible:
Please contact Ghaf Toorani <gtoorani@sybase.com> to obtain a small
distribution of our product which includes a small reproduction
script.  It can be provided for Intel (AMD Opteron) or IBM Pseries (PowerPC).

Steps to Reproduce:
1. Install product 
2. Run shell script
3. kill -9 the process whose aio_error() calls keep returning EINPROGRESS
  
Actual results:
Kernel Asynchronous I/O does not work under 64bit implementations:
queued I/Os never complete and the polling process cannot be killed.

Expected results:


Additional info:
Please contact Ghaf Toorani <gtoorani@sybase.com> to obtain additional
results.
Comment 1 Jeffrey Moyer 2004-11-12 17:00:52 EST
I sent mail asking for the reproducer.  Any further info should be posted to the
bugzilla, not communicated in private email.
Comment 2 Jeffrey Moyer 2004-12-10 11:07:27 EST
Created attachment 108324
fix race condition in add_wait_queue_cond

The attached patch fixes the problem for me.  Please verify this works in your
environment.  I'll be posting the patch for internal review shortly.
Comment 3 Ernie Petrides 2005-01-06 14:22:04 EST
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.6.EL).
Comment 4 Jeffrey Moyer 2005-04-08 09:17:52 EDT
*** Bug 132494 has been marked as a duplicate of this bug. ***
Comment 5 Tim Powers 2005-05-18 09:28:30 EDT
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html
