Bug 138905 - Unkillable processes under 64bit Linux which use Kernel Asynchronous I/O
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Moyer
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-11-11 21:27 UTC by Wim ten Have
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:28:30 UTC
Target Upstream Version:
Embargoed:


Attachments
fix race condition in add_wait_queue_cond (444 bytes, patch)
2004-12-10 16:07 UTC, Jeff Moyer


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2005:294 | 0 | normal | SHIPPED_LIVE | Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5 | 2005-05-18 04:00:00 UTC

Description Wim ten Have 2004-11-11 21:27:10 UTC
Description of problem:
Running our database server on 64-bit Linux distributions leads to a
rather nasty problem when Kernel Asynchronous I/O is enabled.  The
problem occurs when we run our 32-bit product on 64-bit distributions,
as we do for the AMD Opteron or XEON2, and it also reproduces easily when
running our 64-bit product ported to the IBM pSeries PowerPC architecture.

Our database server queues I/O with aio_read() and aio_write() operations,
which /lib/tls/librtkaio.so translates to the Linux KAIO io_*() API.

Every so often, when our database server reaches a specific idle state or
when it is known that multiple asynchronous I/Os are queued, we poll
our queued aiocbs with aio_error() to obtain the return status of each
I/O call.

Soon after queueing many I/Os into the OS, aio_error() returns -1
with errno indicating EINPROGRESS.  Our poll loop never exits; that is,
the I/Os never seem to complete and EINPROGRESS is returned
forever.
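
The polling pattern described above can be sketched as follows. This is a minimal illustration built on the POSIX AIO calls named in this report, not the reporter's actual code: the helper names start_read and poll_aio_until, the 1 ms sleep between polls, and the timeout parameter are all assumptions made for the example. Note that POSIX specifies aio_error() as returning EINPROGRESS directly as its return value. Adding a deadline lets the caller notice a request that never completes instead of polling forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: queue one asynchronous read with no completion signal. */
int start_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t offset)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;
    return aio_read(cb);
}

/* Hypothetical helper: poll one request until it completes or the deadline
 * passes. Returns 0 on completion (byte count stored in *result), or -1 with
 * errno set: ETIMEDOUT if the request is still EINPROGRESS at the deadline,
 * otherwise the error reported for the request. */
int poll_aio_until(struct aiocb *cb, int timeout_ms, ssize_t *result)
{
    struct timespec start, now;
    struct timespec nap = { 0, 1000000 };   /* 1 ms between polls (assumed) */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        int status = aio_error(cb);
        if (status == -1)
            return -1;                      /* aio_error() itself failed */
        if (status != EINPROGRESS) {
            ssize_t n = aio_return(cb);     /* collect result, free resources */
            if (status != 0) {
                errno = status;             /* the I/O failed */
                return -1;
            }
            *result = n;
            return 0;
        }
        clock_gettime(CLOCK_MONOTONIC, &now);
        long long elapsed_ms = (long long)(now.tv_sec - start.tv_sec) * 1000
                             + (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed_ms >= timeout_ms) {
            errno = ETIMEDOUT;              /* still in progress: give up */
            return -1;
        }
        nanosleep(&nap, NULL);
    }
}
```

A caller would invoke start_read() once per request and then poll_aio_until() on each aiocb; an ETIMEDOUT result corresponds to the "EINPROGRESS forever" symptom reported here. On glibc versions before 2.34 the AIO functions live in librt, so the program needs to be linked with -lrt.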

If you try to SIGKILL (kill -9) the responsible process, you will
find that it becomes a zombie, locked in the kernel and unkillable.
That is, a system reboot is required to remove the process from the system.
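
To check what state such a process is in before resorting to a reboot, its state letter can be read from /proc. The sketch below assumes a Linux procfs mounted at /proc; the helper name proc_state is hypothetical and not part of any library. Per proc(5), 'D' means uninterruptible sleep and 'Z' means zombie; a pending SIGKILL is not acted on while a task remains in 'D'.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return the state letter from /proc/<pid>/stat
 * ('R', 'S', 'D', 'Z', ...), or '?' if it cannot be read or parsed. */
char proc_state(pid_t pid)
{
    char path[64];
    char line[512];

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return '?';

    /* Format is "pid (comm) state ..."; comm may itself contain spaces or
     * parentheses, so locate the LAST ')' and take the field after it. */
    char *close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}
```

Calling proc_state() on the PID of the stuck database process after the kill -9 would show whether it is sitting in 'D' or 'Z'.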

Version-Release number of selected component (if applicable):
Linux RHEL 3.0
Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
Kernel 2.4.21-20.ELsmp on an x86_64
Linux aseamd2 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:34:58 EDT 2004
x86_64 x86_64 x86_64 GNU/Linux
rpm -qf /lib/tls/librtkaio-2.3.2.so
glibc-2.3.2-95.27

How reproducible:
Please contact Ghaf Toorani <gtoorani> to obtain a small
distribution of our product, which includes a small reproduction
script.  It is available for Intel (AMD Opteron) or IBM pSeries (PowerPC).

Steps to Reproduce:
1. Install product 
2. Run shell script
3. kill -9 the aio_error() -> EINPROGRESS process
  
Actual results:
Kernel Asynchronous I/O under 64bit implementations does not work.

Expected results:


Additional info:
Please contact Ghaf Toorani <gtoorani> to obtain additional
results.

Comment 1 Jeff Moyer 2004-11-12 22:00:52 UTC
I sent mail asking for the reproducer.  Any further info should be posted to the
bugzilla, not communicated in private email.

Comment 2 Jeff Moyer 2004-12-10 16:07:27 UTC
Created attachment 108324
fix race condition in add_wait_queue_cond

The attached patch fixes the problem for me.  Please verify this works in your
environment.  I'll be posting the patch for internal review shortly.

Comment 3 Ernie Petrides 2005-01-06 19:22:04 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.6.EL).


Comment 4 Jeff Moyer 2005-04-08 13:17:52 UTC
*** Bug 132494 has been marked as a duplicate of this bug. ***

Comment 5 Tim Powers 2005-05-18 13:28:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


