
Bug 138905

Summary: Unkillable processes under 64bit Linux which use Kernel Asynchronous I/O
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: ---
Target Release: ---
Reporter: Wim ten Have <wtenhave>
Assignee: Jeff Moyer <jmoyer>
QA Contact: Brian Brock <bbrock>
Docs Contact:
CC: jos, kinetik, petrides, riel, wtenhave
Doc Type: Bug Fix
Last Closed: 2005-05-18 13:28:30 UTC

Attachments:
Description: fix race condition in add_wait_queue_cond
Flags: none

Description Wim ten Have 2004-11-11 21:27:10 UTC
Description of problem:
Running our database server on 64-bit Linux distributions leads to a
rather nasty problem when Kernel Asynchronous I/O is enabled.  It
occurs when we run our 32-bit product on 64-bit distributions, as we
do on the AMD Opteron or XEON2, and it also reproduces easily when
running our 64-bit product ported to the IBM pSeries PowerPC architecture.

Our database server queues I/O through aio_read() and aio_write();
/lib/tls/librtkaio.so translates these operations to the Linux KAIO
io_*() API.

Every so often, when our database server reaches a specific idle state
or when it knows that multiple asynchronous I/Os are queued, we poll
our queued aiocbs with aio_error() to obtain the return value of each
I/O call.
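
For readers unfamiliar with the pattern described above, a minimal C
sketch of queueing one read with aio_read() and polling it with
aio_error() might look like the following.  This is not the reporter's
code: the helper name queue_and_poll_read and the 1 ms poll interval
are assumptions made for illustration.  POSIX specifies that
aio_error() delivers EINPROGRESS as its return value while a request is
still pending.  On older glibc releases the program has to be linked
with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: queue one asynchronous read of up to `len` bytes
 * from offset 0 of `path`, then poll it until it leaves EINPROGRESS.
 * Returns the byte count reported by aio_return(), or -1 on error. */
ssize_t queue_and_poll_read(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) {          /* request could not be queued */
        close(fd);
        return -1;
    }

    /* Poll loop: with the bug in this report, a loop of this shape
     * would see EINPROGRESS forever and never terminate. */
    struct timespec pause = { 0, 1000000L };   /* 1 ms, arbitrary */
    int status;
    while ((status = aio_error(&cb)) == EINPROGRESS)
        nanosleep(&pause, NULL);

    /* Collect the final result exactly once after completion. */
    ssize_t n = aio_return(&cb);
    if (status != 0) {
        errno = status;
        n = -1;
    }
    close(fd);
    return n;
}
```

A caller would pass a readable file and a buffer, for example
queue_and_poll_read("/etc/hostname", buf, sizeof buf), and treat a
return of -1 as a failed request.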

Soon after queueing many I/Os into the OS, aio_error() returns -1
with errno indicating EINPROGRESS.  Our polling loop never exits:
the I/Os never seem to complete and EINPROGRESS is returned
forever.
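
One way to keep a caller from spinning forever in this situation is to
wait with a deadline instead of polling unconditionally.  The sketch
below uses the POSIX aio_suspend() call with a timeout; the helper name
wait_for_aio and its return convention are assumptions for
illustration.  It does not fix the kernel problem described here, it
only lets the application notice and report a request that is still
pending after the deadline.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper: wait up to `timeout_ms` milliseconds for one
 * queued request.
 * Returns 0 if the request completed (check aio_error()/aio_return()
 * afterwards), 1 if it was still pending at the deadline, and -1 on
 * any other failure, including interruption by a signal (EINTR). */
int wait_for_aio(const struct aiocb *cb, long timeout_ms)
{
    const struct aiocb *const list[1] = { cb };

    struct timespec ts;
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    if (aio_suspend(list, 1, &ts) == 0)
        return 0;                 /* the request has completed */
    if (errno == EAGAIN)
        return 1;                 /* timed out: still EINPROGRESS */
    return -1;
}
```

A return value of 1 would correspond to the symptom in this report: the
request stays in EINPROGRESS no matter how long the caller waits.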

If you SIGKILL (kill -9) the responsible process, it ends up as a
zombie that is stuck in the kernel and unkillable.  A system reboot is
required to remove the process from the platform.
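
To confirm the state of such a process after the kill -9, the state
field of /proc/<pid>/stat can be inspected.  The following C sketch is
a diagnostic aid only; the helper name proc_state is an assumption, and
it relies on the Linux-specific layout documented in proc(5), where the
single-character state follows the parenthesised command name ('Z' is
a zombie, 'D' is uninterruptible sleep).

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return the one-character process state from
 * /proc/<pid>/stat, or '?' if it cannot be read or parsed. */
char proc_state(pid_t pid)
{
    char path[64];
    char line[512];

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';

    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return '?';

    /* The command name may itself contain spaces or ')', so locate the
     * last ')' in the line; the state character follows it. */
    char *p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}
```

Calling proc_state() with the PID of the killed database process would
show whether it is reported as 'Z', 'D', or something else.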

Version-Release number of selected component (if applicable):
Linux RHEL 3.0
Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
Kernel 2.4.21-20.ELsmp on an x86_64
Linux aseamd2 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:34:58 EDT 2004
x86_64 x86_64 x86_64 GNU/Linux
rpm -qf /lib/tls/librtkaio-2.3.2.so
glibc-2.3.2-95.27

How reproducible:
Please contact Ghaf Toorani <gtoorani> to obtain a small
distribution of our product, which includes a small reproduction
script.  It is available for Intel (AMD Opteron) or IBM pSeries (PowerPC).

Steps to Reproduce:
1. Install the product
2. Run the shell script
3. kill -9 the process whose aio_error() calls keep reporting EINPROGRESS
  
Actual results:
Kernel Asynchronous I/O does not work under 64-bit implementations:
queued I/Os never complete and the process cannot be killed.

Expected results:


Additional info:
Please contact Ghaf Toorani <gtoorani> to obtain additional
results.

Comment 1 Jeff Moyer 2004-11-12 22:00:52 UTC
I sent mail asking for the reproducer.  Any further info should be posted to the
bugzilla, not communicated in private email.

Comment 2 Jeff Moyer 2004-12-10 16:07:27 UTC
Created attachment 108324 [details]
fix race condition in add_wait_queue_cond

The attached patch fixes the problem for me.  Please verify this works in your
environment.  I'll be posting the patch for internal review shortly.

Comment 3 Ernie Petrides 2005-01-06 19:22:04 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.6.EL).


Comment 4 Jeff Moyer 2005-04-08 13:17:52 UTC
*** Bug 132494 has been marked as a duplicate of this bug. ***

Comment 5 Tim Powers 2005-05-18 13:28:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html