Red Hat Bugzilla – Bug 138905
Unkillable processes under 64bit Linux which use Kernel Asynchronous I/O
Last modified: 2007-11-30 17:07:05 EST
Description of problem:
Running our database server on 64bit Linux distributions leads to a
rather nasty problem when Kernel Asynchronous I/O is enabled. This
happens when we run our 32bit product on 64bit distributions, as
done for the AMD Opteron or XEON2, but it also reproduces easily when
running our 64bit product ported to the IBM Pseries PowerPC architecture.
Our database server queues I/O with aio_read() and aio_write() operations,
which /lib/tls/librtkaio.so translates to the Linux KAIO io_*() API.
Every so often, when our database server is in a specific idle state or
when it is known that multiple asynchronous I/Os are queued, we poll
our queued aiocb's with aio_error() to obtain the specific call I/O
Soon after queueing many I/Os into the OS, aio_error() reports
EINPROGRESS. Our polling loop never exits, i.e. the I/Os never seem to
complete and EINPROGRESS is returned on every poll.
If you try to SIGKILL (kill -9) the responsible process, you will
find that it ends up as a zombie, stuck in the kernel and unkillable,
i.e. a system reboot is required to remove the process from the system.
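
For readers unfamiliar with the pattern, the polling described above can
be sketched roughly as follows. This is not the reporter's code: the
names queue_read, poll_until_done and demo_roundtrip are hypothetical,
the 1 ms pause and the poll budget are arbitrary choices, and the sketch
uses only the standard POSIX AIO calls from <aio.h> rather than anything
specific to librtkaio. Note that aio_error() reports the request status
through its return value (0, EINPROGRESS, or an error number).

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fill in a caller-owned aiocb and queue a read.
 * Returns 0 if the request was queued, -1 (with errno set) otherwise. */
int queue_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
    assert(cb != NULL && buf != NULL);
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    return aio_read(cb);
}

/* Hypothetical helper: poll one request with aio_error() at most
 * max_polls times, sleeping 1 ms between polls. Returns 0 once the
 * request has completed successfully, EINPROGRESS if it is still
 * pending when the budget runs out, or whatever other value
 * aio_error() reported. While the result is EINPROGRESS, the aiocb
 * and its buffer must stay valid. */
int poll_until_done(struct aiocb *cb, int max_polls)
{
    const struct timespec pause = { 0, 1000000 };   /* 1 ms */
    int status = EINPROGRESS;

    assert(cb != NULL);
    for (int i = 0; i < max_polls; i++) {
        status = aio_error(cb);
        if (status != EINPROGRESS)
            break;
        nanosleep(&pause, NULL);
    }
    return status;
}

/* Hypothetical demo: write a few bytes to path, read them back through
 * the helpers above, and return the byte count, or -1 on any failure. */
ssize_t demo_roundtrip(const char *path)
{
    static const char msg[] = "hello";
    const size_t len = sizeof msg - 1;
    char buf[sizeof msg] = { 0 };
    struct aiocb cb;
    ssize_t n = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, msg, len) == (ssize_t)len &&
        queue_read(&cb, fd, buf, len, 0) == 0) {
        int status = poll_until_done(&cb, 5000);
        if (status == EINPROGRESS) {
            /* cb and buf live on this stack frame, so block here
             * rather than return while the request is pending. */
            const struct aiocb *const list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            status = aio_error(&cb);
        }
        n = aio_return(&cb);            /* reap the request exactly once */
        if (status != 0 || n != (ssize_t)len || memcmp(buf, msg, len) != 0)
            n = -1;
    }
    close(fd);
    unlink(path);
    return n;
}
```

On a working system a call such as demo_roundtrip("/tmp/aio_poll_sketch.tmp")
(the path is only an example) should leave the EINPROGRESS state almost
immediately. The symptom reported here corresponds to poll_until_done()
exhausting its budget while aio_error() still reports EINPROGRESS. On
glibc versions older than 2.34 the program needs to be linked with -lrt.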
Version-Release number of selected component (if applicable):
Linux RHEL 3.0
Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
Kernel 2.4.21-20.ELsmp on an x86_64
Linux aseamd2 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:34:58 EDT 2004
x86_64 x86_64 x86_64 GNU/Linux
rpm -qf /lib/tls/librtkaio-2.3.2.so
Please contact Ghaf Toorani <firstname.lastname@example.org> to obtain a small
distribution for our product which includes a small reproduction
script. This could be for Intel (AMD Opteron) or IBM Pseries (PowerPC)
Steps to Reproduce:
1. Install product
2. Run shell script
3. kill -9 the aio_error() -> EINPROGRESS process
Kernel Asynchronous I/O under 64bit implementations does not work.
Please contact Ghaf Toorani <email@example.com> to obtain additional
I sent mail asking for the reproducer. Any further info should be posted to the
bugzilla, not communicated in private email.
Created attachment 108324
fix race condition in add_wait_queue_cond
The attached patch fixes the problem for me. Please verify this works in your
environment. I'll be posting the patch for internal review shortly.
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.6.EL).
*** Bug 132494 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.