Description of problem:
Running our database server on 64-bit Linux distributions leads to a rather nasty problem when Kernel Asynchronous I/O (KAIO) is enabled. This includes running our 32-bit product on 64-bit distributions, as done for the AMD Opteron or XEON2, but it also reproduces easily when running our 64-bit product ported to the IBM pSeries PowerPC architecture.

Our database server queues I/O via aio_read() and aio_write(), which /lib/tls/librtkaio.so translates to the Linux KAIO io_*() API. Every so often, when the database server is in a specific idle state or when it knows that multiple asynchronous I/Os are queued, we poll our queued aiocbs with aio_error() to obtain the return value of each I/O call.

Soon after queueing many I/Os into the OS, aio_error() returns -1 with errno indicating EINPROGRESS. Our polling loop never exits, i.e. the I/Os never seem to complete and EINPROGRESS is returned forever. If you try to SIGKILL (kill -9) the responsible process, you will find that it turns into a zombie that is locked in the kernel and unkillable, i.e. a system reboot is required to remove the process from the platform.

Version-Release number of selected component (if applicable):
Linux RHEL 3.0
Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
Kernel 2.4.21-20.ELsmp on an x86_64
Linux aseamd2 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:34:58 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux
rpm -qf /lib/tls/librtkaio-2.3.2.so
glibc-2.3.2-95.27

How reproducible:
Please contact Ghaf Toorani <gtoorani> to obtain a small distribution of our product, which includes a small reproduction script. This could be for Intel (AMD Opteron) or IBM pSeries (PowerPC).

Steps to Reproduce:
1. Install product
2. Run shell script
3. kill -9 the aio_error() -> EINPROGRESS process

Actual results:
Kernel Asynchronous I/O under 64-bit implementations does not work.

Expected results:

Additional info:
Please contact Ghaf Toorani <gtoorani> to obtain additional results.
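For readers without access to the reproducer, the queue-then-poll pattern described above can be sketched roughly as follows. This is not the product's code: the names poll_aiocb and write_and_wait, the 10 ms pause and the 500-poll budget are all assumptions made for illustration. POSIX specifies that aio_error() hands back EINPROGRESS as its return value, so the sketch tests the return value rather than errno. The poll is bounded so that a request stuck in EINPROGRESS, as in this report, is given up on instead of spinning forever.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: poll one queued request until it leaves EINPROGRESS
 * or the poll budget is exhausted. Returns 0 if the request completed,
 * a positive errno value if it failed, or -1 if it is still in progress. */
static int poll_aiocb(const struct aiocb *cb, int max_polls)
{
    assert(cb != NULL);
    struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms between polls */

    for (int i = 0; i < max_polls; i++) {
        int status = aio_error(cb);
        if (status != EINPROGRESS)
            return status;
        nanosleep(&pause, NULL);
    }
    return -1; /* never left EINPROGRESS: the symptom in this report */
}

/* Hypothetical helper: queue a single aio_write() and wait for it.
 * Returns the number of bytes written, or -1 on any failure. */
static ssize_t write_and_wait(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    struct aiocb *cb = calloc(1, sizeof *cb); /* zeroed control block */
    if (cb == NULL) {
        close(fd);
        return -1;
    }
    cb->aio_fildes = fd;
    cb->aio_buf = (void *)data;
    cb->aio_nbytes = len;
    cb->aio_offset = 0;

    if (aio_write(cb) != 0) { /* request could not be queued */
        free(cb);
        close(fd);
        return -1;
    }

    if (poll_aiocb(cb, 500) == -1) {
        /* Still in flight: deliberately leak cb and fd, because the
         * control block must not be freed while the request is queued. */
        return -1;
    }

    /* aio_return() yields the byte count, or -1 if the request failed. */
    ssize_t written = aio_return(cb);
    free(cb);
    close(fd);
    return written;
}
```

On a working system, write_and_wait("/tmp/aio_poll_sketch.tmp", "hello", 5) should return 5. On glibc versions that keep the AIO functions in librt, the program has to be linked with -lrt.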
I sent mail asking for the reproducer. Any further info should be posted to the bugzilla, not communicated in private email.
Created attachment 108324
fix race condition in add_wait_queue_cond

The attached patch fixes the problem for me. Please verify this works in your environment. I'll be posting the patch for internal review shortly.
A fix for this problem was committed to the RHEL3 U5 patch pool this afternoon (in kernel version 2.4.21-27.6.EL).
*** Bug 132494 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html