Bug 132494 - POSIX Asynchronous IO support is unstable
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64 Linux
Priority: medium
Severity: high
Assigned To: Jeffrey Moyer
QA Contact: Brian Brock
Reported: 2004-09-13 23:40 EDT by Matthew Gregan [:kinetik]
Modified: 2007-11-30 17:07 EST
Last Closed: 2005-04-08 09:17:28 EDT


Attachments:
POSIX AIO test case (863 bytes, text/plain)
2004-09-13 23:41 EDT, Matthew Gregan [:kinetik]
POSIX AIO test case (851 bytes, text/plain)
2004-09-13 23:49 EDT, Matthew Gregan [:kinetik]
Another test case to provoke the bug (2.88 KB, text/plain)
2005-04-06 04:40 EDT, Ole-Hjalmar Kristensen
Call trace when running the aiotest program on an unpatched system (15.22 KB, text/plain)
2005-04-18 04:29 EDT, Ole-Hjalmar Kristensen
More tracing when running aiotest (32.59 KB, text/plain)
2005-04-18 04:30 EDT, Ole-Hjalmar Kristensen

Description Matthew Gregan [:kinetik] 2004-09-13 23:40:18 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7)
Gecko/20040803 Firefox/0.9.3

Description of problem:
POSIX Asynchronous IO support is unstable.  This is RHEL 3AS with
Update 2 applied, and the latest available RHEL 3AS kernel.

A database product I'm developing uses the POSIX Asynchronous IO
interface provided in glibc (via librt or librtkaio as I understand
it) and has been regularly hanging or crashing on my RHEL 3AS test
box.  Initially I observed the application segfaulting during AIO
operations--a number of valid operations would be performed, and then
a random (valid) IO operation would fail as aio_error() was called. 
After confirming this was not a bug within the application (note that
AIO has been working successfully on other UNIX and Linux platforms
for some time including earlier RHEL releases that used the older
librt userspace AIO implementation), I updated the kernel to
kernel-smp-2.4.21-20.EL (from 2.4.21-15.EL).  This resolved the
segfault problem, but one or more issues remain.

I've written a small test-case that will reproducibly hang inside
wait_for_all_aios, leaving the process in an uninterruptible ('D') state.


Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-20.EL

How reproducible:
Always

Steps to Reproduce:
1. Compile aiotest.c using 'cc -o aiotest aiotest.c -lrt'
2. Run aiotest: ./aiotest
3. Monitor the state of the process for a few minutes.
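
Step 3 can be automated rather than watched by hand. The following is a minimal sketch, assuming a Linux /proc filesystem whose /proc/<pid>/stat line has the usual "pid (comm) state ..." layout. proc_state() is a hypothetical helper, not part of the attached test case. It returns the one-letter process state, so a hung aiotest would be expected to report 'D'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the one-letter state of `pid` ('R', 'S',
 * 'D', ...) as reported in /proc/<pid>/stat, or '\0' if it cannot be read. */
char proc_state(pid_t pid)
{
    char path[64], line[512];
    FILE *fp;
    char *p;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '\0';
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return '\0';
    }
    fclose(fp);

    /* The command name is wrapped in parentheses and may itself contain
     * ')', so search from the right; the state follows ") ". */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}
```

Polling proc_state(pid) once a second and logging the first time it returns 'D' would be one way to timestamp the hang.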


Actual Results:  The aiotest process will hang inside
wait_for_all_aios--usually within a minute or so.  When this occurs,
the process enters an uninterruptible ('D') state.

Expected Results:  aiotest should run, continuously overwriting the
8MB test file, until interrupted by the user.
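
For readers without access to the attachment, the general pattern under test looks roughly like the sketch below. It is not the attached aiotest.c: write_chunks_async(), MAX_REQS, and the sizes are made up for illustration. It queues several aio_write() requests, waits for each one with aio_suspend(), and then checks aio_error() and aio_return(). On older glibc it needs -lrt, as in step 1.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_REQS 16  /* illustrative cap on requests in flight */

/* Hypothetical helper: queue `nreqs` asynchronous writes of `chunk` bytes
 * each to `path`, wait for all of them, and verify every result.
 * Returns 0 on success, -1 on any failure. */
int write_chunks_async(const char *path, size_t chunk, int nreqs)
{
    struct aiocb cbs[MAX_REQS];
    char *buf;
    int fd, i, submitted = 0, rc = 0;

    assert(path != NULL);
    if (nreqs < 1 || nreqs > MAX_REQS || chunk == 0)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    buf = malloc(chunk);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', chunk);

    /* Submit the requests; all of them read from the same source buffer. */
    for (i = 0; i < nreqs; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = buf;
        cbs[i].aio_nbytes = chunk;
        cbs[i].aio_offset = (off_t)i * (off_t)chunk;
        cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal on completion */
        if (aio_write(&cbs[i]) != 0) {
            rc = -1;
            break;
        }
        submitted++;
    }

    /* Wait for every request that was accepted, then collect its result. */
    for (i = 0; i < submitted; i++) {
        const struct aiocb *const list[1] = { &cbs[i] };
        ssize_t n;
        int err;

        /* aio_suspend() may return early (e.g. EINTR); the loop re-checks. */
        while ((err = aio_error(&cbs[i])) == EINPROGRESS)
            (void)aio_suspend(list, 1, NULL);
        n = aio_return(&cbs[i]);  /* call once per completed request */
        if (err != 0 || n != (ssize_t)chunk)
            rc = -1;
    }

    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling write_chunks_async("testfile", 512 * 1024, 16) in a loop would overwrite an 8MB file repeatedly, similar in spirit to the behaviour described under Expected Results.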

Additional info:
Comment 1 Matthew Gregan [:kinetik] 2004-09-13 23:41:28 EDT
Created attachment 103814
POSIX AIO test case
Comment 2 Matthew Gregan [:kinetik] 2004-09-13 23:46:29 EDT
A couple of further notes:

1.  The segfault during aio_error() that I mentioned only occurred in
    32-bit code running on the x86_64 platform.

2.  The problem I'm reporting occurs whether aiotest is built as a
    32-bit (x86) or 64-bit (x86_64) binary.  The same code, built on
    another x86_64 machine running Fedora Core 2, runs fine in both
    32-bit and 64-bit modes.
Comment 3 Matthew Gregan [:kinetik] 2004-09-13 23:49:36 EDT
Created attachment 103815
POSIX AIO test case

Remove the sleep() I left in by accident while tracking another bug.
Comment 4 Jeffrey Moyer 2004-09-20 15:42:18 EDT
What version of glibc are you using?  I notice you're using an SMP
kernel; can I assume that means this is a multiprocessor system?  If
so, how many processors?  I have been unable to reproduce this bug,
thus far.
Comment 5 Matthew Gregan [:kinetik] 2004-09-20 17:37:34 EDT
Thanks for the response, Jeff.  I've got glibc 2.3.2-95.20 installed
(both x86_64 and i686 versions).  The machine I'm testing this on is a
2-way x86_64 box--it's quite possible that the problem is x86_64
specific (even when running the test case built using the -m32 switch
for gcc) as I don't recall seeing this problem on one of our 2-way
i686 RHEL 3AS boxes (though I'll double-check that now).
Comment 6 Matthew Gregan [:kinetik] 2004-09-20 18:34:39 EDT
It does look like it's x86_64 specific.  I've just tried my AIO
test on a couple of i686 RHEL 3AS boxes (kernel-smp-2.4.21-4.EL,
glibc-2.3.2-95.20) and have been unable to reproduce the problem so far.
Comment 7 Jeffrey Moyer 2004-10-19 11:00:01 EDT
To clarify, what did you mean in your second point of comment 2?  I
read that as you built it on a Fedora Core system, then copied the
resultant binary over to your RHEL3 box, and it ran fine.  Is this the
case?
Comment 8 Matthew Gregan [:kinetik] 2004-10-19 15:19:44 EDT
Ah, yes, sorry--my comment #2 is a bit confusing.  Two of the cases
I've tested are: building a 32 (-m32) and 64 bit version of the
testcase on a RHEL 3AS x86_64 box and running it there, and building a
32 and 64 bit version on a x86_64 FC2 box and running it there, i.e.
in each case I am running the testcase on the same host it is compiled on.

RHEL3 AS 32-bit: fail
RHEL3 AS 64-bit: fail
FC2 32-bit: okay
FC2 64-bit: okay
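
When comparing the 32-bit and 64-bit rows above, it can help to confirm which mode a given binary was really built in (with or without -m32). A minimal sketch; pointer_bits() and report_build_mode() are hypothetical helper names, not part of the test case.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: width of a pointer in bits for this build,
 * e.g. 32 when compiled with -m32 and 64 for a native x86_64 build. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Print the build mode so it appears alongside the test output. */
void report_build_mode(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "built as %d-bit\n", pointer_bits());
}
```

Calling report_build_mode(stderr) at the start of the test program would record the mode in each run's log.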

I recently built a 2.6.9-rc1-mm4 kernel on the RHEL3 AS box and booted
it for testing purposes.  Using this kernel, the test case does not
exhibit the problem I'm seeing with the supported RHEL3 AS kernel, but
another problem is present: the worker thread created by librtkaio
spins (using 100% of one CPU) inside handle_kernel_aio in
librtkaio.so--this problem occurs on all architectures I've tested.  I
suspect this is caused by an incompatibility between librtkaio and the
AIO system calls in the 2.6 kernel--the released SuSE Linux Enterprise
9 has the same problem, but FC2 does not due to the fact that
librtkaio is not built and shipped with FC2.
Comment 9 Ole-Hjalmar Kristensen 2005-04-06 04:38:10 EDT
We have observed the same problem on a two-CPU machine running 2.4.21-20.
The problem has been observed during performance testing of Sun HADB (High
Availability Data Base), and by running Matthew Gregan's test program.
We also have our own test program that recreates the problem; I will
attach it.


System and CPU info follows:

Linux sun1-24 2.4.21-20.ELsmp #1 SMP Thu Sep 2 23:50:15 EEST 2004 x86_64 x86_64 x86_64 GNU/Linux

/proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 850
physical id     : 0
siblings        : 1
stepping        : 10
cpu MHz         : 2387.476
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 4757.91
TLB size        : 1088 4K pages
clflush size    : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 850
physical id     : 0
siblings        : 1
stepping        : 10
cpu MHz         : 2387.476
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 4771.02
TLB size        : 1088 4K pages
clflush size    : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp 
Comment 10 Ole-Hjalmar Kristensen 2005-04-06 04:40:55 EDT
Created attachment 112750
Another test case to provoke the bug

Process will hang and become uninterruptible after a few minutes.
Comment 11 Jeffrey Moyer 2005-04-06 10:14:31 EDT
Can you look at the sysrq-t output when the problem occurs?  If the system is
using the in-kernel AIO routines (via librtkaio), then you could be running into
a race condition that has since been fixed in the kernel.  U5 will include the
fix.
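
If no console keyboard is handy for Alt+SysRq+T, the same task dump can usually be requested from a program. This is a sketch under assumptions: it assumes the running kernel exposes /proc/sysrq-trigger (this report does not say whether the RHEL3 2.4.21 kernels do) and that the caller is root. send_sysrq() is a hypothetical helper.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a single SysRq command character (e.g. 't'
 * for the task list) to `trigger_path`, normally /proc/sysrq-trigger.
 * Returns 0 on success, -1 on failure. */
int send_sysrq(const char *trigger_path, char key)
{
    int fd;
    ssize_t n;

    assert(trigger_path != NULL);
    fd = open(trigger_path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, &key, 1);
    if (close(fd) != 0)
        return -1;
    return n == 1 ? 0 : -1;
}
```

Under those assumptions, send_sysrq("/proc/sysrq-trigger", 't') would request the task list, which ends up in the kernel log.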

Could you try a newer kernel?  I've uploaded an x86_64 smp kernel to:

  http://people.redhat.com/jmoyer/.bz132494/

Thanks.
Comment 12 Jeffrey Moyer 2005-04-06 17:48:48 EDT
Thanks for the reproducer(s).  I've recreated the problem on my x86_64 with
2.4.21-20.  I then upgraded to 2.4.21-31, and the problem hasn't occurred since.
Please test this new version.  If it works for you, I will close this bug as a
duplicate of bug #138905.
Comment 13 Jeffrey Moyer 2005-04-08 09:17:28 EDT
My tests have been running successfully for over 2 days, now.  I'm closing this
as a duplicate of 138905.

*** This bug has been marked as a duplicate of 138905 ***
Comment 14 Ernie Petrides 2005-04-08 15:34:14 EDT
A fix for this problem was committed to the RHEL3 U5 patch pool
on 6-Jan-2005 (in kernel version 2.4.21-27.6.EL).
Comment 15 Ole-Hjalmar Kristensen 2005-04-18 04:29:11 EDT
Created attachment 113315
Call trace when running the aiotest program on an unpatched system
Comment 16 Ole-Hjalmar Kristensen 2005-04-18 04:30:59 EDT
Created attachment 113316
More tracing when running aiotest
Comment 17 Ole-Hjalmar Kristensen 2005-04-18 04:35:49 EDT
The patched kernel seems to fix our problem. I have created a couple of
attachments which show the output from sysrq-t when our customer ran aiotest on
the kernel which caused the problems.
Comment 18 Tim Powers 2005-05-18 09:28:07 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html
Comment 19 Alexey Roytman 2005-05-23 07:20:03 EDT
Does RHEL 2.1AS have the same issue?
(kernel: 2.4.9-e.62smp, glibc: glibc-2.2.4-32.20)
We've experienced a strange behavior of AIO on RHEL 3 (aio_return returned 0)
that was fixed by the errata above, but for 2.1 there is no fix.

The bug is similar to bug #107015.
