Red Hat Bugzilla – Bug 132494
POSIX Asynchronous IO support is unstable
Last modified: 2007-11-30 17:07:04 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3

Description of problem:
POSIX Asynchronous IO support is unstable. This is RHEL 3AS with Update 2 applied, and the latest available RHEL 3AS kernel.

A database product I'm developing uses the POSIX Asynchronous IO interface provided in glibc (via librt or librtkaio as I understand it) and has been regularly hanging or crashing on my RHEL 3AS test box.

Initially I observed the application segfaulting during AIO operations--a number of valid operations would be performed, and then a random (valid) IO operation would fail as aio_error() was called. After confirming this was not a bug within the application (note that AIO has been working successfully on other UNIX and Linux platforms for some time, including earlier RHEL releases that used the older librt userspace AIO implementation), I updated the kernel to kernel-smp-2.4.21-20.EL (from 2.4.21-15.EL). This resolved the segfault problem, but one or more issues remain.

I've written a small test case that will reproducibly hang inside wait_for_all_aios, leaving the process in an uninterruptible ('D') state.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-20.EL

How reproducible:
Always

Steps to Reproduce:
1. Compile aiotest.c using 'cc -o aiotest aiotest.c -lrt'
2. Run aiotest: ./aiotest
3. Monitor the state of the process for a few minutes.

Actual Results:
The aiotest process will hang inside wait_for_all_aios--usually within a minute or so. When this occurs, the process enters an uninterruptible ('D') state.

Expected Results:
aiotest should run, continuously overwriting the 8MB test file, until interrupted by the user.

Additional info:
Created attachment 103814 [details] POSIX AIO test case
A couple of further notes:
1. The segfault during aio_error() that I mentioned only occurred in 32-bit code running on the x86_64 platform.
2. The problem I'm reporting occurs when aiotest is built as either a 32-bit (x86) or a 64-bit (x86_64) binary. The same code, built on another x86_64 machine running Fedora Core 2, runs fine in both 32-bit and 64-bit modes.
Created attachment 103815 [details] POSIX AIO test case Remove the sleep() I left in by accident while tracking another bug.
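The attached aiotest.c is not reproduced in this report. As a rough illustration only, the kind of loop described above (submit a batch of aio_write() requests, wait for all of them, then check aio_error() and aio_return()) might look like the sketch below. The function names aio_overwrite_once and wait_for_batch, and the NREQ and BLOCK values, are assumptions made for this example and are not taken from the attachment.

```c
/* Hypothetical sketch of a POSIX AIO overwrite pass; NOT the attached aiotest.c. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NREQ  8             /* requests submitted per batch (assumed value) */
#define BLOCK (64 * 1024)   /* bytes written per request (assumed value) */

/* Block until none of the n requests is still EINPROGRESS.
 * Returns 0 on success, -1 if aio_suspend() fails for a reason other than EINTR. */
static int wait_for_batch(struct aiocb *cbs, int n)
{
    const struct aiocb *list[NREQ];
    int i, pending;

    for (;;) {
        pending = 0;
        for (i = 0; i < n; i++)
            if (aio_error(&cbs[i]) == EINPROGRESS)
                list[pending++] = &cbs[i];
        if (pending == 0)
            return 0;
        /* Sleeps until at least one listed request completes. */
        if (aio_suspend(list, pending, NULL) != 0 && errno != EINTR)
            return -1;
    }
}

/* Overwrite the first `total` bytes of `path` once using batches of aio_write().
 * Returns 0 if every request completed with a full write, -1 otherwise. */
int aio_overwrite_once(const char *path, size_t total)
{
    static char buf[NREQ][BLOCK];
    struct aiocb cbs[NREQ];
    off_t off = 0;
    int fd, i, n, rc = 0;

    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof buf);

    while (rc == 0 && (size_t)off < total) {
        /* Submit up to NREQ writes at consecutive offsets. */
        for (n = 0; n < NREQ && (size_t)off < total; n++) {
            size_t left = total - (size_t)off;

            memset(&cbs[n], 0, sizeof cbs[n]);
            cbs[n].aio_fildes = fd;
            cbs[n].aio_buf    = buf[n];
            cbs[n].aio_nbytes = left < BLOCK ? left : BLOCK;
            cbs[n].aio_offset = off;
            if (aio_write(&cbs[n]) != 0) {
                rc = -1;
                break;
            }
            off += (off_t)cbs[n].aio_nbytes;
        }
        if (wait_for_batch(cbs, n) != 0)
            rc = -1;
        /* Collect results; aio_return() is only called once a request has finished. */
        for (i = 0; i < n; i++)
            if (aio_error(&cbs[i]) != 0 ||
                aio_return(&cbs[i]) != (ssize_t)cbs[i].aio_nbytes)
                rc = -1;
    }
    close(fd);
    return rc;
}
```

A caller would invoke aio_overwrite_once() repeatedly on an 8MB file to approximate the behaviour described in the report. On older glibc releases the program must be linked with -lrt, as in the compile command given in the steps to reproduce.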
What version of glibc are you using? I notice you're using an SMP kernel; can I assume that means this is a multiprocessor system? If so, how many processors? I have been unable to reproduce this bug, thus far.
Thanks for the response, Jeff. I've got glibc 2.3.2-95.20 installed (both x86_64 and i686 versions). The machine I'm testing this on is a 2-way x86_64 box--it's quite possible that the problem is x86_64 specific (even when running the test case built using the -m32 switch for gcc) as I don't recall seeing this problem on one of our 2-way i686 RHEL 3AS boxes (though I'll double-check that now).
It does look like it's x86_64 specific. I've just tried my AIO test on a couple of i686 RHEL 3AS boxes (kernel-smp-2.4.21-4.EL, glibc-2.3.2-95.20) and have been unable to reproduce the problem so far.
To clarify, what did you mean in your second point of comment 2? I read that as you built it on a Fedora Core system, then copied the resultant binary over to your RHEL3 box, and it ran fine. Is this the case?
Ah, yes, sorry--my comment #2 is a bit confusing. Two of the cases I've tested are: building a 32-bit (-m32) and a 64-bit version of the test case on a RHEL 3AS x86_64 box and running it there, and building a 32-bit and a 64-bit version on an x86_64 FC2 box and running it there, i.e. in each case I am running the test case on the same host it is compiled on.

RHEL3 AS 32-bit: fail
RHEL3 AS 64-bit: fail
FC2 32-bit: okay
FC2 64-bit: okay

I recently built a 2.6.9-rc1-mm4 kernel on the RHEL3 AS box and booted it for testing purposes. Using this kernel, the test case does not exhibit the problem I'm seeing with the supported RHEL3 AS kernel, but another problem is present: the worker thread created by librtkaio spins (using 100% of one CPU) inside handle_kernel_aio in librtkaio.so--this problem occurs on all architectures I've tested. I suspect this is caused by an incompatibility between librtkaio and the AIO system calls in the 2.6 kernel--the released SuSE Linux Enterprise 9 has the same problem, but FC2 does not, because librtkaio is not built and shipped with FC2.
We have observed the same problem on a two-cpu machine running 2.4.21-20. The problem has been observed during performance testing of Sun HADB (High Availability Data Base), and by running Matthew Gregan's test program. We also have our own test program which also recreates the problem. I will make an attachment with that.

System and CPU info follows:

Linux sun1-24 2.4.21-20.ELsmp #1 SMP Thu Sep 2 23:50:15 EEST 2004 x86_64 x86_64 x86_64 GNU/Linux

/proc/cpuinfo

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 850
physical id : 0
siblings : 1
stepping : 10
cpu MHz : 2387.476
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4757.91
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 850
physical id : 0
siblings : 1
stepping : 10
cpu MHz : 2387.476
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4771.02
TLB size : 1088 4K pages
clflush size : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
Created attachment 112750 [details] Another test case to provoke the bug
The process will hang and become uninterruptible after a few minutes.
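One way to watch for the hang without eyeballing ps output is to read the state letter from /proc/<pid>/stat. The helper below is a minimal sketch for illustration; the name proc_state is hypothetical and it is not part of either attached test program. It assumes the usual Linux layout of that file, where the state letter follows the parenthesised command name.

```c
/* Hypothetical helper: report a process's state letter as shown in /proc. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the state letter from /proc/<pid>/stat (for example 'R', 'S' or 'D'),
 * or '?' if the file cannot be read or parsed. */
char proc_state(pid_t pid)
{
    char path[64], line[512], *p;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    p = fgets(line, sizeof line, f);
    fclose(f);
    if (p == NULL)
        return '?';

    /* The command name is wrapped in parentheses and may itself contain ')',
     * so locate the last ')' on the line; the state letter follows it. */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}
```

Polling proc_state() with the pid of the running test every few seconds and logging when it stays at 'D' would capture the moment the process becomes uninterruptible.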
Can you look at the sysrq-t output when the problem occurs? If the system is using the in-kernel AIO routines (via librtkaio), then you could be running into a race condition that has been fixed in the kernel. U5 will have a fix for this in the kernel. Could you try a newer kernel? I've uploaded an x86_64 smp kernel to: http://people.redhat.com/jmoyer/.bz132494/ Thanks.
Thanks for the reproducer(s). I've recreated the problem on my x86_64 with 2.4.21-20. I then upgraded to 2.4.21-31, and the problem hasn't occurred since. Please test this new version. If it works for you, I will close this bug as a duplicate of bug #138905.
My tests have been running successfully for over 2 days, now. I'm closing this as a duplicate of 138905. *** This bug has been marked as a duplicate of 138905 ***
A fix for this problem was committed to the RHEL3 U5 patch pool on 6-Jan-2005 (in kernel version 2.4.21-27.6.EL).
Created attachment 113315 [details] Call trace when running the aiotest program on an unpatched system
Created attachment 113316 [details] More tracing when running aiotest
The patched kernel seems to fix our problem. I have created a couple of attachments which show the output from sysrq-t when our customer ran aiotest on the kernel which caused the problems.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html
Does RHEL 2.1AS have the same issue? (kernel: 2.4.9-e.62smp, glibc: glibc-2.2.4-32.20) We've experienced strange AIO behavior on RHEL 3 (aio_return returned 0) that was fixed by the errata above, but there is no fix for 2.1. The bug is similar to bug #107015.