512100 – libaio reports success and returns incorrect data when interleaved with fork()


Keywords:
Status:	CLOSED DUPLICATE of bug 471613
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jeff Moyer
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-07-16 10:36 UTC by Marek Dopiera
Modified:	2009-07-17 13:36 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-07-17 13:36:06 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Test program to reproduce the bug. (2.42 KB, text/plain) 2009-07-16 10:36 UTC, Marek Dopiera

Description Marek Dopiera 2009-07-16 10:36:54 UTC

Created attachment 353967
Test program to reproduce the bug.

Description of problem:
Reads submitted through libaio sometimes return incorrect data when fork() is called concurrently, even though success is reported.

Version-Release number of selected component (if applicable):
Checked on CentOS 5.2.
uname -a
Linux w-50 2.6.18-128.2.1.el5 #1 SMP Tue Jul 14 06:36:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
libaio: 0.3.106
and on
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
uname -a:
Linux beta12 2.6.18-92.1.6.1 #1 SMP Mon Aug 25 12:16:35 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
libaio: 0.3.106

How reproducible:
Interleave fork() calls with read submissions through libaio. The attached test program reproduces the bug.

Its input should be a 4 GB file filled with zeros. It launches 2 new threads, and the following three tasks run concurrently:
1:
keeps calling fork() and wait()

2:
allocates a buffer, fills it with 0xff and submits reads from the test file with a callback set on completion (loops until the end of file)

3:
calls io_queue_run() in a loop to run the callbacks; when no error is reported, each callback checks that the buffer contains only zeros, and the program exit()s with a "corruption found" message if it does not
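
To illustrate the check described in tasks 2 and 3, here is a minimal sketch. It is not the attached test program: the names poison_buffer and first_nonzero are hypothetical, the 512-byte BUF_SIZE is an assumption taken from comment 3, and the libaio submission and callback wiring are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 512   /* assumed buffer size (comment 3 mentions 512 bytes) */
#define POISON   0xff  /* pattern written before each read is submitted */

/* Fill the buffer with the poison pattern so that stale contents
 * can be told apart from the zeros expected from the test file. */
void poison_buffer(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, POISON, len);
}

/* Return the offset of the first non-zero byte, or -1 if the buffer
 * holds only zeros. A completion callback could call this after a
 * read that reported success. */
long first_nonzero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return (long)i;
    }
    return -1;
}
```

A callback that sees a return value other than -1 after a successful read would print "corruption found" and exit, which matches the behaviour described above.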

Steps to Reproduce:
1. gcc -o aiotest aiotest.c -lpthread -laio
2. dd if=/dev/zero of=test bs=1048576 count=4096 oflag=direct
3. ./aiotest test
  
Actual results:
Terminates with a "corruption found" message after a nondeterministic amount of time, usually about a second.

Expected results:
After reading the whole 4 GB file, it should loop forever, forking and waiting.

Additional info:
When the loop of fork() calls is replaced with an empty loop, everything works correctly.
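
For reference, a fork()/wait() loop of the kind described in task 1 might look roughly like the sketch below. This is an assumption about its shape, not the attached code: the function name fork_wait_loop is hypothetical, and the loop is bounded by an iteration count, whereas the test program loops indefinitely.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork `iterations` children one at a time and reap each of them.
 * Returns the number of children that exited cleanly, or -1 if
 * fork() fails. */
int fork_wait_loop(int iterations)
{
    assert(iterations >= 0);
    int reaped = 0;

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;          /* fork failed */
        if (pid == 0)
            _exit(0);           /* child: exit immediately */

        int status = 0;
        /* parent: wait for the child just created */
        if (wait(&status) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            reaped++;
    }
    return reaped;
}
```

The child does no work at all; the fork() call itself is what is being interleaved with the reads.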

Comment 1 Jeff Moyer 2009-07-16 13:43:08 UTC

This sounds like a duplicate of bug 471613.  Please try a kernel from here:
  http://people.redhat.com/dzickus/el5/158.el5/

Thanks,
Jeff

Comment 2 Marek Dopiera 2009-07-17 07:30:46 UTC

That's correct: this is a duplicate of the bug you mention. The kernel from the link above works correctly.

Thanks

Comment 3 Jeff Moyer 2009-07-17 13:36:06 UTC

OK, thanks for taking the time to confirm that, and thanks also for the reproducer.  While we are addressing this for RHEL 5, the upstream status of this patch is unclear.  Linus has rejected the fixes that we are incorporating, and I'm not sure at this point whether the bug will be fixed at all in upstream kernels.

If you're looking for ways to avoid this in the future, I would suggest not mixing fork(), pthreads and direct I/O to buffers that are smaller than the host system's page size.  In other words, if your reproducer program used getpagesize()-sized buffers instead of 512-byte buffers, you wouldn't see this problem.
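
A minimal sketch of that suggestion: allocate each I/O buffer as one full page. The helper name alloc_page_buffer is hypothetical. Aligning the buffer to a page boundary with posix_memalign() is an added assumption (direct I/O generally expects aligned buffers, and alignment keeps the buffer from sharing a page with other data); the comment above only speaks about the buffer size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate one page-sized, page-aligned buffer and fill it with 0xff.
 * Stores the buffer length in *len_out when len_out is not NULL.
 * Returns NULL on allocation failure; release the buffer with free(). */
void *alloc_page_buffer(size_t *len_out)
{
    size_t page = (size_t)getpagesize();
    void *buf = NULL;

    if (posix_memalign(&buf, page, page) != 0)
        return NULL;

    /* sanity check: the buffer starts on a page boundary */
    assert(((uintptr_t)buf % page) == 0);

    memset(buf, 0xff, page);    /* same poison pattern as the reproducer */
    if (len_out != NULL)
        *len_out = page;
    return buf;
}
```

A buffer obtained this way would replace the 512-byte buffer handed to the read submission; the rest of the program would stay the same.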

*** This bug has been marked as a duplicate of bug 471613 ***
