Bug 512100

Summary:

libaio reports success and returns incorrect data when interleaved with fork()

Product:

Red Hat Enterprise Linux 5

Reporter:

Marek Dopiera <marek>

Component:

kernel

Assignee:

Jeff Moyer <jmoyer>

Status:

CLOSED DUPLICATE

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

medium

Priority:

low

Version:

5.2

CC:

dzickus, l_heldt, lwoodman, marek

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-07-17 13:36:06 UTC

Attachments:

Description	Flags
Test program to reproduce the bug.	none

Description Marek Dopiera 2009-07-16 10:36:54 UTC

Created attachment 353967
Test program to reproduce the bug.

Description of problem:
Reads submitted through libaio sometimes return incorrect data when the process calls fork() concurrently, even though success is reported.

Version-Release number of selected component (if applicable):
Checked on CentOS 5.2.
uname -a:
Linux w-50 2.6.18-128.2.1.el5 #1 SMP Tue Jul 14 06:36:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
libaio: 0.3.106
and on
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
uname -a:
Linux beta12 2.6.18-92.1.6.1 #1 SMP Mon Aug 25 12:16:35 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
libaio: 0.3.106

How reproducible:
Interleave fork() calls with read submissions through libaio. The attached test program reproduces the bug.

Its input should be a 4 GB file filled with zeros. It launches 2 new threads; the following three activities then run concurrently:
1:
keeps calling fork() and wait()

2:
allocates a buffer, fills it with 0xff and submits reads from the test file with a callback set on completion (loops until the end of file)

3:
calls io_queue_run() in a loop to run the callbacks; each callback checks (when no error is reported) that the buffer contains only zeros, and calls exit() with a "corruption found" message if it does not
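
The check performed in the completion callback can be sketched as follows. This is only an illustration of the idea described above, not the attached aiotest.c: the names poison_buffer() and first_nonzero() and the BUF_SIZE value are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 512  /* assumed read size, for illustration only */

/* Fill the buffer with 0xff before submitting a read, so that bytes
 * which were never overwritten by the read are easy to recognise. */
static void poison_buffer(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, 0xff, len);
}

/* Return the offset of the first non-zero byte, or -1 if the whole
 * buffer is zero. The input file contains only zeros, so any non-zero
 * byte after a successful read indicates corruption. */
static long first_nonzero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return (long)i;
    }
    return -1;
}
```

In a completion callback, a result other than -1 from first_nonzero() on a read that reported no error would correspond to the "corruption found" case.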

Steps to Reproduce:
1. gcc -o aiotest aiotest.c -lpthread -laio
2. dd if=/dev/zero of=test bs=1048576 count=4096 oflag=direct
3. ./aiotest test
  
Actual results:
Terminates with a "corruption found" message after a nondeterministic amount of time, usually about a second.

Expected results:
After reading the whole 4 GB file, the program should loop forever, forking and waiting.

Additional info:
When the loop of fork() calls is replaced with an empty loop, everything works correctly.
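
For reference, a fork()/wait() loop of the kind that triggers the problem might look like the sketch below. It is bounded so that it terminates, whereas the test program loops indefinitely; the function name fork_wait_loop() is an assumption made for this example and is not taken from the attachment.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and reap `iterations` children in sequence.
 * Returns the number of children that exited with status 0,
 * or -1 if fork() itself failed. */
static int fork_wait_loop(int iterations)
{
    int reaped = 0;

    assert(iterations >= 0);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;      /* fork() failed */
        if (pid == 0)
            _exit(0);       /* child: exit at once, skipping atexit handlers */

        int status = 0;
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            reaped++;
    }
    return reaped;
}
```

The child does no work at all; according to the report, the fork() calls running alongside the in-flight reads are enough to trigger the problem.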

Comment 1 Jeff Moyer 2009-07-16 13:43:08 UTC

This sounds like a duplicate of bug 471613.  Please try a kernel from here:
  http://people.redhat.com/dzickus/el5/158.el5/

Thanks,
Jeff

Comment 2 Marek Dopiera 2009-07-17 07:30:46 UTC

That's correct: this is a duplicate of the bug you mention. The kernel from the link above works correctly.

Thanks

Comment 3 Jeff Moyer 2009-07-17 13:36:06 UTC

OK, thanks for taking the time to confirm that, and thanks also for the reproducer.  While we are addressing this for RHEL 5, the upstream status of this patch is unclear.  Linus has rejected the fixes that we are incorporating, and I'm not sure at this point whether the bug will be fixed at all in upstream kernels.

If you're looking for ways to avoid this in the future, I would suggest not mixing fork(), pthreads and direct I/O to buffers that are smaller than the host system's page size.  In other words, if your reproducer program used getpagesize()-sized buffers instead of 512-byte buffers, you wouldn't see this problem.
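
A minimal sketch of that suggestion, assuming the buffer should be page-aligned as well as page-sized (the comment above only mentions size; alignment is an added assumption). The helper name alloc_page_buffer() is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate one buffer that is exactly one page long and starts on a
 * page boundary. Stores the length in *len_out and returns the buffer,
 * or returns NULL on failure. The caller releases it with free(). */
static void *alloc_page_buffer(size_t *len_out)
{
    void *buf = NULL;
    long page;

    assert(len_out != NULL);

    /* sysconf(_SC_PAGESIZE) reports the same value as getpagesize(). */
    page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page, (size_t)page) != 0)
        return NULL;

    memset(buf, 0xff, (size_t)page);   /* poison, as the reproducer does */
    *len_out = (size_t)page;
    return buf;
}
```

A buffer obtained this way does not share its page with any other allocation, which is the property the advice above relies on. Whether this is sufficient on a given kernel is not something this sketch can establish.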

*** This bug has been marked as a duplicate of bug 471613 ***