Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 242116

Summary:

race in aio io_submit write/read

Product:

Red Hat Enterprise Linux 5

Reporter:

Jeff Moyer <jmoyer>

Component:

kernel

Assignee:

Jeff Moyer <jmoyer>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

high

Docs Contact:

Priority:

medium

Version:

5.0

CC:

dzickus, i-kitayama

Target Milestone:

---

Target Release:

---

Hardware:

i386

OS:

Linux

Whiteboard:

Fixed In Version:

RHBA-2007-0959

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2007-11-07 19:50:50 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
[PATCH] dio: centralize completion in dio_complete()	none
[PATCH] dio: call blk_run_address_space() once per op	none
[PATCH] dio: formalize bio counters as a dio reference count	none
[PATCH] dio: remove duplicate bio wait code	none
[PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED	none

Description Jeff Moyer 2007-06-01 19:25:35 UTC

+++ This bug was initially created as a clone of Bug #198859 +++

Description of problem:
See http://bugzilla.kernel.org/show_bug.cgi?id=6831

Version-Release number of selected component (if applicable):
See http://bugzilla.kernel.org/show_bug.cgi?id=6831

How reproducible:
See http://bugzilla.kernel.org/show_bug.cgi?id=6831

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

-- Additional comment from jmoyer on 2006-08-09 13:52 EST --
I'll work with Zach on this.

-- Additional comment from jmoyer on 2007-01-12 18:04 EST --
I put together a kernel that includes Zach's patch set.  Could you please test
with this kernel to ensure that the bug is fixed for you?  I will test this on
Monday, but for now my RHEL 4 test box is toast.

-- Additional comment from wijata on 2007-01-15 05:27 EST --
Since there is a testcase, I'll pass over. Sorry have other stuff todo.

-- Additional comment from jmoyer on 2007-01-15 10:14 EST --
No problem, I'll take care of it.

Thanks!

-- Additional comment from jmoyer on 2007-01-16 13:01 EST --
The kernel which includes Zach's patch set (found at
http://people.redhat.com/jmoyer/dio/) fixes this problem in my test environment.

-- Additional comment from wijata on 2007-02-26 08:24 EST --
I have some bad news. It looks like the race was fixed, but another bug was
introduced.
First i encountered following scenario:
aio read issued to the file - OK
exactly same aio read issued to same file - returned less data than the first.
of course between those two reads there were some appends to the file.

Decided to create testcase, success. Moreover, upon the testcase failure
bug210281 was hit as well.

The test is ugly, and need iterating, but after 5 hours it fired at mine
machine. I hope the test is valid(need verification), but hitting bug210281
suggests it's valid.

compile(without DO_SYNCH), run as: while ./aio_backread; do true; done

-- Additional comment from wijata on 2007-02-26 08:28 EST --
Created an attachment (id=148791)
testcase for shortread

Tested on Tyan(something) 2*opteron275 with sata disk WD3200JS-00P attached to
sata_sil(onboard)

-- Additional comment from wijata on 2007-02-26 08:42 EST --
The kernel version: Linux server 2.6.9-42.27.EL.dio.2smp #1 SMP Mon Jan 15
15:05:46 EST 2007 i686 athlon i386 GNU/Linux

-- Additional comment from wijata on 2007-02-26 08:56 EST --
> compile(without DO_SYNCH)
in fact compiling with the DO_SYNCH (gcc aio_backread.c -DDO_SYNCH -lpthread
-laio) made the bug rendering faster(but no bug210281). No idea why...
Still trying on official RH kernel...

-- Additional comment from jmoyer on 2007-02-26 14:33 EST --
Thanks for your continued help on this problem;  it is greatly appreciated.  I
have some comments on the code for you, and some test results.

I'm not sure you should use the same buffer for concurrent reads and writes.  It
looks like you meant to initialize the write iocbs with buf.  Is that right?

The DO_SYNCH code looks incorrect to me.  You should guard access to
submittedSize with a mutex/conditional pair.

I was able to reproduce the short read twice so far.  In both cases, I got the
following output:
  short read: expected at least 1024 bytes, receiced 0

I think the expected size of 1024 can only happen after the very first set of
writes completes, right?  Is this consistent with the failures you've witnessed?

-- Additional comment from jmoyer on 2007-02-26 14:58 EST --
Sure enough, I modified the source code to only issue a dozen reads and then to
cancel the write thread and exit the program.  I can now reproduce the short
read in a matter of seconds.

-- Additional comment from wijata on 2007-02-27 03:09 EST --
> looks like you meant to initialize the write iocbs with buf.  Is that right?
Right, my mistake.

> submittedSize with a mutex/conditional pair
possibly(or atomic asm op), but: int writes are atomic anyway on x86, and the
bug appears without synch as well.

> I can now reproduce the short read in a matter of seconds.
glad You could improve the testcase.

BTW: official RH kernel 2.6.9-42.0.3.ELsmp with the synch was working well for
20hours, so I guess the patch introduced something malicious.

-- Additional comment from pm-rhel on 2007-04-27 15:24 EST --
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 1 Jeff Moyer 2007-06-01 19:25:35 UTC

Created attachment 155921 [details]
[PATCH] dio: centralize completion in dio_complete()

Comment 2 Jeff Moyer 2007-06-01 19:26:07 UTC

Created attachment 155922 [details]
[PATCH] dio: call blk_run_address_space() once per op

Comment 3 Jeff Moyer 2007-06-01 19:26:33 UTC

Created attachment 155923 [details]
[PATCH] dio: formalize bio counters as a dio reference count

Comment 4 Jeff Moyer 2007-06-01 19:26:52 UTC

Created attachment 155924 [details]
[PATCH] dio: remove duplicate bio wait code

Comment 5 Jeff Moyer 2007-06-01 19:27:14 UTC

Created attachment 155925 [details]
[PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED

Comment 6 Jeff Moyer 2007-06-01 19:28:57 UTC

This bug needs to get fixed as it is a data corrupter.  There is a reproducer
for this bug:
  https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=148791
so it should be easy to verify that it is fixed.

Comment 7 RHEL Program Management 2007-06-01 19:44:01 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 RHEL Program Management 2007-06-01 20:01:58 UTC

This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 9 Don Zickus 2007-06-16 00:32:09 UTC

in 2.6.18-27.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2007-11-07 19:50:50 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html