Bug 207114 - AIO layer returns OK even if it failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jeff Moyer
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On: 245197
Blocks: 234251 236328 245198
Reported: 2006-09-19 14:38 UTC by Rafal Wijata
Modified: 2018-10-19 20:26 UTC (History)

Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-15 16:15:21 UTC
Target Upstream Version:
Embargoed:


Attachments
Fix race between I/O completion and aio exit path (5.44 KB, patch)
2007-01-12 17:59 UTC, Jeff Moyer


Links
System: Red Hat Product Errata
ID: RHBA-2007:0791
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 6
Last Updated: 2007-11-14 18:25:55 UTC

Description Rafal Wijata 2006-09-19 14:38:41 UTC
Description of problem:
Two testing environments:
- a disk with bad blocks
- a failing disk/controller (occasionally reset by the kernel, no bad blocks)

In both cases it is possible for an issued aio_read to return success while the
read buffer is left unmodified. I assume that the error code was lost during aio
processing.
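
One way to catch the symptom described above is to pre-fill the read buffer with a known pattern and check whether a "successful" read actually changed it. The sketch below is only an illustration: it uses an ordinary synchronous read as a stand-in for the aio completion, and the helper name `read_with_sentinel_check` and the fill byte `SENTINEL` are assumptions, not part of the original test case.

```python
import os
import tempfile

SENTINEL = 0xA5  # arbitrary fill byte; an assumption of this sketch


def read_with_sentinel_check(path, offset, length):
    """Read `length` bytes at `offset` and report whether the buffer changed.

    Returns (nread, untouched, data). `untouched` is True when the read
    claimed to transfer bytes but every one of them still equals SENTINEL.
    """
    buf = bytearray([SENTINEL]) * length
    # buffering=0 gives a raw file object, so readinto() fills `buf` directly.
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        nread = f.readinto(buf)
    untouched = nread > 0 and all(b == SENTINEL for b in buf[:nread])
    return nread, untouched, bytes(buf[:nread])


if __name__ == "__main__":
    # Demo against a scratch file; a real test would target the suspect device.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
    try:
        print(read_with_sentinel_check(tmp.name, 0, 11))
    finally:
        os.unlink(tmp.name)
```

A region that legitimately contains only the fill byte would be flagged as untouched, so the pattern should be chosen to be unlikely in the data under test.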

Version-Release number of selected component (if applicable):
2.6.9-42.ELsmp

How reproducible:
Issue a large number of aio reads against a failing device.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
No error is returned, and the data is not read.

Expected results:
error

Additional info:
On the same machines, tests with
vanilla 2.6.18-rc6 + the series of 5 patches by Zach from
http://lkml.org/lkml/2006/9/5/267
made the misbehavior vanish.
No bad data was received during my tests.

It also seems that this patch series fixed
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=198859 as well.

Comment 4 Jeff Moyer 2006-11-30 18:12:36 UTC
Just so you know that this issue isn't going unnoticed, I have looked at the
patches posted by Zach.  I reviewed them and tested them locally.  However, they
are quite intrusive, and as such, I would like to see a bit more testing
performed on them before they are merged into a RHEL update.  I am going to put
together some test kernels for this purpose and I'll let you know when they are
ready.  Any help you could provide in stress testing the test kernels would be
greatly appreciated.

Comment 5 Rafal Wijata 2006-12-01 08:46:06 UTC
I still have the disk that causes the trouble on the shelf. I will test once a
new kernel is available.

Comment 6 Jeff Moyer 2006-12-01 19:51:17 UTC
OK, i686 kernels are available at the following location:

    http://people.redhat.com/jmoyer/dio/

The version of the kernel is 2.6.9-42.27.EL.dio.1.  Please let me know the
results of your testing (or if you need other architectures).

Comment 7 Rafal Wijata 2006-12-12 07:49:07 UTC
The situation seems a little unclear. Running the test case for a few hours
showed no misbehavior. However, after those few hours (about five) the server
went down. It was responding to pings, but nothing more.

From my point of view the issue is fixed. The patched kernel, however, may not
be as stable as we would like. On the bright side, I remember the same thing
happening on the distribution kernel as well: a long stress test would bring
the machine down there too.

Comment 8 Jeff Moyer 2006-12-12 14:05:59 UTC
Does the server respond to keyboard interrupts?  If so, try configuring either
netdump or diskdump, and pressing Alt-Sysrq-C to invoke a crashdump.  You may
also want to try to boot with "nmi_watchdog=1" on the kernel command line.
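
To confirm after a reboot that the option actually reached the kernel, one can inspect /proc/cmdline. A minimal sketch, assuming a Linux system; the helper names `kernel_param` and `read_cmdline` are made up for this example:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string.
    """
    value = None
    for token in cmdline.split():
        key, _sep, val = token.partition("=")
        if key == name:
            value = val  # in this sketch the last occurrence wins
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    print(kernel_param(read_cmdline(), "nmi_watchdog"))
```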

Just to be clear, you said that the kernel fixes the issue, and does not regress
from the stock RHEL kernel, correct?

Comment 9 Rafal Wijata 2006-12-15 11:14:02 UTC
Sorry - no access to the keyboard.
> Just to be clear, you said that the kernel fixes the issue
Most probably; the distro kernel never survived such a long test.
> and does not regress from the stock RHEL kernel
Probably.

Unfortunately the situation is difficult (e.g. no access to the keyboard) and I
can't test at will. However, such hangs were observed on the distro kernel as
well. It's just that this time it happened on the first run. Maybe I was
unlucky, or maybe the beta kernel hangs more often than the distro one.

Comment 10 Jeff Moyer 2006-12-15 13:52:47 UTC
(In reply to comment #9)
> Sorry - no access to the keyboard.
> > Just to be clear, you said that the kernel fixes the issue
> Most probably; the distro kernel never survived such a long test.
> > and does not regress from the stock RHEL kernel
> Probably.
> 
> Unfortunately the situation is difficult (e.g. no access to the keyboard) and I
> can't test at will. However, such hangs were observed on the distro kernel as
> well. It's just that this time it happened on the first run. Maybe I was
> unlucky, or maybe the beta kernel hangs more often than the distro one.

OK.  I'd still recommend adding nmi_watchdog=1 to the kernel command line.  It
may not trigger a panic, though, since it sounds like the system is still
processing interrupts.  If we wanted to get really tricky, we could hack up the
network stack so that a specially crafted ping packet could invoke a panic; 
obviously I don't recommend this for a production server.

Comment 11 Rafal Wijata 2006-12-19 13:39:15 UTC
After longer testing with more complicated tests, it seems that hitting
bug 210281 is very easy. Stability is therefore lower.

Comment 12 Jeff Moyer 2006-12-19 14:44:59 UTC
Are you able to generate a crash dump?

Comment 13 Jeff Moyer 2006-12-19 14:52:26 UTC
Also, can you elaborate on "more complicated tests"?  I'd like to reproduce this
in-house, if at all possible.

Comment 14 Jeff Moyer 2007-01-12 17:59:02 UTC
Created attachment 145467 [details]
Fix race between I/O completion and aio exit path

wijita, can you please test with this patch (in addition to the patches already
applied to the test kernel)?  It moves the testing of ctx->reqs_active inside
the ctx_lock, which I think could prevent the race which gets you stuck in
wait_for_all_aios.
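
As a rough user-space analogy of the idea in this comment (not the attached patch itself), the sketch below keeps the test of the outstanding-request counter under the same lock that completions take. The class `AioContext`, the `drained` condition, and `complete_one` are hypothetical names for illustration; only `reqs_active`, `ctx_lock`, and `wait_for_all_aios` come from the comment above, and the real kernel code uses spinlocks and wait queues rather than Python threading.

```python
import threading


class AioContext:
    """Toy model of an aio context with a count of in-flight requests."""

    def __init__(self, reqs_active):
        self.ctx_lock = threading.Lock()
        # The condition shares ctx_lock, so check and sleep are atomic.
        self.drained = threading.Condition(self.ctx_lock)
        self.reqs_active = reqs_active

    def complete_one(self):
        """Completion path: retire one request and wake the waiter if last."""
        with self.ctx_lock:
            self.reqs_active -= 1
            if self.reqs_active == 0:
                self.drained.notify_all()

    def wait_for_all_aios(self, timeout=None):
        """Exit path: block until no requests remain.

        reqs_active is only ever tested while ctx_lock is held, so a
        completion cannot slip in between the test and the sleep.
        """
        with self.ctx_lock:
            return self.drained.wait_for(
                lambda: self.reqs_active == 0, timeout
            )


if __name__ == "__main__":
    ctx = AioContext(reqs_active=8)
    workers = [threading.Thread(target=ctx.complete_one) for _ in range(8)]
    for w in workers:
        w.start()
    print(ctx.wait_for_all_aios(timeout=5))
    for w in workers:
        w.join()
```

If the waiter instead read reqs_active before taking the lock, the last completion could decrement the counter and issue its wakeup in the gap between the check and the sleep, leaving the waiter stuck.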

Let me know if you need me to spin another kernel.

thanks!

Comment 15 Rafal Wijata 2007-01-15 10:28:41 UTC
> Let me know if you need me to spin another kernel.
That would be nice. Then I would be certain that I haven't changed anything.
As I understand it, this should fix bug 210281?

Comment 16 Jeff Moyer 2007-01-15 22:55:55 UTC
(In reply to comment #15)
> > Let me know if you need me to spin another kernel.
> That would be nice. Then I would be certain that I haven't changed anything.
> As I understand it, this should fix bug 210281?

Yes, that is what I would like to verify.  New kernels can be found here:

  http://people.redhat.com/jmoyer/dio/

They are the dio.2 variants.

Thanks!

Comment 17 Jeff Moyer 2007-01-16 18:03:17 UTC
Note that the UP kernels will not boot.  The reason is that the RHEL 4 kernels
do not have "assert_spin_locked," and so I substituted
BUG_ON(!spin_is_locked()).  This is not a good substitution, as spin_is_locked
always returns false in UP.  As long as you stick with the SMP variants, you
should be fine.
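
The failure mode can be shown with a toy model. This is only a sketch of the behavior described in this comment: the Python functions below are hypothetical stand-ins that mimic the kernel names, with an `smp` flag to represent the build type.

```python
class KernelBug(Exception):
    """Stand-in for the kernel's BUG() in this toy model."""


def spin_is_locked(held, smp):
    """Model of spin_is_locked: per the comment above, always False on UP."""
    return held if smp else False


def bug_on_unlocked(held, smp):
    """Model of the substituted check, BUG_ON(!spin_is_locked())."""
    if not spin_is_locked(held, smp):
        raise KernelBug("lock not held")


if __name__ == "__main__":
    bug_on_unlocked(held=True, smp=True)  # SMP: passes while the lock is held
    try:
        bug_on_unlocked(held=True, smp=False)  # UP: fires despite the lock
    except KernelBug as exc:
        print("UP build:", exc)
```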

Comment 21 RHEL Program Management 2007-04-27 19:22:48 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 23 Jason Baron 2007-06-20 19:44:50 UTC
Committed in stream U6 build 55.10. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 26 John Poelstra 2007-08-29 16:37:38 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6 Beta released on RHN (also available at partners.redhat.com).  

Requested action: Please verify that your issue is fixed to ensure that it is
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.


Comment 27 John Poelstra 2007-09-05 22:27:14 UTC
A fix for this issue should have been included in the packages contained in 
the RHEL4.6-Snapshot1 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed to ensure that it is 
included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed, 
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent 
symptoms of the problem you are having and change the status of the bug to 
FAILS_QA.

If you cannot access bugzilla, please reply with a message about your test 
results to Issue Tracker.  If you need assistance accessing 
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 28 John Poelstra 2007-09-12 00:43:00 UTC
A fix for this issue should be included in RHEL4.6-Snapshot2--available soon on
partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.


Comment 29 John Poelstra 2007-09-20 04:31:18 UTC
A fix for this issue should have been included in the packages contained in the
RHEL4.6-Snapshot3 on partners.redhat.com.  

Please verify that your issue is fixed to ensure that it is included in this
update release.



Comment 30 John Poelstra 2007-09-26 23:36:31 UTC
A fix for this issue should be included in the packages contained in
RHEL4.6-Snapshot4--available now on partners.redhat.com.  

Please verify that your issue is fixed ASAP to ensure that it is included in
this update release.


Comment 31 John Poelstra 2007-10-05 02:58:32 UTC
A fix for this issue should be included in the packages contained in
RHEL4.6-Snapshot5--available now on partners.redhat.com.  

Please verify that your issue is fixed ASAP to ensure that it is included in
this update release.


Comment 32 John Poelstra 2007-10-11 03:10:12 UTC
A fix for this issue should be included in the packages contained in
RHEL4.6-Snapshot6--available now on partners.redhat.com.  

Please verify that your issue is fixed ASAP to ensure that it is included in
this update release.



Comment 33 John Poelstra 2007-10-18 18:54:13 UTC
A fix for this issue should be included in the packages contained in 
RHEL4.6-Snapshot7--available now on partners.redhat.com.  

IMPORTANT: This is the last opportunity to confirm that your issue is fixed in 
the RHEL4.6 update release.


Comment 36 errata-xmlrpc 2007-11-15 16:15:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html


