Description of problem:
Two testing environments:
- a disk with bad blocks
- a failing disk/controller (occasionally reset by the kernel, no bad blocks)
In both cases an issued aio_read can return success while the read buffer is left unmodified. I assume that the error code was lost during aio processing.

Version-Release number of selected component (if applicable):
2.6.9-42.ELsmp

How reproducible:
Issue massive aio reads on a failed device.

Actual results:
No error is reported and the data is not read.

Expected results:
An error is reported.

Additional info:
On the same machines, tests with vanilla 2.6.18-rc6 plus the series of 5 patches by Zach from http://lkml.org/lkml/2006/9/5/267 made the misbehavior vanish. No bad data was received during my tests. It also seems that these patches fixed https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=198859 as well.
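For anyone trying to detect this symptom from user space: one option is to poison the buffer before submitting the read and to check it once the request completes. The following is only a minimal sketch, assuming the POSIX AIO interface from <aio.h> (the report does not say which AIO API the test program used). The names checked_aio_read and aio_read_selftest and the SENTINEL value are hypothetical, introduced here for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define SENTINEL 0xA5   /* arbitrary poison byte, chosen for this sketch */

/*
 * Submit one asynchronous read and wait for it to finish.
 * Returns the number of bytes read, or -1 with errno set.
 * A "successful" read that left every byte at SENTINEL is reported as EIO.
 */
static ssize_t checked_aio_read(int fd, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    const struct aiocb *list[1];
    const unsigned char *p = buf;
    ssize_t got;
    size_t i;
    int err;

    memset(buf, SENTINEL, len);         /* poison the buffer before the read */

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) != 0)
        return -1;                      /* submission failed, errno is set */

    list[0] = &cb;
    while ((err = aio_error(&cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);     /* sleep until completion or a signal */

    got = aio_return(&cb);              /* collect the result exactly once */
    if (err != 0) {
        errno = err;                    /* the error the request finished with */
        return -1;
    }
    if (got <= 0)
        return got;                     /* end of file, nothing to verify */

    for (i = 0; i < (size_t)got; i++)
        if (p[i] != SENTINEL)
            return got;                 /* at least one byte was overwritten */

    errno = EIO;                        /* success reported, buffer untouched */
    return -1;
}

/* Self-check on an ordinary temporary file; returns 0 on success. */
int aio_read_selftest(void)
{
    char buf[16];
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    fputs("hello", f);
    fflush(f);                          /* push the data out to the file */
    assert(checked_aio_read(fileno(f), buf, sizeof(buf), 0) == 5);
    assert(memcmp(buf, "hello", 5) == 0);
    fclose(f);
    return 0;
}
```

Note that a region whose real contents happen to be all SENTINEL bytes would be flagged as a false positive, so the poison value should be one the test data cannot contain. On older glibc versions this needs to be linked with -lrt.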
Just so you know that this issue isn't going unnoticed, I have looked at the patches posted by Zach. I reviewed them and tested them locally. However, they are quite intrusive, and as such, I would like to see a bit more testing performed on them before they are merged into a RHEL update. I am going to put together some test kernels for this purpose and I'll let you know when they are ready. Any help you could provide in stress testing the test kernels would be greatly appreciated.
I still have the disk that causes the trouble on the shelf. I will test again once a new kernel is available.
OK, i686 kernels are available at the following location: http://people.redhat.com/jmoyer/dio/ The version of the kernel is 2.6.9-42.27.EL.dio.1. Please let me know the results of your testing (or if you need other architectures).
The situation seems a little unclear. Running the test case for a few hours showed no misbehavior. However, after those few hours (about five) the server went down: it was responding to pings, but nothing more. From my point of view the issue was fixed. The patched kernel, however, may not be as stable as we would like. On the bright side, I remember the same thing happening on the distribution kernel: a long stress test brought the machine down there as well.
Does the server respond to keyboard interrupts? If so, try configuring either netdump or diskdump, and pressing Alt-Sysrq-C to invoke a crashdump. You may also want to try to boot with "nmi_watchdog=1" on the kernel command line. Just to be clear, you said that the kernel fixes the issue, and does not regress from the stock RHEL kernel, correct?
Sorry, no access to the keyboard.

> Just to be clear, you said that the kernel fixes the issue
Most probably. The distro kernel never managed such a long test.

> and does not regress from the stock RHEL kernel
Probably.

Unfortunately the situation is difficult (no access to the keyboard, for example) and I can't test at will. However, such hangs were observed on the distro kernel as well. It's just that this time it happened on the first run. Maybe I was unlucky, maybe the beta kernel hangs more often than the distro one.
(In reply to comment #9)
> Sorry - no access to keyboard.
> > Just to be clear, you said that the kernel fixes the issue
> Most probably, distro kernel never managed such long test.
> > and does not regress from the stock RHEL kernel
> Probably.
>
> Unfortunately situation is difficult (like no access to KB) and can't test it at
> my will. However such hangups were observed on distro kernel as well. It's just
> this time it happened at first run. Maybe I was unlucky, maybe the beta kernel
> hangs more often than distro one.

OK. I'd still recommend adding nmi_watchdog=1 to the kernel command line. It may not trigger a panic, though, since it sounds like the system is still processing interrupts. If we wanted to get really tricky, we could hack up the network stack so that a specially crafted ping packet could invoke a panic; obviously I don't recommend this for a production server.
After longer testing with more complicated tests, it seems that bug 210281 is very easy to hit. Stability is therefore lower.
Are you able to generate a crash dump?
Also, can you elaborate on "more complicated tests"? I'd like to reproduce this in-house, if at all possible.
Created attachment 145467
Fix race between I/O completion and aio exit path

wijita, can you please test with this patch (in addition to the patches already applied to the test kernel)? It moves the testing of ctx->reqs_active inside the ctx_lock, which I think could prevent the race which gets you stuck in wait_for_all_aios. Let me know if you need me to spin another kernel. Thanks!
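The attachment itself is not reproduced in this thread. As a rough user-space analogy of the pattern described (test the in-flight counter only while holding the lock that the completion path also takes), here is a hedged sketch using pthreads. It is not the kernel patch: struct ctx_model, complete_request, wait_for_all and race_model_selftest are hypothetical names, and a mutex plus condition variable stand in for ctx_lock and the kernel wait queue.

```c
#include <assert.h>
#include <pthread.h>

#define NREQS 1000   /* number of simulated in-flight requests */

/* User-space stand-in for the AIO context; not the kernel's structure. */
struct ctx_model {
    pthread_mutex_t lock;        /* plays the role of ctx_lock */
    pthread_cond_t  drained;     /* signalled when reqs_active reaches zero */
    int             reqs_active; /* requests still in flight */
};

/* Completion path: drop one request and wake the waiter on the last one. */
static void complete_request(struct ctx_model *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    if (--ctx->reqs_active == 0)
        pthread_cond_broadcast(&ctx->drained);
    pthread_mutex_unlock(&ctx->lock);
}

/*
 * Exit path: the counter is tested while holding the lock, so a completion
 * cannot slip in between the test and going to sleep (a lost wakeup).
 */
static void wait_for_all(struct ctx_model *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    while (ctx->reqs_active > 0)
        pthread_cond_wait(&ctx->drained, &ctx->lock);
    pthread_mutex_unlock(&ctx->lock);
}

/* Thread body: completes every simulated request. */
static void *completer(void *arg)
{
    struct ctx_model *ctx = arg;
    int i;

    for (i = 0; i < NREQS; i++)
        complete_request(ctx);
    return NULL;
}

/* Run one completer against one waiter; returns the final reqs_active. */
int race_model_selftest(void)
{
    struct ctx_model ctx;
    pthread_t t;

    pthread_mutex_init(&ctx.lock, NULL);
    pthread_cond_init(&ctx.drained, NULL);
    ctx.reqs_active = NREQS;

    if (pthread_create(&t, NULL, completer, &ctx) != 0)
        return -1;
    wait_for_all(&ctx);                 /* returns only once all are done */
    pthread_join(t, NULL);
    assert(ctx.reqs_active == 0);

    pthread_cond_destroy(&ctx.drained);
    pthread_mutex_destroy(&ctx.lock);
    return ctx.reqs_active;
}
```

If the waiter instead read reqs_active before taking the lock, the last completion could run between that read and the sleep, and the waiter would never be woken. That is the kind of hang the comment above attributes to wait_for_all_aios.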
> Let me know if you need me to spin another kernel.
That would be nice. Then I would be certain that I haven't changed anything. Do I understand correctly that it should fix bug 210281?
(In reply to comment #15)
> > Let me know if you need me to spin another kernel.
> That would be nice. Then I would be certain that I haven't changed anything.
> I understand, that it should fix the bug210281 ?

Yes, that is what I would like to verify. New kernels can be found here: http://people.redhat.com/jmoyer/dio/ They are the dio.2 variants. Thanks!
Note that the UP kernels will not boot. The reason is that the RHEL 4 kernels do not have "assert_spin_locked," and so I substituted BUG_ON(!spin_is_locked()). This is not a good substitution, as spin_is_locked always returns false in UP. As long as you stick with the SMP variants, you should be fine.
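To make the failure mode concrete, here is a toy model in plain C. It is a sketch only, assuming nothing beyond what the comment above states: the UP query always reports "not locked", so a check of the form BUG_ON(!spin_is_locked()) fires even when the caller logically holds the lock. The names struct model_lock, smp_spin_is_locked, up_spin_is_locked and bug_on_would_fire are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>

/* Toy lock for illustration; not the kernel's spinlock_t. */
struct model_lock {
    int held;
};

/* SMP-style query: reports the real lock state. */
static int smp_spin_is_locked(const struct model_lock *l)
{
    return l->held;
}

/* UP-style query: no lock state is tracked, so it always answers "no". */
static int up_spin_is_locked(const struct model_lock *l)
{
    (void)l;
    return 0;
}

/* Would BUG_ON(!spin_is_locked(lock)) fire for this query result? */
static int bug_on_would_fire(int is_locked)
{
    return !is_locked;
}

/* Compare both variants while the lock is logically held; returns 0. */
int up_assert_selftest(void)
{
    struct model_lock l = { 1 };    /* the caller really does hold the lock */

    assert(!bug_on_would_fire(smp_spin_is_locked(&l)));  /* SMP: check passes */
    assert(bug_on_would_fire(up_spin_is_locked(&l)));    /* UP: fires anyway */
    return 0;
}
```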
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Committed in stream U6 build 55.10. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL4.6-Snapshot1 on partners.redhat.com. Requested action: Please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message about your test results to Issue Tracker. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should be included in RHEL4.6-Snapshot2--available soon on partners.redhat.com. Please verify that your issue is fixed to ensure that it is included in this update release.
A fix for this issue should have been included in the packages contained in the RHEL4.6-Snapshot3 on partners.redhat.com. Please verify that your issue is fixed to ensure that it is included in this update release.
A fix for this issue should be included in the packages contained in RHEL4.6-Snapshot4--available now on partners.redhat.com. Please verify that your issue is fixed ASAP to ensure that it is included in this update release.
A fix for this issue should be included in the packages contained in RHEL4.6-Snapshot5--available now on partners.redhat.com. Please verify that your issue is fixed ASAP to ensure that it is included in this update release.
A fix for this issue should be included in the packages contained in RHEL4.6-Snapshot6--available now on partners.redhat.com. Please verify that your issue is fixed ASAP to ensure that it is included in this update release.
A fix for this issue should be included in the packages contained in RHEL4.6-Snapshot7--available now on partners.redhat.com. IMPORTANT: This is the last opportunity to confirm that your issue is fixed in the RHEL4.6 update release.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html