Bug 131518
Summary: | [RHEL3 U4] SCSI midlayer race on scsi_devicelist | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Frank Hirtz <fhirtz> |
Component: | kernel | Assignee: | Doug Ledford <dledford> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 3.0 | CC: | dff, jbaron, mwesley, peterm, petrides, riel, tao, tburke, us_linux_engineering |
Hardware: | All | ||
OS: | Linux | ||
Doc Type: | Bug Fix | ||
Last Closed: | 2005-10-11 18:52:58 UTC | ||
Description
Frank Hirtz
2004-09-01 20:27:20 UTC
Note: This is the same issue as is logged in 131493, but against RHEL3. Dell believes that this issue might impact us on RHEL3 though it hasn't been explicitly seen on the platform as of yet.

<feedback from Dell> This new timeline is problematic. We didn't put any resources into solving this ourselves because Tim Burke had said that he considered this a showstopper for U4, and had assigned Doug to work on it. This problem lets root easily crash the kernel at runtime. Updates shouldn't be released with known kernel crash issues open.

Raising to Sev. 1 (Urgent).

A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-22.EL).

Per Matt on 10/29 a.m.: according to Dell's last evening testing of the respun .21-22 kernel, they are still seeing the system panic with a race between add device and remove device.

Urgent: Per Matt Domsch on 10/29 a.m.: Dell tested the RHEL 3 U4 respun kernel (.21-22) last evening and are STILL seeing the system panic with a race between add device and remove device.

Created attachment 106358 [details]
Kernel backtrace from Dell detailing the above reported panic

Frank/Sue/Don, can this problem be reproduced with a non-tainted kernel? Doug, should they open a new bugzilla if so? Or is the original problem still unresolved? (I'm leaving this bug in MODIFIED state under the assumption that Doug's patch taken into 2.4.21-22.EL has actually fixed something. Unless I'm instructed otherwise, this bug will remain associated with the U4 erratum and will automatically be closed when U4 is released.)

Making the bug public.

Matt, is there a good reason you made this bug public?

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2004-550.html

Dell is reopening this issue, which is still seen on RHEL 3 U4 and RHEL 2.1 U6. While it's true that the U4 and U6 errata do address one failing condition, there are other failures that are still not resolved as identified in IT 45654 as follows:

"RHEL 2.1 U6 and RHEL3 U4 still can be made to fail using the scripts noted above. This needs to be a RHEL4 U5 blocker. The access_count variables aren't always access protected by q->queue_lock, and you use spin_lock_irq() in places where I suspect _irqsave() might be more correct. We're still hitting races with sg_release() (aka sg_close()) and device removal using the ploink and scanbus apps, and with our own OMSS utility. :-("

Adding back to the U5 blocker list.

Created attachment 111519 [details]
Cleanups needed to try and make device add/removal race free
Part 1 of the cleanups needed.
Created attachment 111520 [details]
More cleanups needed to make add/remove race free
Part 2 of the needed cleanups.
There still needs to be a Part 3 to complete this work. As I mentioned before, this is rather invasive, and there is a question as to whether or not it would get past the internal patch review process on the grounds that it's too risky of a change to core code without first having upstream testing.

According to Doug Ledford, "fixing this bug entirely is very likely beyond the scope of acceptable change in the rhel3 product." Thus, I'm closing this as WONTFIX.
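
For readers unfamiliar with the locking concern quoted from IT 45654 (spin_lock_irq() used where _irqsave() might be more correct), here is a minimal userspace sketch of the difference. It is not kernel code and not taken from any of the attached patches: the interrupt state is simulated with a plain boolean, and every name in it (sim_irqs_enabled, sim_access_count, the sim_* helpers, bump_with_irq, bump_with_irqsave, stays_disabled) is hypothetical. The only kernel behavior it assumes is that spin_unlock_irq() re-enables interrupts unconditionally, while spin_unlock_irqrestore() puts back the state saved by spin_lock_irqsave().

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable state of one CPU (hypothetical stand-in). */
static bool sim_irqs_enabled = true;

/* Hypothetical counter standing in for a device's access_count. */
static int sim_access_count = 0;

/* Models spin_lock_irq(): disables interrupts, remembers nothing. */
void sim_lock_irq(void) { sim_irqs_enabled = false; }

/* Models spin_unlock_irq(): always re-enables interrupts. */
void sim_unlock_irq(void) { sim_irqs_enabled = true; }

/* Models spin_lock_irqsave(): saves the current state, then disables. */
void sim_lock_irqsave(bool *flags)
{
    *flags = sim_irqs_enabled;
    sim_irqs_enabled = false;
}

/* Models spin_unlock_irqrestore(): restores whatever state was saved. */
void sim_unlock_irqrestore(bool flags) { sim_irqs_enabled = flags; }

/* Bumps the counter using the _irq pair; returns the interrupt state afterwards. */
bool bump_with_irq(void)
{
    sim_lock_irq();
    sim_access_count++;
    sim_unlock_irq();
    return sim_irqs_enabled;
}

/* Bumps the counter using the _irqsave pair; returns the interrupt state afterwards. */
bool bump_with_irqsave(void)
{
    bool flags;

    sim_lock_irqsave(&flags);
    sim_access_count++;
    sim_unlock_irqrestore(flags);
    return sim_irqs_enabled;
}

/*
 * Calls bump() from a context where interrupts are already disabled and
 * reports whether they are still disabled when bump() returns.
 */
bool stays_disabled(bool (*bump)(void))
{
    assert(bump != 0);
    sim_irqs_enabled = false;          /* caller already has interrupts off */
    bool still_off = !bump();
    sim_irqs_enabled = true;           /* reset the simulation */
    return still_off;
}
```

In this simulation, stays_disabled(bump_with_irq) returns false because the _irq unlock turns interrupts back on behind the caller's back, while stays_disabled(bump_with_irqsave) returns true. Whether any particular call site in the RHEL3 SCSI midlayer can actually be reached with interrupts disabled is not established by this report; the quoted comment only says that it is suspected.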