There is a race induced by calling "echo scsi-remove-single-device a b c d" > /proc/scsi/scsi for a device while simultaneously doing an open() on the device node. Incidentally, this is what our OMSS product does... :-(

proc_scsi_gen_write(): scsi-remove-single-device
    if (scd->access_count)
        goto out;
    SDTpnt = scsi_devicelist;
    while (SDTpnt != NULL) {
        if (SDTpnt->detach)
            (*SDTpnt->detach) (scd);
        SDTpnt = SDTpnt->next;
    }

sg_open():
    if (sdp->detached)
        return -ENODEV;

If the open() comes before sdp->detached is set, then the open will succeed and I/Os may be issued to a nonexistent device.

While not observed on RHEL3 (yet), the scsi_devicelist is still a global and not protected by any locks.

---------- Action by: mdomsch
Investigating preventing OMSS from doing the remove-single-device...
Status set to: Waiting on Tech (Long Term)
Severity set to: High

---------- Action by: fhirtz
fhirtz assigned to issue for Dell-Engineering.
Category set to: Kernel
Status set to: Waiting on Tech

---------- Action by: fhirtz
Issue escalated to Support Engineering Group by: fhirtz.

---------- Action by: fhirtz
Summary edited.

---------- Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...
wcheng assigned to issue for Support Engineering Group.

---------- Action by: wcheng
ok, looks like this is different from bugzilla 126158 but another hole in this add and remove device arena.
Issue escalated to Sustaining Engineering by: wcheng.

---------- Action by: sdenham
Escalated to Bugzilla

---------- Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and put on the Update 6 must-fix blocker list.

---------- Action by: fhirtz
Escalated to Bugzilla

---------- Action by: fhirtz
Summary edited.

---------- Action by: fhirtz
Bugzilla id 131493 added to issue.
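As an illustration of the open()/remove race described at the top of this report, here is a minimal userspace sketch in C. It is not the kernel code. It assumes a hypothetical struct model_dev in which one pthread mutex guards both access_count and detached, so the remove path's access_count test and the open path's detached test cannot interleave. The names model_dev, model_open, model_release and model_remove, and the -EBUSY return from the remove path, are assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a SCSI device; not the kernel struct. */
struct model_dev {
    pthread_mutex_t lock;  /* assumed: one lock covers both fields below */
    int access_count;      /* number of current openers */
    bool detached;         /* set once the device has been removed */
};

/* Open path: test detached and bump access_count in one critical section,
 * so a remover cannot slip in between the test and the increment. */
int model_open(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->detached)
        ret = -ENODEV;
    else
        d->access_count++;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Release path: drop the reference under the same lock. */
void model_release(struct model_dev *d)
{
    pthread_mutex_lock(&d->lock);
    assert(d->access_count > 0);
    d->access_count--;
    pthread_mutex_unlock(&d->lock);
}

/* Remove path: test access_count and set detached in one critical section.
 * Returning -EBUSY for a busy device is an assumption of this sketch. */
int model_remove(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->access_count > 0)
        ret = -EBUSY;
    else
        d->detached = true;
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

With this shape, an open that wins the lock makes a later remove fail with -EBUSY, and a remove that wins makes a later open fail with -ENODEV. Neither side can act on a stale check.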
Note: This is the same issue as the one logged in 131493, but against RHEL3. Dell believes that this issue might impact us on RHEL3, though it hasn't been explicitly seen on the platform yet.
<feedback from Dell> This new timeline is problematic. We didn't put any resources into solving this ourselves because Tim Burke had said that he considered this a showstopper for U4, and had assigned Doug to work on it. This problem lets root easily crash the kernel at runtime. Updates shouldn't be released with known kernel crash issues open. Raising to Sev. 1 (Urgent).
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-22.EL).
Per Matt on 10/29 a.m.: according to Dell's testing last evening of the respun .21-22 kernel, they are still seeing the system panic with a race between add device and remove device.
Urgent: Per Matt Domsch on 10/29 a.m.: Dell tested the RHEL 3 U4 respun kernel (.21-22) last evening and are STILL seeing the system panic with a race between add device and remove device.
Created attachment 106358: Kernel backtrace from Dell detailing the above reported panic
Frank/Sue/Don, can this problem be reproduced with a non-tainted kernel? Doug, should they open a new bugzilla if so? Or is the original problem still unresolved? (I'm leaving this bug in MODIFIED state under the assumption that Doug's patch taken into 2.4.21-22.EL has actually fixed something. Unless I'm instructed otherwise, this bug will remain associated with the U4 erratum and will automatically be closed when U4 is released.)
Making the bug public.
Matt, is there a good reason you made this bug public?
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html
Dell is reopening this issue, which is still seen on RHEL 3 U4 and RHEL 2.1 U6. While it's true that the U4 and U6 errata do address one failing condition, there are other failures that are still not resolved as identified in IT 45654 as follows: "RHEL 2.1 U6 and RHEL3 U4 still can be made to fail using the scripts noted above. This needs to be a RHEL4 U5 blocker. The access_count variables aren't always access protected by q->queue_lock, and you use spin_lock_irq() in places where I suspect _irqsave() might be more correct. We're still hitting races with sg_release() (aka sg_close()) and device removal using the ploink and scanbus apps, and with our own OMSS utility. :-( " Adding back to the U5 blocker list.
Created attachment 111519: Cleanups needed to try and make device add/removal race free. Part 1 of the cleanups needed.
Created attachment 111520: More cleanups needed to make add/remove race free. Part 2 of the needed cleanups.
There still needs to be a Part 3 to complete this work. As I mentioned before, this is rather invasive, and there is a question as to whether or not it would get past the internal patch review process on the grounds that it's too risky of a change to core code without first having upstream testing.
According to Doug Ledford, "fixing this bug entirely is very likely beyond the scope of acceptable change in the rhel3 product." Thus, I'm closing this as WONTFIX.