There is a race induced by running

    echo "scsi-remove-single-device a b c d" > /proc/scsi/scsi

for a device while simultaneously doing an open() on the device node. Incidentally, this is what our OMSS product does... :-(

proc_scsi_gen_write(): scsi-remove-single-device

    if (scd->access_count)
        goto out;

    SDTpnt = scsi_devicelist;
    while (SDTpnt != NULL) {
        if (SDTpnt->detach)
            (*SDTpnt->detach) (scd);
        SDTpnt = SDTpnt->next;
    }

sg_open():

    if (sdp->detached)
        return -ENODEV;

If the open() comes before sdp->detached is set, then the open will succeed and I/Os may be issued to a nonexistent device. While this has not been observed on RHEL3 (yet), scsi_devicelist is still a global there and is not protected by any locks.

---------- Action by: mdomsch
Investigating preventing OMSS from doing the remove-single-device...
Status set to: Waiting on Tech (Long Term)
Severity set to: High

---------- Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...
wcheng assigned to issue for Support Engineering Group.

---------- Action by: wcheng
ok, looks like this is different from bugzilla 126158 but another hole in this add and remove device arena.
Issue escalated to Sustaining Engineering by: wcheng.

---------- Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and put on the Update 6 must-fix blocker list.
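To make the window concrete: the remove path tests access_count and only later marks the device detached, while sg_open() tests detached on its own, so an open can slip in between the two. The following is a minimal userspace sketch of one way to close that kind of check-then-act window, by doing each check and its matching update under a single lock. It is not the kernel code and not necessarily the change that was eventually committed; struct model_dev, model_open(), model_close() and model_remove() are hypothetical names invented for illustration, and a pthread mutex stands in for whatever locking the midlayer would actually use.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for a SCSI device; not a kernel structure. */
struct model_dev {
    pthread_mutex_t lock;   /* guards both fields below */
    int access_count;       /* number of current openers */
    int detached;           /* set once the device has been removed */
};

#define MODEL_DEV_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

/* Open path: the detached test and the access_count bump happen under
 * the same lock, so a concurrent remove sees either both or neither. */
int model_open(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->detached)
        ret = -ENODEV;
    else
        d->access_count++;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Close path: drop one opener. */
void model_close(struct model_dev *d)
{
    pthread_mutex_lock(&d->lock);
    assert(d->access_count > 0);
    d->access_count--;
    pthread_mutex_unlock(&d->lock);
}

/* Remove path: the access_count test and the detached store happen under
 * the same lock, so no open can succeed between the check and the detach. */
int model_remove(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->access_count)
        ret = -EBUSY;
    else
        d->detached = 1;
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

With this arrangement a remove attempted while the device is open fails with -EBUSY, and an open attempted after a successful remove fails with -ENODEV; there is no interleaving in which both succeed.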
Doug is doing a substantial rework of the scsi midlayer surrounding device addition and removal. It will consist of adding smp locking. This is a substantial amount of work which may be somewhat high impact (i.e. a large change). We are reluctant to make the corresponding change in the RHEL2.1 pool, as we are much more conservative there. As a result, it's possible that this one does not get addressed in the RHEL2.1 U6 update. It's possible that through his rework in the RHEL3 pool Doug may find a tactical minor change that addresses this one, but that's a stretch goal.
Per Matt on 10/29 a.m.: according to Dell's testing last evening of the respun .21-22 kernel (RHEL3), they are still seeing the system panic with a race between add device and remove device.
A fix for this problem has been committed to the RHEL2.1 U6 patch pool (in kernel version 2.4.9-e.52).
Making the bug public.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-505.html
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-504.html
Unfortunately, we are still able to induce a failure with this kernel and the scripts attached in IssueTracker 45654.
I believe the initial issue in this bug was fixed with Update 6, that being the race between open("/dev/sgX") and scsi-remove-single-device. Therefore I'm re-closing this.