Bug 131493 - [RHEL2.1] SCSI midlayer race on scsi_devicelist
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All Linux
Priority: medium  Severity: high
Assigned To: Doug Ledford
Depends On:
Blocks: 123573
Reported: 2004-09-01 14:46 EDT by Frank Hirtz
Modified: 2007-11-30 17:06 EST (History)

Doc Type: Bug Fix
Last Closed: 2004-12-14 13:11:10 EST
Description Frank Hirtz 2004-09-01 14:46:24 EDT
There is a race induced by running
echo "scsi-remove-single-device a b c d" > /proc/scsi/scsi
for a device while simultaneously doing an open() on the device
node.  Incidentally, this is what our OMSS product does... :-(

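proc_scsi_gen_write(): scsi-remove-single-device
                /* Refuse to remove a device that is currently open. */
                if (scd->access_count)
                        goto out;

                /* Window: an open() can slip in between the check above
                 * and the detach calls below. */
                SDTpnt = scsi_devicelist;
                while (SDTpnt != NULL) {
                        if (SDTpnt->detach)
                                (*SDTpnt->detach) (scd);
                        SDTpnt = SDTpnt->next;
                }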

sg_open():
    if (sdp->detached)
        return -ENODEV;

If the open() comes before sdp->detached is set, then the open will
succeed and I/Os may be issued to a nonexistent device.
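
The window described above is a check-then-act race: the remove path
tests access_count and the open path tests detached, with nothing
tying the two tests together. What follows is a minimal userspace
sketch of one way to close such a window, not the fix that went into
the kernel. Every name in it (model_dev, model_open, model_close,
model_remove) is hypothetical, a pthread mutex stands in for whatever
kernel lock would be used, and the -EBUSY return from the remove path
is an assumption of the model.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct model_dev {
    pthread_mutex_t lock;  /* serializes open against remove */
    int access_count;      /* number of current openers */
    int detached;          /* set once the device has been removed */
};

#define MODEL_DEV_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

/* Models sg_open(): test 'detached' and take a reference in one step. */
int model_open(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->detached)
        ret = -ENODEV;
    else
        d->access_count++;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Models the release path: drop the reference taken by model_open(). */
void model_close(struct model_dev *d)
{
    pthread_mutex_lock(&d->lock);
    assert(d->access_count > 0);
    d->access_count--;
    pthread_mutex_unlock(&d->lock);
}

/* Models remove-single-device: refuse if busy, otherwise mark detached.
 * The -EBUSY value is an assumption made for this sketch. */
int model_remove(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->access_count)
        ret = -EBUSY;
    else
        d->detached = 1;
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

Because both paths make their test and their update under the same
lock, an open either lands before the remove (which then sees a
nonzero access_count and backs off) or after it (and gets -ENODEV).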



Although this has not been observed on RHEL3 (yet), note that
scsi_devicelist is still a global there and is not protected by any
locks.
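
To illustrate what protecting a global list like this could look
like, here is a small userspace sketch in which registration and
traversal share one lock. It is only an illustration under stated
assumptions: model_template, model_list, model_list_lock,
model_register, model_detach_all and model_count_detach are all
hypothetical names, and a pthread mutex stands in for a kernel lock.

```c
/* Userspace sketch; all names are hypothetical stand-ins. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct model_template {
    struct model_template *next;
    void (*detach)(void *dev);  /* optional per-driver hook */
};

pthread_mutex_t model_list_lock = PTHREAD_MUTEX_INITIALIZER;
struct model_template *model_list;  /* plays the role of scsi_devicelist */

/* Add a template at the head of the list while holding the lock. */
void model_register(struct model_template *t)
{
    assert(t != NULL);
    pthread_mutex_lock(&model_list_lock);
    t->next = model_list;
    model_list = t;
    pthread_mutex_unlock(&model_list_lock);
}

/* Walk the list under the lock and call each detach hook that is set.
 * Returns the number of hooks that were called. */
int model_detach_all(void *dev)
{
    struct model_template *t;
    int called = 0;

    pthread_mutex_lock(&model_list_lock);
    for (t = model_list; t != NULL; t = t->next) {
        if (t->detach) {
            t->detach(dev);
            called++;
        }
    }
    pthread_mutex_unlock(&model_list_lock);
    return called;
}

/* Example hook for demonstration: counts how often it was invoked. */
void model_count_detach(void *dev)
{
    (*(int *)dev)++;
}
```

One design consequence of this shape is that the hooks run with the
list lock held, so a hook must not call back into model_register() or
model_detach_all(), or it would deadlock on the non-recursive mutex.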
----------
Action by: mdomsch
Investigating preventing OMSS from doing the remove-single-device...

Status set to: Waiting on Tech (Long Term)
Severity set to: High

----------
Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...

wcheng assigned to issue for Support Engineering Group.

----------
Action by: wcheng
OK, it looks like this is different from Bugzilla 126158, but it is
another hole in the device add/remove arena.


Issue escalated to Sustaining Engineering by: wcheng.

----------
Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and
put on the Update 6 must-fix blocker list.
Comment 1 Tim Burke 2004-09-13 20:33:01 EDT
Doug is doing a substantial rework of the SCSI midlayer surrounding
device addition and removal. It will consist of adding SMP locking.
This is a substantial amount of work which may be somewhat high impact
(i.e. a large change).  We are reluctant to make the corresponding
change in the RHEL2.1 pool, as we are much more conservative there.
As a result, it's possible that this one does not get addressed in the
RHEL2.1 U6 update.  It's possible that through his rework in the RHEL3
pool Doug may find a tactical minor change that addresses this one,
but that's a stretch goal.
Comment 5 Frank Hirtz 2004-10-29 11:38:25 EDT
Per Matt on 10/29 a.m.: according to Dell's testing last evening of
the respun .21-22 kernel (RHEL3), they are still seeing the system
panic with a race between add device and remove device.
Comment 6 Jim Paradis 2004-11-05 14:53:49 EST
A fix for this problem has been committed to the RHEL2.1 U6
patch pool (in kernel version 2.4.9-e.52).
Comment 7 Matt Domsch 2004-11-30 11:48:14 EST
Making the bug public.
Comment 8 John Flanagan 2004-12-13 15:06:29 EST
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-505.html
Comment 9 John Flanagan 2004-12-13 15:17:10 EST
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-504.html
Comment 10 Matt Domsch 2004-12-14 08:46:53 EST
Unfortunately, we are still able to induce a failure with this kernel
and the scripts attached in IssueTracker 45654.
Comment 12 Matt Domsch 2004-12-14 13:11:10 EST
I believe the initial issue in this bug was fixed with Update 6, that
being the race between open("/dev/sgX") and scsi-remove-single-device.
Therefore I'm re-closing this.
