Bug 131493 - [RHEL2.1] SCSI midlayer race on scsi_devicelist
Summary: [RHEL2.1] SCSI midlayer race on scsi_devicelist
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 123573
Reported: 2004-09-01 18:46 UTC by Frank Hirtz
Modified: 2007-11-30 22:06 UTC
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-12-14 18:11:10 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2004:504 | 0 | normal | SHIPPED_LIVE | Important: Updated Itanium kernel packages resolve security issues | 2004-12-13 05:00:00 UTC
Red Hat Product Errata | RHSA-2004:505 | 0 | normal | SHIPPED_LIVE | Important: Updated kernel packages fix security vulnerability | 2004-12-13 05:00:00 UTC

Description Frank Hirtz 2004-09-01 18:46:24 UTC
There is a race induced by running
echo "scsi-remove-single-device a b c d" > /proc/scsi/scsi
for a device while simultaneously doing an open() on the device node.
Incidentally, this is exactly what our OMSS product does... :-(

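proc_scsi_gen_write(): scsi-remove-single-device
                /* Refuse the removal if the device is currently in use */
                if (scd->access_count)
                        goto out;

                /* Call the detach hook of every registered upper-level driver */
                SDTpnt = scsi_devicelist;
                while (SDTpnt != NULL) {
                        if (SDTpnt->detach)
                                (*SDTpnt->detach) (scd);
                        SDTpnt = SDTpnt->next;
                }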

sg_open():
    if (sdp->detached)
        return -ENODEV;

If the open() performs its sdp->detached check before the remove path
has set that flag, the open will succeed and I/Os may be issued to a
nonexistent device.
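To make the window concrete, here is a minimal userspace sketch. It assumes POSIX threads stand in for the two kernel paths. The names fake_dev, dev_open(), dev_remove() and race_once() are hypothetical. They only model the access_count and detached fields quoted above, and this is not the fix that went into 2.4.9-e.52.

The model also assumes that a successful open records itself by incrementing access_count. The idea is that the detached test on the open side and the access_count test on the remove side must happen under one shared lock. Without that lock, an open can pass the detached test, the remove can then see access_count == 0 and detach, and both calls succeed.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace model of one SCSI device (not kernel code). */
struct fake_dev {
    pthread_mutex_t lock; /* serializes open against remove */
    int access_count;     /* number of successful opens */
    int detached;         /* set once the device has been removed */
};

/* Models sg_open(): the detached test and the access_count bump are done
 * under one lock, so a remove can never run between them. */
int dev_open(struct fake_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->detached)
        ret = -ENODEV;
    else
        d->access_count++;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Models scsi-remove-single-device: refuse while open, otherwise detach. */
int dev_remove(struct fake_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->access_count)
        ret = -EBUSY;
    else
        d->detached = 1;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

static struct fake_dev g_dev;
static int g_open_ret, g_remove_ret;

static void *open_thread(void *arg)
{
    (void)arg;
    g_open_ret = dev_open(&g_dev);
    return NULL;
}

static void *remove_thread(void *arg)
{
    (void)arg;
    g_remove_ret = dev_remove(&g_dev);
    return NULL;
}

/* Races one open against one remove on a fresh device. Returns 1 when
 * exactly one of them succeeded, which is the only safe outcome. */
int race_once(void)
{
    pthread_t a, b;

    pthread_mutex_init(&g_dev.lock, NULL);
    g_dev.access_count = 0;
    g_dev.detached = 0;
    pthread_create(&a, NULL, open_thread, NULL);
    pthread_create(&b, NULL, remove_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_mutex_destroy(&g_dev.lock);
    assert(g_dev.access_count <= 1); /* at most one open was recorded */
    return (g_open_ret == 0) != (g_remove_ret == 0);
}
```

Calling race_once() in a loop should always return 1 with the lock in place. If the lock calls are removed from dev_open() and dev_remove(), the "both succeeded" outcome from the report becomes possible.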



While this has not been observed on RHEL3 (yet), scsi_devicelist is
still a global there and is not protected by any locks.
----------
Action by: mdomsch
Investigating how to keep OMSS from issuing the remove-single-device...

Status set to: Waiting on Tech (Long Term)
Severity set to: High

----------
Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...

wcheng assigned to issue for Support Engineering Group.

----------
Action by: wcheng
OK, it looks like this is different from bugzilla 126158; it is another
hole in the device add/remove area.


Issue escalated to Sustaining Engineering by: wcheng.

----------
Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and
put on the Update 6 must-fix blocker list.

Comment 1 Tim Burke 2004-09-14 00:33:01 UTC
Doug is doing a substantial rework of the SCSI midlayer surrounding
device addition and removal, which will consist of adding SMP locking.
This is a substantial amount of work which may be somewhat high impact
(i.e. a large change). We are reluctant to make the corresponding change
in the RHEL2.1 pool, as we are much more conservative there. As a
result, it's possible that this one does not get addressed in the
RHEL2.1 U6 update. It's possible that, through his rework in the RHEL3
pool, Doug may find a minor tactical change that addresses this one,
but that's a stretch goal.


Comment 5 Frank Hirtz 2004-10-29 15:38:25 UTC
Per Matt on 10/29 a.m.: in Dell's testing last evening of the respun
.21-22 kernel (RHEL3), they are still seeing a system panic caused by a
race between add-device and remove-device.

Comment 6 Jim Paradis 2004-11-05 19:53:49 UTC
A fix for this problem has been committed to the RHEL2.1 U6
patch pool (in kernel version 2.4.9-e.52)


Comment 7 Matt Domsch 2004-11-30 16:48:14 UTC
Making the bug public.

Comment 8 John Flanagan 2004-12-13 20:06:29 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-505.html


Comment 9 John Flanagan 2004-12-13 20:17:10 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-504.html


Comment 10 Matt Domsch 2004-12-14 13:46:53 UTC
Unfortunately, we are still able to induce a failure with this kernel,
using the scripts attached to IssueTracker 45654.

Comment 12 Matt Domsch 2004-12-14 18:11:10 UTC
I believe the original issue in this bug, the race between
open("/dev/sgX") and scsi-remove-single-device, was fixed in Update 6.
Therefore I'm re-closing this.

