Bug 131493

Summary: [RHEL2.1] SCSI midlayer race on scsi_devicelist
Product: Red Hat Enterprise Linux 2.1 Reporter: Frank Hirtz <fhirtz>
Component: kernel Assignee: Doug Ledford <dledford>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 2.1 CC: dhoward, jbaron, jparadis, riel, tao, us_linux_engineering
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-12-14 18:11:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 123573    

Description Frank Hirtz 2004-09-01 18:46:24 UTC
There is a race induced by running
'echo "scsi-remove-single-device a b c d" > /proc/scsi/scsi' for a
device while simultaneously doing an open() on the device node.
Incidentally, this is what our OMSS product does... :-(

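proc_scsi_gen_write(): scsi-remove-single-device
                /* Refuse removal while the device is open... */
                if (scd->access_count)
                        goto out;

                /* ...but nothing stops an open() from arriving here. */
                SDTpnt = scsi_devicelist;
                while (SDTpnt != NULL) {
                        if (SDTpnt->detach)
                                (*SDTpnt->detach) (scd);
                        SDTpnt = SDTpnt->next;
                }

sg_open():
    /* Only rejects the open once sdp->detached has been set. */
    if (sdp->detached)
        return -ENODEV;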

If the open() comes before sdp->detached is set, then the open will
succeed and I/Os may be issued to a nonexistent device.
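
For illustration only, here is a minimal user-space model of the
check-then-act pattern above and of one way to close the window. It is
a sketch under stated assumptions, not kernel code and not the fix that
was eventually shipped: struct model_dev, model_init(), model_open(),
model_remove() and the lock itself are all hypothetical names, and the
-EBUSY return for a busy device is assumed. The point is simply that
each path tests the other path's flag and updates its own state while
holding the same lock, so an open cannot slip in between the
access_count check and the detach.

```c
/* User-space model only; hypothetical names, not the RHEL kernel fix. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-device state in the excerpts. */
struct model_dev {
    pthread_mutex_t lock;   /* assumed lock; the report says none exists */
    int access_count;       /* number of open handles */
    int detached;           /* set once the device has been removed */
};

void model_init(struct model_dev *d)
{
    int rc = pthread_mutex_init(&d->lock, NULL);

    assert(rc == 0);        /* model code: treat init failure as fatal */
    (void)rc;
    d->access_count = 0;
    d->detached = 0;
}

/* Models sg_open(): test 'detached' and take a reference under the lock. */
int model_open(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->detached)
        ret = -ENODEV;
    else
        d->access_count++;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Models the remove path: test 'access_count' and detach under the lock. */
int model_remove(struct model_dev *d)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (d->access_count)
        ret = -EBUSY;       /* assumed error code for a busy device */
    else
        d->detached = 1;
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

With this model, whichever of model_open() and model_remove() takes
the lock first wins: an open that gets in first makes the remove fail
as busy, and a remove that gets in first makes the open fail with
-ENODEV. Neither ordering leaves an open handle on a detached device.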



Although this has not (yet) been observed on RHEL3, scsi_devicelist
there is still a global and is not protected by any locks.
----------
Action by: mdomsch
Investigating preventing OMSS from doing the remove-single-device...

Status set to: Waiting on Tech (Long Term)
Severity set to: High

----------
Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...

wcheng assigned to issue for Support Engineering Group.

----------
Action by: wcheng
OK, it looks like this is different from bugzilla 126158, but it is
another hole in the device add/remove area.


Issue escalated to Sustaining Engineering by: wcheng.

----------
Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and
put on the Update 6 must-fix blocker list.

Comment 1 Tim Burke 2004-09-14 00:33:01 UTC
Doug is doing a substantial rework of the scsi midlayer surrounding
device addition and removal. It will consist of adding smp locking. 
This is a substantial amount of work which may be somewhat high impact
(i.e. a large change).  We are reluctant to make the corresponding
change in the RHEL2.1 pool, as we are much more conservative there.  As
a result, it's possible that this one does not get addressed in the
RHEL2.1 U6 update.  It's possible that through his rework in the RHEL3
pool Doug may find a tactical minor change that addresses this
one, but that's a stretch goal.


Comment 5 Frank Hirtz 2004-10-29 15:38:25 UTC
Per Matt on the morning of 10/29: according to Dell's testing last
evening of the respun .21-22 kernel (RHEL3), they are still seeing the
system panic with a race between add device and remove device.

Comment 6 Jim Paradis 2004-11-05 19:53:49 UTC
A fix for this problem has been committed to the RHEL2.1 U6
patch pool (in kernel version 2.4.9-e.52)


Comment 7 Matt Domsch 2004-11-30 16:48:14 UTC
Making the bug public.

Comment 8 John Flanagan 2004-12-13 20:06:29 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-505.html


Comment 9 John Flanagan 2004-12-13 20:17:10 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-504.html


Comment 10 Matt Domsch 2004-12-14 13:46:53 UTC
Unfortunately, we are still able to induce a failure with this kernel
and the scripts attached in IssueTracker 45654.

Comment 12 Matt Domsch 2004-12-14 18:11:10 UTC
I believe the initial issue in this bug, the race between
open("/dev/sgX") and scsi-remove-single-device, was fixed with
Update 6.  Therefore I'm re-closing this.