Bug 131518

Summary: [RHEL3 U4] SCSI midlayer race on scsi_devicelist
Product: Red Hat Enterprise Linux 3 Reporter: Frank Hirtz <fhirtz>
Component: kernel Assignee: Doug Ledford <dledford>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: low Docs Contact:
Priority: low    
Version: 3.0 CC: dff, jbaron, mwesley, peterm, petrides, riel, tao, tburke, us_linux_engineering
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-10-11 18:52:58 UTC Type: ---
Attachments:
Description | Flags
Kernel backtrace from Dell detailing the above reported panic | none
Cleanups needed to try and make device add/removal race free | none
More cleanups needed to make add/remove race free | none

Description Frank Hirtz 2004-09-01 20:27:20 UTC
There is a race induced by running
echo "scsi-remove-single-device a b c d" > /proc/scsi/scsi
for a device while simultaneously doing an open() on the device
node.  Incidentally, this is what our OMSS product does... :-(

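proc_scsi_gen_write(): scsi-remove-single-device
                /* refuse to remove a device that is currently open */
                if (scd->access_count)
                        goto out;

                /* an open() can slip in between the check above
                   and the detach calls below */
                SDTpnt = scsi_devicelist;
                while (SDTpnt != NULL) {
                        if (SDTpnt->detach)
                                (*SDTpnt->detach) (scd);
                        SDTpnt = SDTpnt->next;
                }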

sg_open():
    if (sdp->detached)
        return -ENODEV;

If the open() comes before sdp->detached is set, then the open will
succeed and I/Os may be issued to a nonexistent device.
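
The interleaving described above can be modelled in user space. The following is only a sketch under stated assumptions, not kernel code and not the fix that was eventually committed: struct model_dev, the racy_* and locked_* functions, and the per-device lock are all hypothetical names introduced for illustration. The racy remove is split into its check and its detach step so that an open can be placed between them deterministically; the locked variants do the check and the update inside one critical section.

```c
/* User-space model of the remove/open race; not kernel code. */
#include <assert.h>     /* for callers that want to assert on the results */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device state in the excerpts. */
struct model_dev {
    atomic_flag lock;   /* assumed per-device lock, not in the original code */
    int access_count;   /* number of current openers */
    bool detached;      /* set once the device has been removed */
};

void model_dev_init(struct model_dev *d)
{
    atomic_flag_clear(&d->lock);
    d->access_count = 0;
    d->detached = false;
}

static void model_lock(struct model_dev *d)
{
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
        ;   /* spin until the current holder clears the flag */
}

static void model_unlock(struct model_dev *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Racy remove, step 1: "is anyone using the device?" */
bool racy_remove_check(const struct model_dev *d)
{
    return d->access_count == 0;
}

/* Racy remove, step 2: detach, acting on a possibly stale check. */
void racy_remove_detach(struct model_dev *d)
{
    d->detached = true;
}

/* Racy open: same shape as the sg_open() excerpt, no lock held. */
int racy_open(struct model_dev *d)
{
    if (d->detached)
        return -1;      /* stands in for -ENODEV */
    d->access_count++;
    return 0;
}

/* Locked remove: check and detach cannot be separated by an open. */
int locked_remove(struct model_dev *d)
{
    int ret = -1;       /* device busy: refuse to remove */

    model_lock(d);
    if (d->access_count == 0) {
        d->detached = true;
        ret = 0;
    }
    model_unlock(d);
    return ret;
}

/* Locked open: check and count update happen under the same lock. */
int locked_open(struct model_dev *d)
{
    int ret = -1;       /* stands in for -ENODEV */

    model_lock(d);
    if (!d->detached) {
        d->access_count++;
        ret = 0;
    }
    model_unlock(d);
    return ret;
}
```

Calling racy_remove_check(), then racy_open(), then racy_remove_detach() leaves the model with detached set while access_count is 1, which is the state the report warns about. With the locked variants, whichever call runs first wins and the other one fails cleanly.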



While this has not been observed on RHEL3 (yet), note that
scsi_devicelist is still a global and is not protected by any locks.
----------
Action by: mdomsch
Investigating preventing OMSS from doing the remove-single-device...

Status set to: Waiting on Tech (Long Term)
Severity set to: High

----------
Action by: fhirtz


fhirtz assigned to issue for Dell-Engineering.

Category set to: Kernel
Status set to: Waiting on Tech

----------
Action by: fhirtz



Issue escalated to Support Engineering Group by: fhirtz.

----------
Action by: fhirtz


Summary edited.

----------
Action by: wcheng
I thought we just fixed this issue - checking the rhkernel-list now...

wcheng assigned to issue for Support Engineering Group.

----------
Action by: wcheng
OK, it looks like this is different from bugzilla 126158, but it is
another hole in the add and remove device arena.


Issue escalated to Sustaining Engineering by: wcheng.

----------
Action by: sdenham
Escalated to Bugzilla
----------
Action by: mdomsch
Per concall discussion today, this needs to be pushed to Bugzilla and
put on the Update 6 must-fix blocker list.


----------
Action by: fhirtz
Escalated to Bugzilla
----------
Action by: fhirtz


Summary edited.

----------
Action by: fhirtz


Bugzilla id 131493 added to issue.

Comment 1 Frank Hirtz 2004-09-01 20:28:56 UTC
Note: This is the same issue as is logged in 131493, but against
RHEL3. Dell believes that this issue might impact us on RHEL3 though
it hasn't been explicitly seen on the platform as of yet.

Comment 4 Frank Hirtz 2004-09-29 16:39:09 UTC
<feedback from Dell>
This new timeline is problematic.  We didn't put any resources into
solving this ourselves because Tim Burke had said that he considered
this a showstopper for U4, and had assigned Doug to work on it.  This
problem lets root easily crash the kernel at runtime.  Updates
shouldn't be released with known kernel crash issues open.

Raising to Sev. 1 (Urgent).

Comment 6 Ernie Petrides 2004-10-19 03:02:21 UTC
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-22.EL).


Comment 7 Frank Hirtz 2004-10-29 15:38:04 UTC
Per Matt on 10/29 a.m.:  according to Dell's testing last evening of
the respun .21-22 kernel, they are still seeing the system panic with
a race between add device and remove device.

Comment 8 Susan Denham 2004-10-29 15:39:59 UTC
Urgent:  Per Matt Domsch on 10/29 a.m.:  Dell tested the
RHEL 3 U4 respun kernel (.21-22) last evening and are STILL seeing the
system panic with a race between add device and remove device.

Comment 9 Don Howard 2004-11-09 17:39:09 UTC
Created attachment 106358 [details]
Kernel backtrace from Dell detailing the above reported panic

Comment 10 Ernie Petrides 2004-11-09 23:46:20 UTC
Frank/Sue/Don, can this problem be reproduced with a non-tainted kernel?

Doug, should they open a new bugzilla if so?  Or is the original problem
still unresolved?

(I'm leaving this bug in MODIFIED state under the assumption that Doug's
patch taken into 2.4.21-22.EL has actually fixed something.  Unless I'm
instructed otherwise, this bug will remain associated with the U4 erratum
and will automatically be closed when U4 is released.)


Comment 11 Matt Domsch 2004-11-30 16:46:33 UTC
Making the bug public.

Comment 12 Marty Wesley 2004-12-03 22:40:45 UTC
Matt, is there a good reason you made this bug public?

Comment 13 John Flanagan 2004-12-20 20:56:05 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html


Comment 15 Susan Denham 2005-01-11 17:17:49 UTC
Dell is reopening this issue, which is still seen on RHEL 3 U4 and
RHEL 2.1 U6.  While it's true that the U4 and U6 errata do address one
failing condition, there are other failures that are still not
resolved as identified in IT 45654 as follows:

"RHEL 2.1 U6 and RHEL3 U4 still can be made to fail using the scripts
noted above.  This needs to be a RHEL4 U5 blocker.

The access_count variables aren't always access protected by
q->queue_lock, and you use spin_lock_irq() in places where I suspect
_irqsave() might be more correct.  We're still hitting races with
sg_release() (aka sg_close()) and device removal using the ploink and
scanbus apps, and with our own OMSS utility. :-( "
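
The spin_lock_irq() versus _irqsave() remark in the quote above turns on what happens to the interrupt state at unlock time. The sketch below is a user-space model only, assuming a single simulated "interrupts enabled" flag; every sim_* and touch_count_* name is hypothetical. It shows the general difference between the two pairings (unconditional re-enable versus restore of the saved state) and makes no claim about which kernel call sites are affected.

```c
/* User-space model of interrupt-state handling; not kernel code. */
#include <assert.h>     /* for callers that want to assert on the results */
#include <stdbool.h>

/* Simulated "local interrupts enabled" flag for one CPU. */
static bool sim_irqs_enabled = true;

void sim_set_irqs(bool enabled) { sim_irqs_enabled = enabled; }
bool sim_irqs_on(void)          { return sim_irqs_enabled; }

/* Models the spin_lock_irq()/spin_unlock_irq() pairing:
 * disable on lock, unconditionally enable on unlock. */
void sim_lock_irq(void)   { sim_irqs_enabled = false; }
void sim_unlock_irq(void) { sim_irqs_enabled = true; }

/* Models the spin_lock_irqsave()/spin_unlock_irqrestore() pairing:
 * remember the previous state and put it back on unlock. */
bool sim_lock_irqsave(void)
{
    bool flags = sim_irqs_enabled;

    sim_irqs_enabled = false;
    return flags;
}

void sim_unlock_irqrestore(bool flags)
{
    sim_irqs_enabled = flags;
}

/* Hypothetical counter update using the unconditional pairing. */
void touch_count_irq(int *count)
{
    sim_lock_irq();
    (*count)++;
    sim_unlock_irq();   /* interrupts come back on, even if the caller had them off */
}

/* Same update using save/restore; the caller's state survives. */
void touch_count_irqsave(int *count)
{
    bool flags = sim_lock_irqsave();

    (*count)++;
    sim_unlock_irqrestore(flags);
}
```

If the simulated flag is off before the call, touch_count_irq() returns with it on, while touch_count_irqsave() returns with it still off. That is why the save/restore form is the safer choice for code that may be entered with interrupts already disabled.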

Adding back to the U5 blocker list.

Comment 19 Doug Ledford 2005-03-01 08:25:11 UTC
Created attachment 111519 [details]
Cleanups needed to try and make device add/removal race free

Part 1 of the cleanups needed.

Comment 20 Doug Ledford 2005-03-01 08:26:23 UTC
Created attachment 111520 [details]
More cleanups needed to make add/remove race free

Part 2 of the needed cleanups.

Comment 21 Doug Ledford 2005-03-01 08:28:49 UTC
There still needs to be a Part 3 to complete this work.  As I
mentioned before, this is rather invasive, and there is a question as
to whether or not it would get past the internal patch review process
on the grounds that it's too risky of a change to core code without
first having upstream testing.

Comment 27 Ernie Petrides 2005-10-11 18:52:58 UTC
According to Doug Ledford, "fixing this bug entirely is very likely
beyond the scope of acceptable change in the rhel3 product."

Thus, I'm closing this as WONTFIX.