Bug 142858 - RHEL2.1 U6: race between scsi-remove-single-device and scsi-add-single-device
Summary: RHEL2.1 U6: race between scsi-remove-single-device and scsi-add-single-device
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel   
(Show other bugs)
Version: 2.1
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Jim Paradis
QA Contact:
Depends On:
Blocks: 132992
TreeView+ depends on / blocked
Reported: 2004-12-14 18:19 UTC by Matt Domsch
Modified: 2013-08-06 01:10 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2005-02-18 21:07:31 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

Description Matt Domsch 2004-12-14 18:19:57 UTC
Description of problem:
Outlined in IssueTracker 45654.  There were two races described there,
against two OS versions.  This bugzilla is to address only one of
those scenarios, RHEL 2.1 <= Update 6 (kernel 2.4.9-e.57) and the race
between scsi-remove-single-device and scsi-add-single-device going to
the same device.

This is a snipped of disassembly/source from scsi_scan.c in the
function scan_scsis().  We observe a NULL pointer deference in the
code below when multiple instances of a script which adds and removes
SCSI devices is run.

                for (dqptr = shpnt->host_queue; dqptr != SDpnt; dqptr
= dqptr->next)
    b6fd:       8b 8c 24 28 01 00 00    mov    0x128(%esp,1),%ecx
    b704:       8b 51 04                mov    0x4(%ecx),%edx
    b707:       39 c2                   cmp    %eax,%edx
    b709:       74 0b                   je     b716 <scan_scsis+0x3c6>
    b70b:       89 c1                   mov    %eax,%ecx
    b70d:       8d 76 00                lea    0x0(%esi),%esi
    b710:       8b 12                   mov    (%edx),%edx
          (the above line is the NULL ptr dereference, %edx=0)
    b712:       39 ca                   cmp    %ecx,%edx
    b714:       75 fa                   jne    b710 <scan_scsis+0x3c0>

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. configure system with lots of disks, no file systems on them
2. run multiple instances of scanbus script
3. crash in <2 minutes
Actual results:
Null ptr dereference and kernel crash

Expected results:
no crash

Comment 1 Matt Domsch 2004-12-14 19:09:50 UTC
This is the result of following the linked list of Scsi_Devices on the
host, and finding that the ->next pointer is NULL but never testing
for that.  First question is, why did the remove succeed while we were
using it, and the answer is that SDpnt->access_count was never
incremented at SDpnt allocation, though we are using it.  So, fix
that, and decrement it later.  Throw in the NULL case check when
walking the ->next ptr, and we should be better.

Doug, how's this look?

I'll have our test team try this patch.

--- scsi_scan.c.~1~     Thu Dec  2 19:25:29 2004
+++ scsi_scan.c Tue Dec 14 08:00:18 2004
@@ -401,6 +401,7 @@ void scan_scsis(struct Scsi_Host *shpnt,
        SDpnt->queue_depth = 1;
        SDpnt->host = shpnt;
        SDpnt->online = TRUE;
+       SDpnt->access_count = 1;


@@ -521,8 +522,10 @@ void scan_scsis(struct Scsi_Host *shpnt,
        {                       /* Unchain SRpnt from host_queue */
                Scsi_Device *prev, *next;
                Scsi_Device *dqptr;
+               request_queue_t *q;
+               unsigned long flags;

-               for (dqptr = shpnt->host_queue; dqptr != SDpnt; dqptr
= dqptr->next)
+               for (dqptr = shpnt->host_queue; dqptr && dqptr !=
SDpnt; dqptr = dqptr->next)
                if (dqptr) {
                        prev = dqptr->prev;
@@ -533,6 +536,11 @@ void scan_scsis(struct Scsi_Host *shpnt,
                                shpnt->host_queue = next;
                        if (next)
                                next->prev = prev;
+                       q = &dqptr->request_queue;
+                       spin_lock_irqsave(q->queue_lock, &flags);
+                       dqptr->access_count--;
+                       spin_unlock_irqrestore(q->queue_lock, &flags);

Comment 2 Matt Domsch 2004-12-14 22:14:46 UTC
The above patch won't quite cut it, of course.  Because the NULL deref
happens while walking a list of Scsi_Devices, it's very likely that
the remove happens to a device further along the chain than the
current device.  So what needs to be protected for deletion is the
list of devices (i.e. convert to struct list_head, use
list_for_each_entry_safe() here to be safe from deletion).  The access
count aspect would protect against add/remove to the same device, so
is still necessary, but would be a different failure case.

Comment 3 Doug Ledford 2004-12-15 14:29:00 UTC
No, this is a common misconception about list_for_each_entry_safe(). 
It *only* protects against deletions done inside the iteration loop,
it is *not* safe against deletions from other code paths. 
Furthermore, it is only safe against deletion of the currently return
list item, it does *not* protect against say deletion of the next item
in the list.  If you were to delete the next item in the list, then
list_for_each_entry_safe would happily use a next pointer to an
already freed object and, depending things like SLAB poisoning being
enabled, promptly blow up with an invalid pointer.  To get SMP safety,
you need locking around the list.

The real answer to this (at least the one I'm looking at for RHEL3,
and I'm not sure AS2.1 is ever going to get this fully fixed, it's
simply too invasive for a product in the stage of life cycle that
AS2.1 is in) is to A) serialize remove/add operations in the mid
layer, B) unhook any removed devices immediately, but leave the ->next
entry valid and pointing to what would be next in the list, C) create
a delayed work queue entry and actually free the device at some point
in the future when all threads of execution that may have gotten a
reference to the removed device have completed their iteration loops
and it's safe to kfree() the scsi device struct.  This implies that
the work queue needs to wait until no hosts are in error recovery
since error recover can sleep in the middle of their iteration loops
while no other place in the scsi stack is allowed to sleep in the
middle of an iteration of the scsi devices on a host.  So, if you make
sure that there are no error recovery operations in progress, then
something simple like scheduling the removal for HZ jiffies in the
future will ensure that there are no code paths holding a reference to
the device and no oops will occur.

Comment 4 Jim Paradis 2005-02-10 04:29:09 UTC
Since the proper fix is too invasive for an update, what would be the
implications of closing as WONTFIX?

Note You need to log in before you can comment on or make changes to this bug.