Description of problem: Outlined in IssueTracker 45654. There were two races described there, against two OS versions. This bugzilla is to address only one of those scenarios, RHEL 2.1 <= Update 6 (kernel 2.4.9-e.57) and the race between scsi-remove-single-device and scsi-add-single-device going to the same device. This is a snipped of disassembly/source from scsi_scan.c in the function scan_scsis(). We observe a NULL pointer deference in the code below when multiple instances of a script which adds and removes SCSI devices is run. for (dqptr = shpnt->host_queue; dqptr != SDpnt; dqptr = dqptr->next) b6fd: 8b 8c 24 28 01 00 00 mov 0x128(%esp,1),%ecx b704: 8b 51 04 mov 0x4(%ecx),%edx b707: 39 c2 cmp %eax,%edx b709: 74 0b je b716 <scan_scsis+0x3c6> b70b: 89 c1 mov %eax,%ecx b70d: 8d 76 00 lea 0x0(%esi),%esi b710: 8b 12 mov (%edx),%edx (the above line is the NULL ptr dereference, %edx=0) b712: 39 ca cmp %ecx,%edx b714: 75 fa jne b710 <scan_scsis+0x3c0> continue; Version-Release number of selected component (if applicable): 2.4.9-e.57smp How reproducible: easy Steps to Reproduce: 1. configure system with lots of disks, no file systems on them 2. run multiple instances of scanbus script 3. crash in <2 minutes Actual results: Null ptr dereference and kernel crash Expected results: no crash
This is the result of following the linked list of Scsi_Devices on the host, and finding that the ->next pointer is NULL but never testing for that. First question is, why did the remove succeed while we were using it, and the answer is that SDpnt->access_count was never incremented at SDpnt allocation, though we are using it. So, fix that, and decrement it later. Throw in the NULL case check when walking the ->next ptr, and we should be better. Doug, how's this look? I'll have our test team try this patch. --- scsi_scan.c.~1~ Thu Dec 2 19:25:29 2004 +++ scsi_scan.c Tue Dec 14 08:00:18 2004 @@ -401,6 +401,7 @@ void scan_scsis(struct Scsi_Host *shpnt, SDpnt->queue_depth = 1; SDpnt->host = shpnt; SDpnt->online = TRUE; + SDpnt->access_count = 1; initialize_merge_fn(SDpnt); @@ -521,8 +522,10 @@ void scan_scsis(struct Scsi_Host *shpnt, { /* Unchain SRpnt from host_queue */ Scsi_Device *prev, *next; Scsi_Device *dqptr; + request_queue_t *q; + unsigned long flags; - for (dqptr = shpnt->host_queue; dqptr != SDpnt; dqptr = dqptr->next) + for (dqptr = shpnt->host_queue; dqptr && dqptr != SDpnt; dqptr = dqptr->next) continue; if (dqptr) { prev = dqptr->prev; @@ -533,6 +536,11 @@ void scan_scsis(struct Scsi_Host *shpnt, shpnt->host_queue = next; if (next) next->prev = prev; + + q = &dqptr->request_queue; + spin_lock_irqsave(q->queue_lock, &flags); + dqptr->access_count--; + spin_unlock_irqrestore(q->queue_lock, &flags); } }
The above patch won't quite cut it, of course. Because the NULL deref happens while walking a list of Scsi_Devices, it's very likely that the remove happens to a device further along the chain than the current device. So what needs to be protected for deletion is the list of devices (i.e. convert to struct list_head, use list_for_each_entry_safe() here to be safe from deletion). The access count aspect would protect against add/remove to the same device, so is still necessary, but would be a different failure case.
No, this is a common misconception about list_for_each_entry_safe(). It *only* protects against deletions done inside the iteration loop, it is *not* safe against deletions from other code paths. Furthermore, it is only safe against deletion of the currently return list item, it does *not* protect against say deletion of the next item in the list. If you were to delete the next item in the list, then list_for_each_entry_safe would happily use a next pointer to an already freed object and, depending things like SLAB poisoning being enabled, promptly blow up with an invalid pointer. To get SMP safety, you need locking around the list. The real answer to this (at least the one I'm looking at for RHEL3, and I'm not sure AS2.1 is ever going to get this fully fixed, it's simply too invasive for a product in the stage of life cycle that AS2.1 is in) is to A) serialize remove/add operations in the mid layer, B) unhook any removed devices immediately, but leave the ->next entry valid and pointing to what would be next in the list, C) create a delayed work queue entry and actually free the device at some point in the future when all threads of execution that may have gotten a reference to the removed device have completed their iteration loops and it's safe to kfree() the scsi device struct. This implies that the work queue needs to wait until no hosts are in error recovery since error recover can sleep in the middle of their iteration loops while no other place in the scsi stack is allowed to sleep in the middle of an iteration of the scsi devices on a host. So, if you make sure that there are no error recovery operations in progress, then something simple like scheduling the removal for HZ jiffies in the future will ensure that there are no code paths holding a reference to the device and no oops will occur.
Since the proper fix is too invasive for an update, what would be the implications of closing as WONTFIX?