Description of problem:
If the initialization of a command structure in
scsi_lib.c:scsi_request_fn() fails, the struct
request is dequeued, but not released, causing
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Queue up lots of I/O to a device
2. Offline that device
Actual Results: Assuming enough I/O has been queued, the requests
for the queue may be exhausted. If you happen to
have processes sleeping waiting for requests to
free up on this device, they will sleep forever,
never noticing that the device has been offlined.
Expected Results: Requests for failed transactions should be released.
Looking at the latest 2.4.X kernel it seems that
scsi_request_fn() has been cleaned up such that cleanup
on error is much simpler. One fix is to re-sync with
the stock 2.4.X code. The less risky fix is to add a
call to blkdev_release_request() near line 1222 after
the request is dequeued if the request is not embedded
in the SCSI command structure.
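
To illustrate the leak and the proposed fix, here is a minimal user-space
sketch. It is not kernel code: toy_queue, toy_request, toy_dispatch() and
POOL_SIZE are hypothetical stand-ins, and toy_release_request() only plays
the role that blkdev_release_request() would play in scsi_request_fn().
It assumes a fixed pool of requests per queue and a device that is
already offline.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4 /* hypothetical pool size, chosen only for the demo */

/* Toy stand-in for struct request: tracks only whether it is allocated. */
struct toy_request {
    int in_use;
};

/* Toy stand-in for a request queue with a fixed pool of requests. */
struct toy_queue {
    struct toy_request pool[POOL_SIZE];
    int free_count;
    int device_online;
};

void toy_queue_init(struct toy_queue *q, int online)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        q->pool[i].in_use = 0;
    q->free_count = POOL_SIZE;
    q->device_online = online;
}

/* Take a request from the pool; NULL is where a real submitter would sleep. */
struct toy_request *toy_get_request(struct toy_queue *q)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!q->pool[i].in_use) {
            q->pool[i].in_use = 1;
            q->free_count--;
            return &q->pool[i];
        }
    }
    return NULL;
}

/* Plays the role of blkdev_release_request(): hand the request back. */
void toy_release_request(struct toy_queue *q, struct toy_request *req)
{
    assert(req->in_use);
    req->in_use = 0;
    q->free_count++;
}

/*
 * Models the failing path: the request has already been dequeued and
 * command initialization fails because the device is offline.
 * Returns 0 on success, -1 on failure.
 */
int toy_dispatch(struct toy_queue *q, struct toy_request *req,
                 int release_on_failure)
{
    if (!q->device_online) {
        if (release_on_failure)
            toy_release_request(q, req); /* the proposed fix */
        return -1;
    }
    toy_release_request(q, req); /* normal completion */
    return 0;
}

/* Submit up to n I/Os to an offline device; return the free requests left. */
int toy_run(int n, int release_on_failure)
{
    struct toy_queue q;

    toy_queue_init(&q, 0);
    for (int i = 0; i < n; i++) {
        struct toy_request *req = toy_get_request(&q);
        if (req == NULL)
            break; /* pool exhausted */
        toy_dispatch(&q, req, release_on_failure);
    }
    return q.free_count;
}
```

In this model, toy_run(10, 0) drains the pool to zero free requests,
while toy_run(10, 1) leaves all POOL_SIZE requests available.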
You need to be more specific. Are you referring to it failing to init
the command as part of SDpnt->scsi_init_io_fn(SCpnt) or as
STpnt->init_command(SCpnt)? If the first, then the command is left on
the request queue, so freeing it is the wrong thing to do (this is
considered a temporary failure and we want to retry the command
later). If it's the latter, then the code I'm looking at calls
blkdev_dequeue_request(), scsi_release_buffers(), and
__end_scsi_request(), which should do what you want.
I am talking about the latter. Neither scsi_release_buffers() nor
have the ability to release the incoming request since it is not
referenced in any way by the scsi_command structure (it may only have
a copy). That is why it
must be freed by blkdev_release_request(). See similar code later in
that calls blkdev_release_request() explicitly if the request embedded
in the scsi_command structure is only a copy.
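
The copy-versus-reference point can be shown with a small user-space
sketch. The types and helpers here (toy_request, toy_cmd, toy_init_cmd(),
toy_cleanup_cmd(), toy_release_request()) are hypothetical and are not the
kernel's definitions. The only assumption carried over from the
discussion is that the command holds a by-value copy of the request, so
cleanup that sees only the command cannot reach the original.

```c
#include <assert.h>

/* Toy stand-in for struct request. */
struct toy_request {
    int in_use;
    int sector;
};

/* Toy stand-in for the SCSI command: it embeds a COPY of the request. */
struct toy_cmd {
    struct toy_request request;
};

/* Initialize the command from a request by copying it by value. */
void toy_init_cmd(struct toy_cmd *cmd, const struct toy_request *req)
{
    assert(req->in_use);
    cmd->request = *req;
}

/* Cleanup that only sees the command: it can touch the copy, nothing else. */
void toy_cleanup_cmd(struct toy_cmd *cmd)
{
    cmd->request.in_use = 0;
}

/* Explicit release of the original, the role blkdev_release_request() plays. */
void toy_release_request(struct toy_request *req)
{
    req->in_use = 0;
}

/* Returns 1 if the original request is still marked in use after cleanup. */
int toy_original_leaks(int release_original)
{
    struct toy_request original = { 1, 1234 };
    struct toy_cmd cmd;

    toy_init_cmd(&cmd, &original);
    toy_cleanup_cmd(&cmd);
    if (release_original)
        toy_release_request(&original);
    return original.in_use;
}
```

Here toy_original_leaks(0) reports that the original is still in use,
and toy_original_leaks(1) reports that it has been released.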
Justin, I added a call to blkdev_release_request() in the area you
were talking about. This should be fixed now. If you want to check
for yourself, the source is located in a bk tree on
Justin, someone pointed out to me that my last entry was vague. I
should note that the change I referenced is *not* in RHEL3 U2, but
instead is a proposed fix for U3. The BK tree I referenced was
originally populated with the U2 source code and now contains the
patches I've made so far in preparation for U3. Unlike previous
updates, I'm doing the work on the SCSI stack as one combined set of
changes rather than as individual patches, to make sure some
overlapping bugs I've got all get solved properly and don't cause
unintended interactions. For that reason, it's easiest for people who
want to test that their particular problem is solved to pull from the
BK tree.
Thanks for the clarification, Doug. I'm putting this bug back into
"assigned" state to reflect that a fix has not yet been committed to
the U3 kernel patch pool.
A fix for this problem has been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.6.EL).
*** Bug 129174 has been marked as a duplicate of this bug. ***
An erratum has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.