Description of problem:
Function scsi_mlqueue_insert() does not restore data direction before replacing task on queue when device is busy.

Version-Release number of selected component (if applicable):
Linux 2.4.9-e16enterprise

How reproducible:
Very reproducible if the path taken by the code is device busy.

Steps to Reproduce:
1. Get scsi target to bounce a task Queue Full
2. Use Finisar trace to detect data direction (or wait for kernel crash!)

Actual results:
Linux kernel crash with message;
Kernel panic:scsi_free:Bad offset

Expected results:
(You are joking, aren't you!)

Additional info:
The fix is as follows; The last few lines of scsi_mlqueue_insert() should read;

    /*
     * Decrement the counters, since these commands are no longer
     * active on the host/device.
     */
    atomic_dec(&cmd->host->host_busy);
    atomic_dec(&cmd->device->device_busy);

    /*
     * Restore the direction
     */
    cmd->sc_data_direction = cmd->sc_old_data_direction;

    /*
     * Insert this command at the head of the queue for it's device.
     * It will go before all other commands that are already in the queue.
     */
    scsi_insert_special_cmd(cmd, 1);
    return 0;

The fix is to restore the direction. This fix has been tested and it is OK. Compare this statement with the similar one in scsi_retry_command(), which is the path taken when the device is not busy.
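To make the proposed fix easier to reason about outside the kernel, here is a minimal user-space sketch of the save/restore idea. Everything prefixed toy_ or TOY_ is hypothetical and only mirrors the two field names from the report; it is not the kernel's Scsi_Cmnd, and it assumes (as the report implies) that something changes sc_data_direction while the command is in flight.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum toy_direction { TOY_DATA_WRITE, TOY_DATA_READ, TOY_DATA_NONE };

struct toy_cmd {
    enum toy_direction sc_data_direction;     /* direction used for the current attempt */
    enum toy_direction sc_old_data_direction; /* direction saved at first submission */
};

/* Record the direction once, when the command is first set up. */
void toy_init_cmd(struct toy_cmd *cmd, enum toy_direction dir)
{
    assert(cmd != NULL);
    cmd->sc_data_direction = dir;
    cmd->sc_old_data_direction = dir;
}

/* Requeue without touching the direction: models the reported behaviour. */
void toy_requeue_unfixed(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    /* Nothing is restored, so a changed direction survives the requeue. */
}

/* Requeue with the one-line restore proposed in the report. */
void toy_requeue_fixed(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    cmd->sc_data_direction = cmd->sc_old_data_direction;
}
```

With a command initialised as a write and its direction then overwritten, toy_requeue_unfixed() leaves the overwritten value in place, while toy_requeue_fixed() brings back the saved one. This only illustrates the restore step; it says nothing about whether that step is sufficient in the real mid-layer.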
Is there any way to get this fix into the RHAS quarterly update?
Actually, this patch is probably an incorrect band-aid. I'm still seeing issues with it. See the linux-scsi list from today, where Mike Anderson proposes a better patch.
Here's the patch from Mike Anderson:

--- linux-2.4/drivers/scsi/scsi_lib.c	Wed Apr 30 09:04:16 2003
+++ linux-2.4-p/drivers/scsi/scsi_lib.c	Wed Apr 30 09:04:17 2003
@@ -256,6 +256,17 @@
 		if (SCpnt != NULL) {
 
 			/*
+			 * This is a work around for a case where this Scsi_Cmnd
+			 * may have been through the busy retry paths already. We
+			 * clear the special flag and try to restore the
+			 * read/write request cmd value.
+			 */
+			if (SCpnt->request.cmd == SPECIAL)
+				SCpnt->request.cmd =
+					(SCpnt->sc_data_direction ==
+					 SCSI_DATA_WRITE) ? WRITE : READ;
+
+			/*
 			 * For some reason, we are not done with this request.
 			 * This happens for I/O errors in the middle of the request,
 			 * in which case we need to request the blocks that come after
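The conditional in that patch can be tried in isolation with a small user-space sketch. The types and constants below (toy_scsi_cmnd, TOY_SPECIAL and so on) are hypothetical stand-ins, and the request.cmd field is flattened to request_cmd for brevity; only the shape of the test and the ternary is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum toy_req_cmd { TOY_READ, TOY_WRITE, TOY_SPECIAL };
enum toy_dir { TOY_DIR_WRITE, TOY_DIR_READ, TOY_DIR_NONE };

struct toy_scsi_cmnd {
    enum toy_req_cmd request_cmd;   /* models SCpnt->request.cmd */
    enum toy_dir sc_data_direction; /* models SCpnt->sc_data_direction */
};

/*
 * If the request is still marked special (for example after a busy
 * retry), derive a read/write value from the data direction.
 * Requests that are not marked special are left untouched.
 */
void toy_restore_request_cmd(struct toy_scsi_cmnd *SCpnt)
{
    assert(SCpnt != NULL);
    if (SCpnt->request_cmd == TOY_SPECIAL)
        SCpnt->request_cmd =
            (SCpnt->sc_data_direction == TOY_DIR_WRITE) ? TOY_WRITE : TOY_READ;
}
```

Note that the ternary maps every direction other than write to a read, including a command with no data phase. That matches the patch comment's wording that it will "try to restore" the value rather than guarantee it.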
Following a QueueFull returned to the Host, I also noticed that the Host would sometimes never resubmit any IO to the offending Lun. This was originally thought to be an adapter problem, but in view of recent comments on this bug, please consider whether this is also an error in the scsi mid-layer code. The HBA was a QLA2300; output from /proc/scsi/qla2300/1 follows:

QLogic PCI to Fibre Channel Host Adapter for ISP23xx:
Firmware version: 3.01.18, Driver version 6.05.00b7-debug
Entry address = f89a9060
HBA: QLA2300 , Serial# D82006
This needs to be addressed in the next AS 2.1 update.
Paul, have you tried the above patch (from Mike Anderson)? Does it solve your problem? Also, the second problem you list should probably be a separate bugzilla. Thanks. Jeff
One other thing. You mention that the kernel panics. Do you have the output from such a panic? This would be helpful.
We have been unable to reproduce this problem, even with the scsi_debug driver, and hence cannot verify the fix. It is not clear from the comments above whether the proposed fix is adequate, and whether it has been verified by Paul at IBM. As a result, this patch is not in U3. Messing with scsi_request_fn() is risky. To make a change we will need to show both A) that the source of the problem is fully understood and B) why the fix is the correct fix for the problem.
In reply to your questions:
1. No, I have not had time to try the above patch. I did not realise that this was required of me. However, I will try to arrange it as soon as possible if it is still required.
2. The only information I obtained from the error was the text string reported above, "scsi_free:Bad offset".
3. It is possible for me to limit resources at a target scsi device so that it will return queue full. Thus I can subject any fix to this simple unit test.
There has been no reply for 10 months. If this problem still exists, please reopen this BZ with feedback on the proposed patch.