Bug 88995

Summary: Linux Kernel panic when SCSI Target returns Queue Full
Product: Red Hat Enterprise Linux 2.1 Reporter: Paul Kilty <kiltyp>
Component: kernel Assignee: Tom Coughlan <coughlan>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: high    
Version: 2.1 CC: jmoyer, shillman
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2004-10-25 14:25:35 UTC Type: ---
Bug Blocks: 87937    

Description Paul Kilty 2003-04-16 07:36:55 UTC
Description of problem:
The function scsi_mlqueue_insert() does not restore the data direction before
putting the command back on the queue when the device is busy.

Version-Release number of selected component (if applicable):

Linux 2.4.9-e16enterprise

How reproducible: Very reproducible whenever the code takes the device-busy
path.


Steps to Reproduce:
1. Get the SCSI target to bounce a task with Queue Full status
2. Use a Finisar trace to detect the data direction (or wait for the kernel crash!)
    
Actual results:

The Linux kernel crashes with the message:
Kernel panic:scsi_free:Bad offset

Expected results:
(You are joking, aren't you!)

Additional info:

The fix is as follows. The last few lines of scsi_mlqueue_insert() should read:

        /*
         * Decrement the counters, since these commands are no longer 
         * active on the host/device.
         */
        atomic_dec(&cmd->host->host_busy);
        atomic_dec(&cmd->device->device_busy);

        /*
         * Restore the direction
         */
        cmd->sc_data_direction = cmd->sc_old_data_direction;

        /* 
         * Insert this command at the head of the queue for its device.
         * It will go before all other commands that are already in the queue.
         */
        scsi_insert_special_cmd(cmd, 1);
        return 0;

The only change is the added statement that restores the direction.  The fix
has been tested and works.  Compare it with the similar statement in
scsi_retry_command(), which is the path taken when the device is not busy.
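
A minimal user-space sketch of why the restore matters, assuming the direction
field can be overwritten somewhere between dispatch and requeue.  The struct,
the enum values, and the mock_* function names are hypothetical stand-ins, not
the kernel's definitions; only sc_data_direction and sc_old_data_direction are
taken from the fix above.

```c
/* Hypothetical stand-ins for the kernel's direction constants. */
enum mock_dir { MOCK_DIR_WRITE, MOCK_DIR_READ, MOCK_DIR_NONE };

/* Simplified stand-in for Scsi_Cmnd, reduced to the two fields in the fix. */
struct mock_cmnd {
    enum mock_dir sc_data_direction;     /* direction used for this attempt */
    enum mock_dir sc_old_data_direction; /* direction saved at first issue */
};

/* Save the original direction when the command is first issued. */
static void mock_issue(struct mock_cmnd *cmd, enum mock_dir dir)
{
    cmd->sc_data_direction = dir;
    cmd->sc_old_data_direction = dir;
}

/* Assumption: some step before the requeue leaves a different direction. */
static void mock_clobber(struct mock_cmnd *cmd, enum mock_dir dir)
{
    cmd->sc_data_direction = dir;
}

/* Requeue path without the fix: the clobbered direction is kept. */
static enum mock_dir mock_requeue_unfixed(struct mock_cmnd *cmd)
{
    return cmd->sc_data_direction;
}

/* Requeue path with the fix: restore the saved direction first. */
static enum mock_dir mock_requeue_fixed(struct mock_cmnd *cmd)
{
    cmd->sc_data_direction = cmd->sc_old_data_direction;
    return cmd->sc_data_direction;
}
```

In this model a command issued as a write and then clobbered to read is
requeued as a read by the unfixed path and as a write by the fixed path.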

Comment 1 Tim Pepper 2003-04-22 16:46:37 UTC
Is there any way to get this fix into the RHAS quarterly update?

Comment 2 Tim Pepper 2003-04-30 17:04:40 UTC
Actually, this patch is probably an incorrect band-aid.  I'm still seeing issues
with it.

See today's linux-scsi list, where Mike Anderson proposes a better patch.

Comment 3 Need Real Name 2003-04-30 22:56:10 UTC
Here's the patch from Mike Anderson:

--- linux-2.4/drivers/scsi/scsi_lib.c		Wed Apr 30 09:04:16 2003
+++ linux-2.4-p/drivers/scsi/scsi_lib.c		Wed Apr 30 09:04:17 2003
@@ -256,6 +256,17 @@
 	if (SCpnt != NULL) {
 
 		/*
+		 * This is a work around for a case where this Scsi_Cmnd
+		 * may have been through the busy retry paths already. We
+		 * clear the special flag and try to restore the
+		 * read/write request cmd value.
+		 */
+		if (SCpnt->request.cmd == SPECIAL)
+			SCpnt->request.cmd = 
+				(SCpnt->sc_data_direction ==
+				 SCSI_DATA_WRITE) ? WRITE : READ;
+
+		/*
 		 * For some reason, we are not done with this request.
 		 * This happens for I/O errors in the middle of the request,
 		 * in which case we need to request the blocks that come after
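
The mapping in the hunk above can be sketched in isolation as follows.  This is
a user-space model only: the mock_* types and constants are hypothetical
stand-ins for Scsi_Cmnd, SPECIAL, READ, WRITE and SCSI_DATA_WRITE, and their
values are illustrative, not the kernel's.

```c
/* Hypothetical stand-ins for the kernel constants used in the patch. */
enum mock_direction { MOCK_DATA_WRITE, MOCK_DATA_READ, MOCK_DATA_NONE };
enum mock_req_cmd { MOCK_READ, MOCK_WRITE, MOCK_SPECIAL };

/* Simplified stand-in for Scsi_Cmnd with only the fields the patch reads. */
struct mock_scpnt {
    enum mock_req_cmd request_cmd;         /* models SCpnt->request.cmd */
    enum mock_direction sc_data_direction; /* models SCpnt->sc_data_direction */
};

/*
 * Mirror of the patch logic: if the request is still flagged special,
 * derive the read/write request command from the data direction.
 * Requests that are not flagged special are left untouched.
 */
static void mock_restore_request_cmd(struct mock_scpnt *sc)
{
    if (sc->request_cmd == MOCK_SPECIAL)
        sc->request_cmd =
            (sc->sc_data_direction == MOCK_DATA_WRITE) ? MOCK_WRITE
                                                       : MOCK_READ;
}
```

Note that, as written in the patch, any direction other than write maps to READ,
so the result is only as good as the sc_data_direction value at that point.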

Comment 4 Paul Kilty 2003-05-01 07:22:47 UTC
Following a Queue Full returned to the host, I also noticed that the host would
sometimes never resubmit any I/O to the offending LUN.  This was originally
thought to be an adapter problem, but in view of recent comments on this bug,
please consider whether this is also an error in the SCSI mid-layer code.

The HBA was a QLA2300; output from /proc/scsi/qla2300/1 follows:

QLogic PCI to Fibre Channel Host Adapter for ISP23xx:
        Firmware version:  3.01.18, Driver version 6.05.00b7-debug
Entry address = f89a9060
HBA: QLA2300 , Serial# D82006


Comment 6 Tom Coughlan 2003-08-19 13:59:51 UTC
This needs to be addressed in the next AS 2.1 update.

Comment 7 Jeff Moyer 2003-10-24 18:34:30 UTC
Paul, have you tried the above patch (from Mike Anderson)?  Does it solve your
problem?  Also, the second problem you list should probably be a separate bugzilla.

Thanks.

Jeff

Comment 8 Jeff Moyer 2003-10-24 18:41:55 UTC
One other thing.  You mention that the kernel panics.  Do you have the output
from such a panic?  This would be helpful.

Comment 9 Tom Coughlan 2003-12-02 13:24:59 UTC
We have been unable to reproduce this problem, even with the
scsi_debug driver, and hence cannot verify the fix. 

It is not clear from the comments above whether the proposed fix is
adequate, and whether it has been verified by Paul at IBM.  
As a result, this patch is not in U3. 

Messing with scsi_request_fn() is risky.  To make a change we will
need to show both A) that the source of the problem is fully
understood and B) why the fix is the correct fix for the problem.  

Comment 10 Paul Kilty 2003-12-16 12:07:50 UTC
In reply to your questions:
1. No, I have not had time to try the above patch. I did not realise
that this was required of me. However, I will try to arrange it as soon
as possible if this is still required.

2. The only information I obtained from the error was the text string
reported above "scsi_free:Bad offset".

3. It is possible for me to limit resources at a target scsi device so
that it will return queue full.  Thus I can subject any fix to this
simple unit test.
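
As an illustration of the kind of resource limit meant in point 3, here is a
small user-space model of a target with a fixed queue depth.  It is a sketch
under stated assumptions, not the actual test setup: the mock_target type and
its functions are hypothetical, and 0x28 is the SAM status code for TASK SET
FULL (Queue Full).

```c
#define MOCK_STATUS_GOOD          0x00
#define MOCK_STATUS_TASK_SET_FULL 0x28 /* SAM status for Queue Full */

/* Hypothetical target model: accepts at most 'depth' outstanding tasks. */
struct mock_target {
    int depth;       /* resource limit configured for the test */
    int outstanding; /* tasks accepted and not yet completed */
};

/* Accept the task if there is room, otherwise bounce it with Queue Full. */
static int mock_target_submit(struct mock_target *t)
{
    if (t->outstanding >= t->depth)
        return MOCK_STATUS_TASK_SET_FULL;
    t->outstanding++;
    return MOCK_STATUS_GOOD;
}

/* Complete one outstanding task, freeing a slot for a requeued command. */
static void mock_target_complete(struct mock_target *t)
{
    if (t->outstanding > 0)
        t->outstanding--;
}
```

With a depth of 2, the third submission is bounced with Queue Full, and a
resubmission succeeds once one of the outstanding tasks has completed.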


Comment 11 Tom Coughlan 2004-10-25 14:25:35 UTC
There has been no reply for 10 months. If this problem still exists
please reopen this BZ, with feedback on the proposed patch.