Bug 176879 - shost->host_busy count is set to -1
Summary: shost->host_busy count is set to -1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Chip Coldwell
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 217097
 
Reported: 2006-01-03 22:39 UTC by tom phelan
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-02 00:02:02 UTC
Target Upstream Version:
Embargoed:


Attachments
patch that implements the recommendation from comment #1 (524 bytes, patch)
2007-01-23 15:43 UTC, Chip Coldwell


Links
System: Red Hat Product Errata
ID: RHBA-2007:0304
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 5
Last Updated: 2007-04-28 18:58:50 UTC

Description tom phelan 2006-01-03 22:39:02 UTC
Description of problem:
  System hangs spinning in the belief that too many I/Os have been
  issued to a device. This is caused by the shost->host_busy count
  being set to -1.

Version-Release number of selected component (if applicable):
RHEL4 U2.

How reproducible:
The system hangs every time it is booted with the hardware
configuration given below.

Steps to Reproduce:
  Connect the system to a SCSI device through an HBA that uses the
  mptscsi driver. The device must not support the REPORT_LUNs command,
  and it must have more LUNs than the mptscsi driver expects for a
  device with bus_type of SCSI.

  Let the SCSI device have 256 LUNs as defined by the scsi_static_device_list.
  The mptscsi driver believes that the HBA supports at most 64 LUNs.
  When the SCSI scan code issues an INQUIRY to the 65th LUN, the queuecommand()
  will fail. At that time, the host_busy value gets set to -1. I did not
  investigate how this occurred. I just printed the shost->host_busy value and
  saw that it was equal to 65535. 
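
  The 65535 reading is what -1 looks like in an unsigned 16-bit counter.
  A minimal sketch of that wraparound follows. It assumes host_busy is
  16 bits wide and unsigned, which is inferred from the printed value;
  struct fake_host and the fake_* helpers are hypothetical stand-ins,
  not kernel code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the host structure. The unsigned 16-bit
 * width is an assumption inferred from the 65535 reading. */
struct fake_host {
    uint16_t host_busy;
};

/* One command handed to the driver. */
static void fake_get(struct fake_host *h)
{
    h->host_busy++;
}

/* One command released. There is no lower-bound check, so an extra
 * call wraps the counter from 0 to 65535. */
static void fake_put(struct fake_host *h)
{
    h->host_busy--;
}

/* Reinterpret the counter as a signed 16-bit quantity. */
static int fake_as_signed(const struct fake_host *h)
{
    return h->host_busy > INT16_MAX ? (int)h->host_busy - 65536
                                    : (int)h->host_busy;
}
```

  One fake_get() followed by two fake_put() calls leaves host_busy at
  65535, and fake_as_signed() reports -1 for that value.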

  I expect that this problem will be reproducible for other situations
  where a driver queuecommand() entry point returns an error during the
  SCSI LUN probe sequence.
  
Actual results:
System hangs

Expected results:
SCSI LUN probe process should complete and system should boot.

Additional info:
The problem was resolved by adding the BLIST_REPORTLUN2 flag to the
entry in the scsi_static_device_list. This prevented the SCSI scan
code from issuing INQUIRYs to LUNs > 64.

Comment 1 Ed Goggin 2007-01-05 16:39:20 UTC
Looks to me that the problem is that mptscsih_qcmd, the LSILogic queuecommand
callout in mptscsih.c, is both calling the scsi command's io done callback and
returning FAILED to the scsi mid-layer's queuecommand function whenever
the LUN of the command is greater than the mptscsih configured/derived
"last lun".  It should be doing one or the other but certainly not both.
Doing both causes both the scsi mid-layer's host_busy and device_busy
values for the adapter's host structure to be decremented twice instead
of once -- thereby causing the -1.

Failing the command with a scsi status of DID_BAD_TARGET, calling the io
done callback on the failed command, and returning 0 to queuecommand is
the thing to do.  This amounts to a one line change to the RHEL4
mptscsih.c driver -- returning 0 instead of FAILED whenever the
command's LUN is greater than "last_lun".
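
The double decrement described above can be sketched with a toy model.
This is not the mptscsih.c code or the attached patch: struct fake_host,
struct fake_cmd, fake_dispatch() and the FAKE_* constants are all
hypothetical, and the mid-layer is reduced to a single counter. The
sketch only shows why completing a command and also returning a nonzero
status releases the same slot twice, and why returning 0 avoids it.

```c
#include <stddef.h>

#define FAKE_FAILED         1     /* hypothetical "could not queue" status */
#define FAKE_DID_BAD_TARGET 0x04  /* placeholder host byte, illustration only */

struct fake_host { int host_busy; int last_lun; };
struct fake_cmd  { struct fake_host *host; int lun; int result; };

/* Completion callback: the mid-layer releases the command's slot. */
static void fake_done(struct fake_cmd *c)
{
    c->host->host_busy--;
}

/* Buggy driver entry point: completes the command AND reports failure. */
static int qcmd_buggy(struct fake_cmd *c)
{
    if (c->lun > c->host->last_lun) {
        c->result = FAKE_DID_BAD_TARGET << 16;
        fake_done(c);          /* first release */
        return FAKE_FAILED;    /* triggers a second release in dispatch */
    }
    return 0;
}

/* Fixed driver entry point: completes the command and returns 0. */
static int qcmd_fixed(struct fake_cmd *c)
{
    if (c->lun > c->host->last_lun) {
        c->result = FAKE_DID_BAD_TARGET << 16;
        fake_done(c);          /* the only release */
    }
    return 0;
}

/* Toy mid-layer dispatch: take a slot, call the driver, and release the
 * slot itself if the driver says the command was not queued. */
static int fake_dispatch(struct fake_host *h, int lun,
                         int (*qcmd)(struct fake_cmd *))
{
    struct fake_cmd c = { h, lun, 0 };

    h->host_busy++;
    if (qcmd(&c) != 0)
        h->host_busy--;
    return h->host_busy;
}
```

With last_lun set to 63 (an assumed value for a 64-LUN limit),
dispatching LUN 64 through qcmd_buggy leaves host_busy at -1, while the
same dispatch through qcmd_fixed leaves it at 0.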

Comment 3 Chip Coldwell 2007-01-23 15:43:46 UTC
Created attachment 146313 [details]
patch that implements the recommendation from comment #1

Comment 4 Chip Coldwell 2007-01-23 17:35:03 UTC
I've built a kernel that incorporates the patch in comment #3:

http://people.redhat.com/coldwell/bugs/kernel/176879/

If someone at VMware (Tom Phelan or Ed Goggin?) could please verify that the
problem is fixed in the test kernel, and if Eric Moore at LSI is willing to sign
off on the patch, then I will submit the patch for inclusion in the RHEL4.5 kernel.

Thanks,

Chip


Comment 5 Ed Goggin 2007-01-23 20:31:34 UTC
Chip, when should this verify be done by in order to have the patch included in
the RHEL4 U5 distro?

Thanks,

Ed

Comment 6 Chip Coldwell 2007-01-23 21:03:09 UTC
(In reply to comment #5)
> Chip, when should this verify be done by in order to have the patch included in
> the RHEL4 U5 distro?

As soon as possible.  I really should have had that patch submitted before
Christmas (my fault).

Chip


Comment 8 Eric Moore 2007-01-29 18:31:24 UTC
Chip - Sorry for being late, however I just came to know about
this patch on 1/23/2007.  The suggested patch in comment #3 is fine.  We've 
added this patch to our internal driver stream.

Comment 9 Jason Baron 2007-02-01 19:33:38 UTC
committed in stream U5 build 45. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/


Comment 10 Jay Turner 2007-02-05 19:10:26 UTC
QE ack for 4.5.

Comment 12 Mike Gahagan 2007-04-02 18:28:15 UTC
Patch is in the -52 kernel.


Comment 14 Red Hat Bugzilla 2007-05-02 00:02:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html


