Bug 176879

Summary: shost->host_busy count is set to -1
Product: Red Hat Enterprise Linux 4 Reporter: tom phelan <tap>
Component: kernel Assignee: Chip Coldwell <coldwell>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: andriusb, coughlan, egoggin, eric.moore, jbaron
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0304 Doc Type: Bug Fix
Last Closed: 2007-05-02 00:02:02 UTC Type: ---
Bug Blocks: 217097    
Attachments:
Description Flags
patch that implements the recommendation from comment #1 none

Description tom phelan 2006-01-03 22:39:02 UTC
Description of problem:
  System hangs spinning in the belief that too many I/Os have been
  issued to a device. This is caused by the shost->host_busy count
  being set to -1.

Version-Release number of selected component (if applicable):
RHEL4 U2.

How reproducible:
The system hang happens every time the system is booted with the hardware
configuration given below.

Steps to Reproduce:
  The system has a SCSI device attached to an HBA that uses the mptscsi
  driver. The device does not support the REPORT_LUNs command, but it has
  more LUNs than the mptscsi driver expects for a device with a bus_type
  of SCSI.

  Let the SCSI device have 256 LUNs as defined by the scsi_static_device_list.
  The mptscsi driver believes that the HBA supports at most 64 LUNs.
  When the SCSI scan code issues an INQUIRY to the 65th LUN, the queuecommand()
  will fail. At that time, the host_busy value gets set to -1. I did not
  investigate how this occurred. I just printed the shost->host_busy value and
  saw that it was equal to 65535. 
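
  The 65535 reading is consistent with a counter that was decremented once
  too often. A minimal sketch, assuming host_busy is stored as an unsigned
  16-bit field (an assumption; the field's type is not stated in this
  report). The helper name busy_dec is hypothetical:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: decrement a busy counter once.
 * Assumption: the counter is an unsigned 16-bit value, so it cannot
 * hold -1 and wraps around instead.
 */
static uint16_t busy_dec(uint16_t busy)
{
    /* Arithmetic is done modulo 65536 after the cast back to uint16_t. */
    return (uint16_t)(busy - 1u);
}
```

  Under that assumption, busy_dec(0) yields 65535, which is what a "-1"
  looks like when the field is printed as an unsigned value.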

  I expect that this problem will be reproducible for other situations
  where a driver queuecommand() entry point returns an error during the
  SCSI LUN probe sequence.
  
Actual results:
System hangs

Expected results:
SCSI LUN probe process should complete and system should boot.

Additional info:
The problem was resolved by adding the BLIST_REPORTLUN2 flag to the
entry in the scsi_static_device_list. This prevented the SCSI scan
code from issuing INQUIRYs to LUNs > 64.

Comment 1 Ed Goggin 2007-01-05 16:39:20 UTC
Looks to me that the problem is that mptscsih_qcmd, the LSILogic queuecommand
callout in mptscsih.c, is both calling the scsi command's io done callback and
returning FAILED to the scsi mid-layer's queuecommand function whenever
the LUN of the command is greater than the mptscsih configured/derived
"last lun".  It should be doing one or the other but certainly not both.
Doing both causes both the scsi mid-layer's host_busy and device_busy
values for the adapter's host structure to be decremented twice instead
of once -- thereby causing the -1.

Failing the command with a scsi status of DID_BAD_TARGET, calling the io
done callback on the failed command, and returning 0 to queuecommand is
the thing to do.  This amounts to a one line change to the RHEL4
mptscsih.c driver -- returning 0 instead of FAILED whenever the
command's LUN is greater than "last_lun".

Comment 3 Chip Coldwell 2007-01-23 15:43:46 UTC
Created attachment 146313 [details]
patch that implements the recommendation from comment #1

Comment 4 Chip Coldwell 2007-01-23 17:35:03 UTC
I've built a kernel that incorporates the patch in comment #3:

http://people.redhat.com/coldwell/bugs/kernel/176879/

If someone at VMware (Tom Phelan or Ed Goggin?) could please verify that the
problem is fixed in the test kernel, and if Eric Moore at LSI is willing to sign
off on the patch, then I will submit the patch for inclusion in the RHEL4.5 kernel.

Thanks,

Chip


Comment 5 Ed Goggin 2007-01-23 20:31:34 UTC
Chip, when should this verify be done by in order to have the patch included in
the RHEL4 U5 distro?

Thanks,

Ed

Comment 6 Chip Coldwell 2007-01-23 21:03:09 UTC
(In reply to comment #5)
> Chip, when should this verify be done by in order to have the patch included in
> the RHEL4 U5 distro?

As soon as possible.  I really should have had that patch submitted before
Christmas (my fault).

Chip


Comment 8 Eric Moore 2007-01-29 18:31:24 UTC
Chip - Sorry for being late, however I just came to know about
this patch on 1/23/2007.  The suggested patch in comment #3 is fine.  We've 
added this patch to our internal driver stream.

Comment 9 Jason Baron 2007-02-01 19:33:38 UTC
committed in stream U5 build 45. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/


Comment 10 Jay Turner 2007-02-05 19:10:26 UTC
QE ack for 4.5.

Comment 12 Mike Gahagan 2007-04-02 18:28:15 UTC
Patch is in the -52 kernel.


Comment 14 Red Hat Bugzilla 2007-05-02 00:02:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html