Red Hat Bugzilla – Bug 176879
shost->host_busy count is set to -1
Last modified: 2007-11-30 17:07:22 EST
Description of problem:
System hangs spinning in the belief that too many I/Os have been
issued to a device. This is caused by the shost->host_busy count
being set to -1.
Version-Release number of selected component (if applicable):
The system hangs every time it is booted with the hardware
configuration given below.
Steps to Reproduce:
Use a system with a SCSI device attached to an HBA that is driven by
the mptscsi driver. The device must not support the REPORT_LUNs
command, and it must have more LUNs than the mptscsi driver expects
for a device with a bus_type of SCSI.
Let the SCSI device have 256 LUNs as defined by the scsi_static_device_list.
The mptscsi driver believes that the HBA supports at most 64 LUNs.
When the SCSI scan code issues an INQUIRY to the 65th LUN, the driver's
queuecommand() fails. At that point, the host_busy value gets set to -1. I did
not investigate how this occurred; I just printed the shost->host_busy value and
saw that it was equal to 65535.
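The two numbers above (-1 and 65535) are consistent if the counter lives in a 16-bit unsigned field; that width is an assumption here, since the report does not state the field's type. A minimal user-space sketch of the wraparound, with a hypothetical `toy_host` struct standing in for the real host structure:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the host's busy counter. The 16-bit unsigned
 * width is an assumption, not something the report states. */
struct toy_host {
    uint16_t host_busy;
};

/* Release one slot without checking for underflow. */
static void toy_release_slot(struct toy_host *host)
{
    assert(host);
    /* Unsigned arithmetic wraps modulo 2^16, so 0 - 1 becomes 65535. */
    host->host_busy--;
}
```

Calling `toy_release_slot()` once more than slots were taken leaves the counter at 65535, which any "too many commands outstanding" comparison would then treat as a saturated host.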
I expect that this problem is reproducible in other situations
where a driver's queuecommand() entry point returns an error during the
SCSI LUN probe sequence.
The SCSI LUN probe process should complete and the system should boot.
The problem was resolved by adding the BLIST_REPORTLUN2 flag to the
entry in the scsi_static_device_list. This prevented the SCSI scan
code from issuing INQUIRYs to LUNs > 64.
It looks to me as though the problem is that mptscsih_qcmd, the LSILogic queuecommand
callout in mptscsih.c, is both calling the scsi command's io done callback and
returning FAILED to the scsi mid-layer's queuecommand function whenever
the LUN of the command is greater than the mptscsih configured/derived
"last lun". It should be doing one or the other but certainly not both.
Doing both causes both the scsi mid-layer's host_busy and device_busy
values for the adapter's host structure to be decremented twice instead
of once -- thereby causing the -1.
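The double decrement described above can be illustrated with a toy user-space model. This is only a sketch of the accounting, not kernel code: the `toy_*` names, the `TOY_FAILED` value, and the single struct holding both counters are all hypothetical simplifications.

```c
#include <assert.h>

#define TOY_FAILED 1  /* hypothetical nonzero "command not accepted" return */

/* Hypothetical model of the two counters named in the comment above. */
struct toy_counts {
    int host_busy;
    int device_busy;
};

struct toy_cmd {
    struct toy_counts *counts;
    unsigned int lun;
    int result;
};

/* Completion path: gives back the slots taken at dispatch time. */
static void toy_done(struct toy_cmd *cmd)
{
    assert(cmd && cmd->counts);
    cmd->counts->host_busy--;
    cmd->counts->device_busy--;
}

/* Buggy driver entry point: completes the command AND reports failure. */
static int toy_qcmd_buggy(struct toy_cmd *cmd, unsigned int last_lun)
{
    if (cmd->lun > last_lun) {
        toy_done(cmd);      /* first decrement */
        return TOY_FAILED;  /* makes the caller decrement again */
    }
    return 0;               /* accepted; command stays in flight */
}

/* Mid-layer model: take a slot, then undo it if the driver refuses. */
static void toy_dispatch(struct toy_cmd *cmd, unsigned int last_lun)
{
    cmd->counts->host_busy++;
    cmd->counts->device_busy++;
    if (toy_qcmd_buggy(cmd, last_lun) != 0) {
        /* The command is treated as never started: second decrement. */
        cmd->counts->host_busy--;
        cmd->counts->device_busy--;
    }
}
```

Starting from counters of 0 and dispatching a command for LUN 64 with a `last_lun` of 63 (illustrative values matching the 65th-LUN case), both counters end at -1 in this model.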
Failing the command with a scsi status of DID_BAD_TARGET, calling the io
done callback on the failed command, and returning 0 to queuecommand is
the thing to do. This amounts to a one line change to the RHEL4
mptscsih.c driver -- returning 0 instead of FAILED whenever the
command's LUN is greater than "last_lun".
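A rough sketch of the recommended behaviour, again as a toy user-space model rather than the actual patch (the attachment itself is not reproduced here). The `toy_*` names are hypothetical; `DID_BAD_TARGET` is given the value 0x04 used by Linux's SCSI headers, placed in the host byte of the result.

```c
#include <assert.h>

#define DID_BAD_TARGET 0x04  /* host byte value from Linux's SCSI headers */

/* Hypothetical model of the host's busy counter. */
struct toy_counts {
    int host_busy;
};

struct toy_cmd {
    struct toy_counts *counts;
    unsigned int lun;
    int result;
};

/* Completion path: gives back the slot taken at dispatch time. */
static void toy_done(struct toy_cmd *cmd)
{
    assert(cmd && cmd->counts);
    cmd->counts->host_busy--;
}

/* Fixed entry point: fail the command through its completion callback
 * and return 0, so the slot is released exactly once. */
static int toy_qcmd_fixed(struct toy_cmd *cmd, unsigned int last_lun)
{
    if (cmd->lun > last_lun) {
        cmd->result = DID_BAD_TARGET << 16;  /* host byte is bits 16-23 */
        toy_done(cmd);
        return 0;  /* was a nonzero failure code in the buggy variant */
    }
    return 0;      /* accepted; command stays in flight */
}

/* Mid-layer model: take a slot, then undo it only if the driver refuses. */
static void toy_dispatch(struct toy_cmd *cmd, unsigned int last_lun)
{
    cmd->counts->host_busy++;
    if (toy_qcmd_fixed(cmd, last_lun) != 0)
        cmd->counts->host_busy--;
}
```

With the same illustrative inputs (LUN 64, `last_lun` of 63), the counter returns to 0 and the command carries `DID_BAD_TARGET` in its result, so the caller sees a failed command without any extra accounting.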
Created attachment 146313 [details]
patch that implements the recommendation from comment #1
I've built a kernel that incorporates the patch in comment #3:
If someone at VMware (Tom Phelan or Ed Groggin?) could please verify that the
problem is fixed in the test kernel, and if Eric Moore at LSI is willing to sign
off on the patch, then I will submit the patch for inclusion in the RHEL4.5 kernel.
Chip, when should this verify be done by in order to have the patch included in
the RHEL4 U5 distro?
(In reply to comment #5)
> Chip, when should this verify be done by in order to have the patch included in
> the RHEL4 U5 distro?
As soon as possible. I really should have had that patch submitted before
Christmas (my fault).
Chip - Sorry for being late, however I just came to know about
this patch on 1/23/2007. The suggested patch in comment #3 is fine. We've
added this patch to our internal driver stream.
Committed in stream U5 build 45. A test kernel with this patch is available from
QE ack for 4.5.
Patch is in the -52 kernel.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.