Description of problem:
System hangs, spinning in the belief that too many I/Os have been issued to a device. This is caused by the shost->host_busy count being set to -1.

Version-Release number of selected component (if applicable):
RHEL4 U2.

How reproducible:
The system hangs every time it is booted with the hardware configuration given below.

Steps to Reproduce:
Connect the system to a SCSI device through an HBA that uses the mptscsi driver, where the device does not support the REPORT_LUNs command but has more LUNs than the mptscsi driver expects for a device with bus_type of SCSI. Let the SCSI device have 256 LUNs as defined by the scsi_static_device_list. The mptscsi driver believes that the HBA supports at most 64 LUNs. When the SCSI scan code issues an INQUIRY to the 65th LUN, the queuecommand() call fails. At that time, the host_busy value gets set to -1. I did not investigate how this occurred. I just printed the shost->host_busy value and saw that it was equal to 65535. I expect that this problem will be reproducible in other situations where a driver's queuecommand() entry point returns an error during the SCSI LUN probe sequence.

Actual results:
System hangs.

Expected results:
The SCSI LUN probe process should complete and the system should boot.

Additional info:
The problem was resolved by adding the BLIST_REPORTLUN2 flag to the entry in the scsi_static_device_list. This prevented the SCSI scan code from issuing INQUIRYs to LUNs > 64.
It looks to me as though the problem is that mptscsih_qcmd, the LSILogic queuecommand callout in mptscsih.c, is both calling the scsi command's io done callback and returning FAILED to the scsi mid-layer's queuecommand function whenever the LUN of the command is greater than the mptscsih configured/derived "last_lun". It should be doing one or the other but certainly not both. Doing both causes both the host_busy and device_busy values that the scsi mid-layer keeps for the adapter's host and device structures to be decremented twice instead of once -- thereby causing the -1.

The right thing to do is to fail the command with a scsi status of DID_BAD_TARGET, call the io done callback on the failed command, and return 0 to queuecommand. This amounts to a one line change to the RHEL4 mptscsih.c driver -- returning 0 instead of FAILED whenever the command's LUN is greater than "last_lun".
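To make the double decrement concrete, here is a rough user-space sketch of the accounting, not the actual driver or mid-layer code. Every name prefixed with sim_ or SIM_ is a hypothetical stand-in invented for illustration, the counter is assumed to be 16 bits wide (consistent with the 65535 reported in the description), and the mid-layer is assumed to undo its own accounting whenever queuecommand returns nonzero.

```c
/* Hypothetical stand-ins for the mid-layer bookkeeping; not the kernel structs. */
struct sim_host {
    unsigned short host_busy;   /* assumed 16-bit, so -1 shows up as 65535 */
};

struct sim_cmd {
    struct sim_host *host;
    unsigned int lun;
    int result;
    void (*done)(struct sim_cmd *);
};

#define SIM_DID_BAD_TARGET 0x04    /* stand-in for DID_BAD_TARGET */
#define SIM_FAILED         0x2003  /* stand-in for FAILED */
#define SIM_LAST_LUN       63      /* driver believes only LUNs 0..63 exist */

/* Completion callback: the mid-layer drops its busy count when a command finishes. */
void sim_done(struct sim_cmd *cmd)
{
    cmd->host->host_busy--;
}

/* Buggy behaviour: completes the command AND reports a queuing failure. */
int sim_qcmd_buggy(struct sim_cmd *cmd)
{
    if (cmd->lun > SIM_LAST_LUN) {
        cmd->result = SIM_DID_BAD_TARGET << 16;
        cmd->done(cmd);          /* first decrement */
        return SIM_FAILED;       /* triggers a second decrement in the caller */
    }
    return 0;                    /* command accepted and still in flight */
}

/* Fixed behaviour: completes the command with an error and returns 0. */
int sim_qcmd_fixed(struct sim_cmd *cmd)
{
    if (cmd->lun > SIM_LAST_LUN) {
        cmd->result = SIM_DID_BAD_TARGET << 16;
        cmd->done(cmd);          /* the only decrement */
        return 0;
    }
    return 0;
}

/* Simplified dispatch: count the command, queue it, undo the count on failure. */
unsigned short sim_dispatch(int (*qcmd)(struct sim_cmd *), unsigned int lun)
{
    struct sim_host host = { 0 };
    struct sim_cmd cmd = { &host, lun, 0, sim_done };

    host.host_busy++;            /* accounted for before queuing */
    if (qcmd(&cmd) != 0)
        host.host_busy--;        /* assumed mid-layer cleanup on queuing failure */
    return host.host_busy;
}
```

Under these assumptions, dispatching LUN 64 through sim_qcmd_buggy leaves the counter at 65535, while sim_qcmd_fixed leaves it at 0.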
Created attachment 146313 [details] patch that implements the recommendation from comment #1
I've built a kernel that incorporates the patch in comment #3: http://people.redhat.com/coldwell/bugs/kernel/176879/ If someone at VMWare (Tom Phelan or Ed Groggin?) could please verify that the problem is fixed in the test kernel, and if Eric Moore at LSI is willing to sign off on the patch, then I will submit the patch for inclusion in the RHEL4.5 kernel. Thanks, Chip
Chip, when should this verify be done by in order to have the patch included in the RHEL4 U5 distro? Thanks, Ed
(In reply to comment #5)
> Chip, when should this verify be done by in order to have the patch included in
> the RHEL4 U5 distro?

As soon as possible. I really should have had that patch submitted before Christmas (my fault).

Chip
Chip - Sorry for the late reply; I only learned of this patch on 1/23/2007. The suggested patch in comment #3 is fine. We've added this patch to our internal driver stream.
Committed in stream U5 build 45. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
QE ack for 4.5.
Patch is in the -52 kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html