Bug 480158 - RHEL 4.8 mpt driver fails to bring up device
Summary: RHEL 4.8 mpt driver fails to bring up device
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 4.8
Assignee: Tomas Henzl
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 445361
Reported: 2009-01-15 14:25 UTC by Vivek Goyal
Modified: 2009-09-03 14:11 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:25:18 UTC
Target Upstream Version:
Embargoed:


Attachments
disable msi (272 bytes, patch)
2009-01-16 14:41 UTC, Tomas Henzl
The patch is to fix an issue of incorrectly setting DMA mask for 106XE controllers (7.82 KB, patch)
2009-01-20 07:47 UTC, Sathya Prakash


Links
System: Red Hat Product Errata
ID: RHSA-2009:1024
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 14:57:26 UTC

Description Vivek Goyal 2009-01-15 14:25:09 UTC
Description of problem:
mpt driver fails to bring up the device and fails in mptbase.

RHTS logs link

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/distribution/kernelinstall&result=Fail&rwhiteboard=kernel%202.6.9-78.30.EL.vgoyal.test4&arch=x86_64&jobids=41921


Loading mptbase.ko module
Fusion MPT base driver 3.12.29.00rh
Copyright (c) 1999-2008 LSI Corporation
Loading mptscsi.ko module
Loading mptspi.ko module
Fusion MPT SPI Host driver 3.12.29.00rh
Loading mptsas.ko module
Fusion MPT SAS Host driver 3.12.29.00rh
ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 35 (level, low) -> IRQ 185
ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 35 (level, low) -> IRQ 185
mptbase: Initiating ioc0 bringup
ioc0: SAS1064E: Capabilities={Initiator}
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptbase: Initiating ioc0 recovery
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
scsi0 : ioc0: LSISAS1064E, FwRev=010a0000h, Ports=1, MaxQ=268, IRQ=185
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
mptscsi: ioc0: attempting task abort! (sc=000001025f465040)
scsi0 : destination target 0, lun 0
        command = Inquiry 00 00 00 24 00 
mptbase: mpt_reply: WARNING - ioc0: Invalid cb_idx (0)!
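
For triage, the pattern in the log above (repeated "Initiating ioc0 recovery" lines interleaved with "Invalid cb_idx" warnings) can be counted mechanically. The following is only a minimal sketch, not part of the driver or of any Red Hat tooling: the function names and the threshold of three recoveries are assumptions chosen for illustration, and the two patterns are copied from the messages in this report.

```python
import re

# Patterns copied from the console log in this report.
RECOVERY_RE = re.compile(r"mptbase: Initiating (ioc\d+) recovery")
BAD_REPLY_RE = re.compile(r"mptbase: mpt_reply: WARNING - (ioc\d+): Invalid cb_idx")


def summarize_mpt_log(text):
    """Count recovery attempts and invalid-reply warnings per IOC."""
    summary = {}
    for line in text.splitlines():
        for key, pattern in (("recoveries", RECOVERY_RE),
                             ("bad_replies", BAD_REPLY_RE)):
            match = pattern.search(line)
            if match:
                counts = summary.setdefault(
                    match.group(1), {"recoveries": 0, "bad_replies": 0})
                counts[key] += 1
    return summary


def looks_like_recovery_loop(text, min_recoveries=3):
    """Heuristic (assumed threshold): an IOC restarted recovery repeatedly."""
    return any(counts["recoveries"] >= min_recoveries
               for counts in summarize_mpt_log(text).values())
```

Feeding it a saved console log or dmesg capture, for example `looks_like_recovery_loop(open("console.log").read())` with a hypothetical file name, gives a quick yes/no on whether a boot shows the same symptom.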

Version-Release number of selected component (if applicable):

Happened on test kernel 2.6.9-78.30.EL.vgoyal.test4. This is primarily some patches on top of 29.EL. I suspect it happened because of the mpt patches that went into previous versions.

How reproducible:
Saw once.


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Tomas Henzl 2009-01-16 14:24:07 UTC
In RHEL5 we also had an issue with mpt: Bug 474465. Maybe this is a similar problem caused by MSI being enabled by default, which is new in mpt 3.12.29.

Comment 3 Tomas Henzl 2009-01-16 14:41:22 UTC
Created attachment 329209
disable msi

This patch disables msi, I think it's worth - we already have had similar problems in rh5.

Comment 4 Tomas Henzl 2009-01-16 14:47:53 UTC
(In reply to comment #3)
> This patch disables msi, I think it's worth - we already have had similar
> problems in rh5.
I only wanted to say that it should be tested, not that it is worth itself, my English is bad, I'll better stop explaining :)

Comment 5 Tomas Henzl 2009-01-16 17:02:49 UTC
I've just tested the MSI-disable patch, without success. The only thing I have found is that backing out the whole patch linux-2.6.9-mptfusion-update-mpt-fusion-to-version-3.12.29.00rh.patch makes the box work again.

Vivek,
should we continue here or reopen bz452163?

Rob,
the machine is yours again.

Comment 7 Sathya Prakash 2009-01-19 15:09:30 UTC
Can anyone point me to the link where I can download the kernel with the driver? I will try to reproduce it locally and look into it further.
Thanks
Sathya

Comment 8 Vivek Goyal 2009-01-19 15:27:26 UTC
(In reply to comment #7)
> Anyone point me to the link where I can download the kernel with the driver. I
> will locally try to reproduce and look in further.

http://people.redhat.com/vgoyal/rhel4/

Comment 9 Tomas Henzl 2009-01-19 17:11:24 UTC
It looks to me as though the problem is in mpt_attach. The change to using mpt_mapresources instead of dealing with resources in mpt_attach looks suspicious. We now call pci_enable_device(pdev) in mpt_attach and then again in mpt_mapresources; maybe the patch to mpt_mapresources was somehow inaccurate.

Comment 10 Rob Evers 2009-01-19 18:55:07 UTC
This patch was tested successfully on a system with the following hba:

[root@dl585-03 ~]# lspci | grep -i lsi
07:09.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS (rev 01)

With this patch the system hangs (as described above) when tested on a system
with the following hba:

[root@amd-shanghai-01 ~]# lspci | grep -i lsi
06:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET
PCI-Express Fusion-MPT SAS (rev 02)
[root@amd-shanghai-01 ~]#

The problem occurs consistently when observed.

Requested that LSI reproduce this problem.

(Shifting to use this bug report to track this problem from the original patch request, bz452163.)

Comment 11 Sathya Prakash 2009-01-20 05:06:25 UTC
It looks like an issue we captured internally. Is it occurring only on systems with more than 4GB of RAM and only with PCI-E cards?
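
To answer the two questions above for a given test box, the relevant facts can be pulled from /proc/meminfo and from lspci output. This is a rough sketch only: the function names are made up for illustration, the "LSI" and "PCI-Express" substring checks are assumptions based on the lspci lines quoted earlier in this report, and MemTotal is only an approximation of installed RAM (it excludes memory reserved by the firmware and kernel).

```python
import re

# An lspci device line starts with bus:device.function, optionally with a domain.
DEVICE_LINE_RE = re.compile(
    r"^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7] ")


def mem_total_kb(meminfo_text):
    """Return the MemTotal value in kB from /proc/meminfo-style text, or None."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def pcie_lsi_controllers(lspci_text):
    """Return lspci entries that mention both LSI and PCI-Express."""
    entries = []
    for line in lspci_text.splitlines():
        if DEVICE_LINE_RE.match(line):
            entries.append(line.strip())
        elif entries and line.strip():
            # Output pasted into a report may be wrapped; rejoin continuations.
            entries[-1] += " " + line.strip()
    return [e for e in entries if "LSI" in e and "PCI-Express" in e]


def describe_system(meminfo_text, lspci_text):
    """Collect the two facts asked about: memory size and PCI-E LSI HBAs."""
    return {
        "mem_total_kb": mem_total_kb(meminfo_text),
        "pcie_lsi": pcie_lsi_controllers(lspci_text),
    }
```

On a live system the inputs would come from `open("/proc/meminfo").read()` and the captured output of `lspci`; the result is meant to be read by a person, not used as a definitive test for the defect.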

The internal defect description is below.

"When the 3.12.29.XX driver is loaded on systems with 4GB of RAM or more, the system becomes unresponsive. The reason is that the 106E B1 and older chips have an erratum for which the driver forces all data transfers to stay within the physical address space below 4GB. The bug is that the driver requests 64-bit addresses for the 106E B1 chip, treats them as 32-bit addresses, and puts them in a 32-bit scatter-gather list. As a result the upper 32 bits of the address are lost, so the DMA actually goes to the wrong physical location. That results in infinite IOC recoveries. The issue was introduced when power management support was added to the driver. The fix is to request 32-bit addresses instead of 64-bit addresses."
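
The address loss described above can be shown with plain integer arithmetic. The snippet below is an illustrative model only, not driver code: the function names and the example address are hypothetical, and the real driver works with kernel DMA mappings rather than Python integers.

```python
MASK_32 = (1 << 32) - 1  # a 32-bit scatter-gather entry can hold only these bits


def sge32_address(dma_addr):
    """Model storing a DMA address in a 32-bit scatter-gather address field."""
    return dma_addr & MASK_32


def fits_dma_mask(dma_addr, mask_bits):
    """True if dma_addr is representable under a DMA mask of mask_bits bits."""
    return dma_addr < (1 << mask_bits)


def address_survives_sge32(dma_addr):
    """True if the address is unchanged after going through a 32-bit entry."""
    return sge32_address(dma_addr) == dma_addr


# Hypothetical buffer above 4GB, as a 64-bit DMA mask would permit.
high_addr = 0x123456000
# The device would be told to DMA here instead: the top bits are gone.
truncated = sge32_address(high_addr)  # 0x23456000
```

With a 32-bit mask requested, `fits_dma_mask(addr, 32)` holds for every address handed to the driver, so `address_survives_sge32(addr)` is always true and the truncation cannot occur, which is the reasoning behind the fix quoted above.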

I will provide a test patch soon.

Thanks
Sathya

Comment 12 Sathya Prakash 2009-01-20 07:47:37 UTC
Created attachment 329437
The patch is to fix an issue of incorrectly setting DMA mask for 106XE controllers

This patch contains the fix described in the earlier comment.

Comment 13 Tomas Henzl 2009-01-20 12:43:11 UTC
Sathya,
thanks, I can confirm that the patch resolves this issue.

Comment 14 RHEL Program Management 2009-01-20 12:50:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 Tomas Henzl 2009-01-20 16:01:49 UTC
I'm posting the patch to the internal list.

Comment 17 Vivek Goyal 2009-01-26 15:11:05 UTC
Committed in 80.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 19 Chris Ward 2009-03-27 14:20:41 UTC
~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should
be a fix present, which addresses this bug. NOTE: there is only a short time
left to test, please test and report back results on this bug
at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test results details.
Include which arches tested, package version and any applicable logs.

 - Red Hat QE Partner Management

Comment 20 Chris Ward 2009-04-16 13:31:25 UTC
Verified that the patch LSI confirmed is included in -88.EL.

Comment 22 errata-xmlrpc 2009-05-18 19:25:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

