Bug 438330 - HP dl360g5: pci_enable_msix() fails
Summary: HP dl360g5: pci_enable_msix() fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Tony Camuso
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-03-20 13:03 UTC by Eli Cohen
Modified: 2010-11-17 19:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-17 19:08:52 UTC
Target Upstream Version:
Embargoed:


Attachments
Add a quirk for ht1000 pci-e bridges (3.39 KB, patch)
2008-04-03 17:38 UTC, Doug Ledford
4 different upstream changes related to msi handling (13.45 KB, patch)
2008-04-03 17:39 UTC, Doug Ledford

Description Eli Cohen 2008-03-20 13:03:32 UTC
Description of problem:
The mlx4_core driver calls pci_enable_msix() to request 3 MSI-X vectors, but the
call fails and returns -22 (-EINVAL).
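
Editorial note: kernel functions report failures as negated errno values, so the -22 above can be decoded with a few lines of Python. This is only a sketch; the helper name decode_kernel_errno is made up for illustration, while the errno module and os.strerror come from the standard library.

```python
import errno
import os


def decode_kernel_errno(ret):
    """Map a negative kernel-style return value to (symbolic name, message)."""
    code = -ret  # kernel convention: failures are returned as -errno
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Hypothetical usage with the value reported in this bug.
name, message = decode_kernel_errno(-22)
print(name, "-", message)
```

On Linux, errno 22 is EINVAL, which suggests pci_enable_msix() rejected the request itself rather than running out of vectors.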


Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 5.1 (Tikanga)


How reproducible:
Load the InfiniBand drivers on a system with mlx4 hardware. Since our setup is
nothing special, I believe this will happen for any device that supports MSI-X.
In fact, I can't see any device using MSI-X on this machine (checked with
cat /proc/interrupts).


Additional info:
output of lspci for this HW:

13:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)
        Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at fdf00000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at f7000000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00001000
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+
                Device: Latency L0s <64ns, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes
                Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x8
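
Editorial note: the line that matters in the output above is the MSI-X capability, which shows "Enable-" (MSI-X is not turned on) and a table of 256 entries. A minimal Python sketch for pulling those fields out of lspci -vv text follows. The function name and the regular expression are illustrative assumptions that match only the line format shown in this report; other lspci versions may print this capability differently.

```python
import re

# Matches the MSI-X capability line as printed in this report, e.g.
#   Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256
MSIX_RE = re.compile(
    r"Capabilities: \[(?P<offset>[0-9a-f]+)\] MSI-X: "
    r"Enable(?P<enable>[+-]) Mask(?P<mask>[+-]) TabSize=(?P<tabsize>\d+)"
)


def parse_msix_capability(lspci_text):
    """Return the MSI-X capability fields as a dict, or None if not found."""
    match = MSIX_RE.search(lspci_text)
    if match is None:
        return None
    return {
        "offset": int(match.group("offset"), 16),   # capability offset in config space
        "enabled": match.group("enable") == "+",    # '+' means MSI-X is enabled
        "masked": match.group("mask") == "+",       # '+' means function mask is set
        "table_size": int(match.group("tabsize")),  # number of MSI-X table entries
    }


sample = "        Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256"
capability = parse_msix_capability(sample)
print(capability)
```

For this device the "enabled" field comes back False, which is consistent with pci_enable_msix() having failed.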

Comment 1 Doug Ledford 2008-03-20 19:11:13 UTC
I think your problem may be hardware specific.  From RHEL5.2 beta kernel on a
Dell PowerEdge 1900 (or 1950, can't remember which):

[dledford@ib0test2 ~]$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:  533316704          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0    IO-APIC-edge  i8042
  6:          5          0          0          0    IO-APIC-edge  floppy
  8:          1          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0    IO-APIC-edge  i8042
 14:    4787812       2952          0          0    IO-APIC-edge  ide0
 66:   23814940      15201          0          0   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
 74:         39          0          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb4
 82:     556320       2038         40          0   IO-APIC-level  libata
 90:         93        119          0          0         PCI-MSI  ib_ipath
 98:         61         57          0          0         PCI-MSI  ib_ipath
106:          0          0          0          0       PCI-MSI-X  eth1
114:         27          0          0          0       PCI-MSI-X  eth1 (queue 0)
178:    1740629          0          0          0         PCI-MSI  eth0
NMI:       5939       2273       2527       2550 
LOC:  522492125  522744130  522491999  522743982 
ERR:          0
MIS:          0
[dledford@ib0test2 ~]$ 


I seem to recall we had to blacklist certain motherboard chipsets due to faulty
MMCFG cycles.  You may have one of those motherboards/chipsets and it may be
refusing to allocate MSI interrupts because of that.  Can you check to see if
this is one of the affected platforms?  The bugzillas for the RHEL5 bugs related
to this are:

Bugzillas
=========

182436 xw9400 AMD proccessors do not support ext config space
239673 stalled installation over HP Compaq 7700
250313 MCP55 chipset hides PCI EXTCFG
251032 add HP dl385g2 and dl585g2 to whitelist
252215 dl585g2/AMD8132 blacklist
253288 PCI domain support for x86/x86_64
408551 all PCI express registers are not accessible

Comment 2 Roland Dreier 2008-03-20 21:07:39 UTC
It would be good to know what platform Eli was on.

MMCFG is completely orthogonal to MSI-X.  But there are also PCI quirks that
disable MSI on various systems, and Eli may be hitting a new one.

Comment 3 Eli Cohen 2008-03-24 08:14:24 UTC
The server is an HP ProLiant DL360 G5.
We have kernel 2.6.24 running on this server with MSI-X working fine.

Comment 4 Tony Camuso 2008-03-24 14:53:28 UTC
Please try the latest RHEL 5.2 build and attach the output of dmesg. 

An MMCONFIG patch was ACKed in Dec '07 and incorporated in Jan; it should
help with any MMCONFIG problems.

Rather than using a blacklist, the patch first tests the Northbridge to see if
MMCONFIG works. If so, then MMCONFIG (ergo MSI) is available. If the Northbridge
does not respond correctly to MMCONFIG cycles, it is constrained to port I/O
accesses, which may preclude MSI configuration.



Comment 5 Eli Cohen 2008-03-24 15:07:33 UTC
We don't have RHEL 5.2 here. If you send us a copy, we will install and test it.
Can you see any hardware using MSI-X on this machine with RHEL 5.2?

Comment 6 Prarit Bhargava 2008-03-24 15:41:13 UTC
Eli, you can grab the latest RHEL5.2 build from

http://people.redhat.com/dzickus/el5/

AFAICT MSI-X is working:

[root@hp-dl360g5-01 ~]# cat /proc/interrupts | grep MSI
114:       8667       1958        183          0          0          0          0          0       PCI-MSI-X  cciss0
138:       5558          0          0          0          0          0          0          0         PCI-MSI  eth0

P.
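
Editorial note: the grep above can also be done in a script when the IRQ number, interrupt type, and device name are needed separately. The following Python sketch assumes the /proc/interrupts layout shown in this thread (IRQ number, per-CPU counts, type, device names); the function name is made up for illustration, and the sample text is abbreviated to two CPU columns.

```python
def msi_interrupts(proc_interrupts_text):
    """Return (irq, type, devices) for each PCI-MSI or PCI-MSI-X line."""
    found = []
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header, blank lines, and NMI/LOC/ERR/MIS rows.
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        irq = int(fields[0].rstrip(":"))
        for index, field in enumerate(fields):
            if field.startswith("PCI-MSI"):
                devices = " ".join(fields[index + 1:])
                found.append((irq, field, devices))
                break
    return found


# Sample lines taken from this thread, shortened to two CPU columns.
sample = """\
           CPU0       CPU1
 14:    4787812       2952    IO-APIC-edge  ide0
114:       8667       1958       PCI-MSI-X  cciss0
138:       5558          0         PCI-MSI  eth0
NMI:       5939       2273
"""
result = msi_interrupts(sample)
print(result)
```

On a live system the input would come from reading /proc/interrupts; an empty result would match what Eli described in the original report.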

Comment 7 Eli Cohen 2008-04-01 07:28:39 UTC
The RPMs we found at the URL you specified do not contain kernel header files,
so we can't build our driver. Can you point us to the missing kernel headers?

Comment 8 Tony Camuso 2008-04-02 18:57:02 UTC
Here is the rpm for the latest kernel sources 

http://people.redhat.com/dzickus/el5/88.el5/src/kernel-2.6.18-88.el5.src.rpm



Comment 9 Doug Ledford 2008-04-03 17:37:00 UTC
And the mlx4 driver is already in that kernel, so you don't need to build it
separately.

However, I'm pretty sure the problem here isn't related to the mlx4 driver but
to the core MSI interrupt handling instead.  I would greatly appreciate it if
you could install the src rpm from comment #8, then build two new kernels using
the two patches I'm going to attach to this bug.  The easiest way to do that
would be something like this:

1. Install the src rpm above.
2. Download the two patches I'm attaching to this bug.
3. cd /usr/src/redhat/SPECS
4. Edit kernel-2.6.spec to uncomment the #% define buildid line and use it to
   differentiate between the two builds.
5. For each build, copy one of the patches to
   /usr/src/redhat/SOURCES/linux-kernel-test.patch and then run
   rpmbuild -ba --with baseonly kernel-2.6.spec

The binary rpms will be written to /usr/src/redhat/RPMS/<arch>, where they
can be installed and tested.  I'm interested in knowing whether either of the
attached patches solves your problem by itself and, if neither does, whether
both of them together do.

Comment 10 Doug Ledford 2008-04-03 17:38:27 UTC
Created attachment 300288
Add a quirk for ht1000 pci-e bridges

This patch should really only make a difference if the failing system uses
HT1000 pci-e bridges.

Comment 11 Doug Ledford 2008-04-03 17:39:28 UTC
Created attachment 300289
4 different upstream changes related to msi handling

This is a more generic set of changes that might resolve the issue regardless
of the chipset in the system.

Comment 12 Tony Camuso 2009-06-01 12:32:54 UTC
Doug, 

Is there any additional status?

Any testing you want me to do?

Or can we close this?

Comment 14 Tony Camuso 2009-06-17 11:14:44 UTC
Not getting any responses, so I assume the patch submitted by Doug in comment 11 has fixed the problem. 

Please close this bug.

