Red Hat Bugzilla – Bug 438330
HP dl360g5: pci_enable_msix() fails
Last modified: 2010-11-17 14:08:52 EST
Description of problem:
The mlx4_core driver calls pci_enable_msix() to request 3 MSI-X vectors, but
the call fails with -22 (-EINVAL).
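For readers decoding the return value, here is a minimal sketch of how a pci_enable_msix() result can be read. It assumes the legacy kernel convention: 0 means success, a negative value is an errno, and a positive value is the number of vectors that could be allocated instead. The helper name describe_msix_result is hypothetical and is not part of any driver.

```python
import errno
import os


def describe_msix_result(ret: int, requested: int) -> str:
    """Describe a pci_enable_msix() return value (legacy semantics, assumed)."""
    if ret == 0:
        return f"success: {requested} vectors allocated"
    if ret > 0:
        # Positive return: the request exceeded what is available.
        return f"only {ret} vectors available, fewer than the {requested} requested"
    # Negative return: a kernel errno value with the sign flipped.
    name = errno.errorcode.get(-ret, "unknown")
    return f"failure: -{name} ({os.strerror(-ret)})"


# The value reported in this bug: 3 vectors requested, -22 returned.
print(describe_msix_result(-22, 3))
```

On Linux, errno 22 is EINVAL, so the failure here is an "invalid argument" style rejection rather than a shortage of vectors.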
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Just load the InfiniBand drivers with mlx4 hardware. Since we do not have
anything special, I believe it will happen for any device supporting MSI-X. In
fact, I can't see any device that works with MSI-X (cat /proc/interrupts).
Steps to Reproduce:
output of lspci for this HW:
13:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)
    Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR]
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 177
    Region 0: Memory at fdf00000 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at f7000000 (64-bit, prefetchable) [size=8M]
    Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=8K]
    Capabilities:  Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-
    Capabilities:  Vital Product Data
    Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00001000
    Capabilities:  Express Endpoint IRQ 0
        Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+
        Device: Latency L0s <64ns, L1 unlimited
        Device: AtnBtn- AtnInd- PwrInd-
        Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
        Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
        Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes
        Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
        Link: Latency L0s unlimited, L1 unlimited
        Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
        Link: Speed 2.5Gb/s, Width x8
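The key line in the lspci output above is the MSI-X capability: "Enable-" shows that MSI-X is not switched on, even though the device advertises a 256-entry table. Below is a minimal sketch for pulling those fields out of lspci text. It assumes the line format shown in this report (other lspci versions print this capability differently), and the function name parse_msix_capability is hypothetical.

```python
import re

# Matches the capability line as printed in this report, e.g.
# "Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256"
MSIX_RE = re.compile(
    r"MSI-X:\s+Enable(?P<enable>[+-])\s+Mask(?P<mask>[+-])\s+TabSize=(?P<tabsize>\d+)"
)


def parse_msix_capability(lspci_text: str):
    """Return the MSI-X state found in lspci -vv text, or None if absent."""
    match = MSIX_RE.search(lspci_text)
    if match is None:
        return None
    return {
        "enabled": match.group("enable") == "+",
        "masked": match.group("mask") == "+",
        "table_size": int(match.group("tabsize")),
    }


sample = "Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256"
print(parse_msix_capability(sample))
# prints {'enabled': False, 'masked': False, 'table_size': 256}
```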
I think your problem may be hardware-specific. From a RHEL 5.2 beta kernel on a
Dell PowerEdge 1900 (or 1950, can't remember which):
[dledford@ib0test2 ~]$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:  533316704          0          0          0  IO-APIC-edge   timer
  1:          3          0          0          0  IO-APIC-edge   i8042
  6:          5          0          0          0  IO-APIC-edge   floppy
  8:          1          0          0          0  IO-APIC-edge   rtc
  9:          0          0          0          0  IO-APIC-level  acpi
 12:          4          0          0          0  IO-APIC-edge   i8042
 14:    4787812       2952          0          0  IO-APIC-edge   ide0
 66:   23814940      15201          0          0  IO-APIC-level  uhci_hcd:usb1,
 74:         39          0          0          0  IO-APIC-level  uhci_hcd:usb2,
 82:     556320       2038         40          0  IO-APIC-level  libata
 90:         93        119          0          0  PCI-MSI        ib_ipath
 98:         61         57          0          0  PCI-MSI        ib_ipath
106:          0          0          0          0  PCI-MSI-X      eth1
114:         27          0          0          0  PCI-MSI-X      eth1 (queue 0)
178:    1740629          0          0          0  PCI-MSI        eth0
NMI:       5939       2273       2527       2550
LOC:  522492125  522744130  522491999  522743982
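To check a machine for MSI or MSI-X users without reading the table by eye, something like the following sketch can filter /proc/interrupts. It assumes the column layout shown in this thread (IRQ number, one count per CPU, controller type, device name), which differs on newer kernels. The function name msi_interrupts is hypothetical, and the sample is an excerpt of the output above.

```python
def msi_interrupts(text: str):
    """Return (irq, controller, device, total_count) for MSI/MSI-X rows."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    found = []
    for line in lines[1:]:
        fields = line.split()
        label = fields[0].rstrip(":")
        # Skip summary rows such as NMI/LOC and anything too short to
        # carry a controller and a device column.
        if not label.isdigit() or len(fields) < ncpus + 3:
            continue
        counts = [int(c) for c in fields[1:1 + ncpus]]
        controller = fields[1 + ncpus]
        device = " ".join(fields[2 + ncpus:])
        if controller.startswith("PCI-MSI"):
            found.append((int(label), controller, device, sum(counts)))
    return found


SAMPLE = """\
CPU0 CPU1 CPU2 CPU3
0: 533316704 0 0 0 IO-APIC-edge timer
90: 93 119 0 0 PCI-MSI ib_ipath
106: 0 0 0 0 PCI-MSI-X eth1
114: 27 0 0 0 PCI-MSI-X eth1 (queue 0)
NMI: 5939 2273 2527 2550
"""

for row in msi_interrupts(SAMPLE):
    print(row)
```

On a live system the same function could be fed open("/proc/interrupts").read(). An empty result would match the reporter's observation that no device is using MSI-X.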
I seem to recall we had to blacklist certain motherboard chipsets due to faulty
MMCFG cycles. You may have one of those motherboards/chipsets and it may be
refusing to allocate MSI interrupts because of that. Can you check to see if
this is one of the affected platforms? The bugzillas for the RHEL5 bugs related
to this are:
182436 xw9400 AMD processors do not support ext config space
239673 stalled installation over HP Compaq 7700
250313 MCP55 chipset hides PCI EXTCFG
251032 add HP dl385g2 and dl585g2 to whitelist
252215 dl585g2/AMD8132 blacklist
253288 PCI domain support for x86/x86_64
408551 all PCI express registers are not accessible
It would be good to know what platform Eli was on.
MMCFG is completely orthogonal to MSI-X. But there are also PCI quirks that
disable MSI on various systems, and Eli may be hitting a new one.
The server is an HP ProLiant DL360 G5.
We have kernel 2.6.24 running on this server with MSI-X working fine.
Please try the latest RHEL 5.2 build and attach the output of dmesg.
There is an MMCONFIG patch ACKed in Dec '07 and incorporated in Jan that should
help with any MMCONFIG problems.
Rather than using a blacklist, the patch first tests the Northbridge to see if
MMCONFIG works. If so, then MMCONFIG (ergo MSI) is available. If the Northbridge
does not respond correctly to MMCONFIG cycles, the system is constrained to port
I/O accesses, which may preclude MSI configuration.
We don't have RHEL 5.2 here. If you send us a copy, we will install and test it.
Can you see any kind of HW using MSI-X on this machine with RHEL 5.2?
Eli, you can grab the latest RHEL5.2 build from
AFAICT MSI-X is working:
[root@hp-dl360g5-01 ~]# cat /proc/interrupts | grep MSI
114: 8667 1958 183 0 0 0 0 0 PCI-MSI-X cciss0
138: 5558 0 0 0 0 0 0 0 PCI-MSI eth0
The RPMs we found at the URL you specified do not contain kernel header files,
so we can't build our driver. Can you point us to the missing kernel headers?
Here is the rpm for the latest kernel sources
And the mlx4 driver is already in that kernel, so you don't need to build it.
However, I'm pretty sure the problem here isn't related to the mlx4 driver but
to the core MSI interrupt handling instead. I would greatly appreciate it if
you could install the src rpm from comment #8, then build two new kernels
using the two patches I'm going to attach to this bug. The easiest way to do
that would be something like this:
1. Install the src rpm above.
2. Download the two patches I'm attaching to this bug.
3. Edit kernel-2.6.spec to uncomment the "#% define buildid" line and use it to
   differentiate between the two builds.
4. For each build, copy one of the patches to
   /usr/src/redhat/SOURCES/linux-kernel-test.patch and then run
   rpmbuild -ba --with baseonly kernel-2.6.spec
The binary rpms will be written to /usr/src/redhat/RPMS/<arch>, where they
can be installed and tested. I'm interested in knowing whether either of the
attached patches solves your problem by itself and, if neither does, whether
both of them together do.
Created attachment 300288
Add a quirk for ht1000 pci-e bridges
This patch should really only make a difference if the failing system uses
HT1000 pci-e bridges.
Created attachment 300289
4 different upstream changes related to msi handling
This is a more generic set of changes that might resolve the issue regardless
of the chipset in the system.
Is there any additional status?
Any testing you want me to do?
Or can we close this?
Not getting any responses, so I assume the patch submitted by Doug in comment 11 has fixed the problem.
Please close this bug.