Description of problem:
The mlx4_core driver calls pci_enable_msix() to request 3 MSI-X vectors, but the call fails with -22.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.1 (Tikanga)

How reproducible:
Just load the InfiniBand drivers with mlx4 hardware. Since there is nothing special about our setup, I believe it will happen for any device supporting MSI-X. In fact, I can't see any device that works with MSI-X on this machine (cat /proc/interrupts).

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Output of lspci for this HW:

13:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)
        Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at fdf00000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at f7000000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00001000
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag+
                Device: Latency L0s <64ns, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 256 bytes, MaxReadReq 4096 bytes
                Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x8
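For reference, error -22 is -EINVAL. The request itself is the standard pci_enable_msix() pattern on a 2.6.18-era kernel; a minimal sketch of what such a call looks like is below (illustrative only, not the actual mlx4_core code; the function name, the vector-count define and the fallback behavior are placeholders):

    #include <linux/pci.h>

    #define EXAMPLE_NUM_VECTORS 3   /* mlx4_core asks for 3 vectors */

    static int example_setup_msix(struct pci_dev *pdev)
    {
            struct msix_entry entries[EXAMPLE_NUM_VECTORS];
            int i, err;

            for (i = 0; i < EXAMPLE_NUM_VECTORS; i++)
                    entries[i].entry = i;

            /*
             * Returns 0 on success, a positive count if fewer vectors are
             * available, or a negative errno such as -EINVAL (-22) if
             * MSI-X cannot be enabled at all -- the failure seen here.
             */
            err = pci_enable_msix(pdev, entries, EXAMPLE_NUM_VECTORS);
            if (err) {
                    dev_warn(&pdev->dev,
                             "pci_enable_msix failed (%d), falling back to INTx\n",
                             err);
                    return err;
            }

            /* entries[i].vector now holds the IRQ number for each vector. */
            return 0;
    }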
I think your problem may be hardware specific. From the RHEL 5.2 beta kernel on a Dell PowerEdge 1900 (or 1950, can't remember which):

[dledford@ib0test2 ~]$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:  533316704          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0    IO-APIC-edge  i8042
  6:          5          0          0          0    IO-APIC-edge  floppy
  8:          1          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0    IO-APIC-edge  i8042
 14:    4787812       2952          0          0    IO-APIC-edge  ide0
 66:   23814940      15201          0          0   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
 74:         39          0          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb4
 82:     556320       2038         40          0   IO-APIC-level  libata
 90:         93        119          0          0         PCI-MSI  ib_ipath
 98:         61         57          0          0         PCI-MSI  ib_ipath
106:          0          0          0          0       PCI-MSI-X  eth1
114:         27          0          0          0       PCI-MSI-X  eth1 (queue 0)
178:    1740629          0          0          0         PCI-MSI  eth0
NMI:       5939       2273       2527       2550
LOC:  522492125  522744130  522491999  522743982
ERR:          0
MIS:          0
[dledford@ib0test2 ~]$

I seem to recall we had to blacklist certain motherboard chipsets due to faulty MMCFG cycles. You may have one of those motherboards/chipsets, and it may be refusing to allocate MSI interrupts because of that. Can you check to see if this is one of the affected platforms? The bugzillas for the RHEL5 bugs related to this are:

Bugzillas
=========
182436  xw9400 AMD processors do not support ext config space
239673  stalled installation over HP Compaq 7700
250313  MCP55 chipset hides PCI EXTCFG
251032  add HP dl385g2 and dl585g2 to whitelist
252215  dl585g2/AMD8132 blacklist
253288  PCI domain support for x86/x86_64
408551  all PCI express registers are not accessible
It would be good to know what platform Eli was on. MMCFG is completely orthogonal to MSI-X. But there are also PCI quirks that disable MSI on various systems, and Eli may be hitting a new one.
The server is an HP ProLiant DL360 G5. We have kernel 2.6.24 running on this server with MSI-X working fine.
Please try the latest RHEL 5.2 build and attach the output of dmesg. There is an MMCONFIG patch, ACKed in Dec '07 and incorporated in January, that should help with any MMCONF problems. Rather than using a blacklist, the patch first tests the Northbridge to see if MMCONFIG works. If so, then MMCONFIG (and hence MSI) is available. If the Northbridge does not respond correctly to MMCONFIG cycles, the kernel is constrained to port I/O accesses, which may preclude MSI configuration.
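The test is conceptually a read-and-compare: the same config register is read once through legacy port I/O (CF8/CFC) cycles and once through the MMCONFIG aperture, and MMCONFIG is only trusted if the two agree. A rough sketch of the idea, not the actual RHEL patch; mmcfg_readl() below is a hypothetical accessor standing in for whatever helper maps the MMCONFIG window:

    #include <linux/pci.h>
    #include <asm/io.h>

    static int __init mmconfig_seems_sane(void)
    {
            u32 via_portio, via_mmcfg;

            /* Vendor/device ID of bus 0, dev 0, fn 0 via legacy type-1 cycles. */
            outl(0x80000000 | (PCI_DEVFN(0, 0) << 8) | PCI_VENDOR_ID, 0xCF8);
            via_portio = inl(0xCFC);

            /* Same register through the MMCONFIG window (hypothetical helper). */
            via_mmcfg = mmcfg_readl(0 /* bus */, PCI_DEVFN(0, 0), PCI_VENDOR_ID);

            /*
             * Only trust MMCONFIG (and thus extended config space access)
             * if both paths return the same, valid value.
             */
            return via_portio == via_mmcfg && via_portio != 0xffffffff;
    }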
We don't have RHEL 5.2 here. If you send us a copy we will install and test it. Can you see any kind of HW using MSI-X on this machine with RHEL 5.2?
Eli, you can grab the latest RHEL 5.2 build from http://people.redhat.com/dzickus/el5/

AFAICT MSI-X is working:

[root@hp-dl360g5-01 ~]# cat /proc/interrupts | grep MSI
114:       8667       1958        183          0          0          0          0          0       PCI-MSI-X  cciss0
138:       5558          0          0          0          0          0          0          0         PCI-MSI  eth0

P.
The RPMs we found at the URL you specified do not contain the kernel header files, so we can't build our driver. Can you point us to the missing kernel packages?
Here is the RPM for the latest kernel sources:
http://people.redhat.com/dzickus/el5/88.el5/src/kernel-2.6.18-88.el5.src.rpm
And the mlx4 driver is already in that kernel, so you don't need to build it separately. However, I'm pretty sure the problem here isn't related to the mlx4 driver but to the core MSI interrupt handling instead.

What I would greatly appreciate is if you could install the src rpm from comment #8, then build two new kernels using the two patches I'm going to attach to this bug. The easiest way to do that would be something like this:

- install the src rpm above
- download the two patches I'm attaching to this bug
- cd /usr/src/redhat/SPECS
- edit kernel-2.6.spec to uncomment the #% define buildid and use it to differentiate between the two builds
- for each build, copy one of the patches to /usr/src/redhat/SOURCES/linux-kernel-test.patch and then run:
  rpmbuild --ba --with baseonly kernel-2.6.spec

The binary rpms will get spit out into /usr/src/redhat/RPMS/<arch>, where they can be installed and tested. I'm interested in knowing if either of the attached patches by itself solves your problem, and if neither does, then whether or not both of them together do.
Created attachment 300288 [details]
Add a quirk for ht1000 pci-e bridges

This patch should really only make a difference if the failing system uses HT1000 PCI-E bridges.
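For anyone not familiar with PCI quirks: they are small fixups keyed on vendor/device IDs that the PCI core runs as it discovers devices. A minimal sketch of the mechanism is below; the device ID and the action taken are placeholders, not the contents of this attachment (the real patch may, for example, fix up the HyperTransport MSI mapping instead):

    #include <linux/pci.h>

    /*
     * Illustrative quirk skeleton only -- the actual HT1000 change is in
     * the attachment above. One common pattern for bridge MSI quirks is
     * shown: flag the bridge's secondary bus so devices behind it avoid MSI.
     */
    static void __devinit quirk_example_bridge_msi(struct pci_dev *dev)
    {
            if (dev->subordinate) {
                    dev_warn(&dev->dev,
                             "MSI quirk: disabling MSI behind this bridge\n");
                    dev->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
            }
    }

    /* 0x1166 is the ServerWorks/Broadcom vendor ID; the device ID is a placeholder. */
    DECLARE_PCI_FIXUP_FINAL(0x1166, 0x0001 /* placeholder */, quirk_example_bridge_msi);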
Created attachment 300289 [details]
4 different upstream changes related to msi handling

This is a more generic set of changes that might resolve the issue regardless of the chipset in the system.
Doug,

Is there any additional status? Any testing you want me to do? Or can we close this?
We're not getting any responses, so I assume the patch Doug submitted in comment 11 has fixed the problem. Please close this bug.