Description of problem:
On a box with a QLogic HTX IB card, kernel 2.6.18-85.el5.x86_64 panics with the following backtrace:

Starting udev: ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/slab.c:2650
invalid opcode: 0000 [1] SMP
last sysfs file: /class/ib_ipath/ipath_diag0/dev
CPU 0
Modules linked in: joydev ib_mthca i2c_piix4 ib_mad ide_cd ib_ipath k8_edac shpchp i2c_core floppy ib_core edac_mc k8temp cdrom hwmon serio_rad
Pid: 989, comm: modprobe Not tainted 2.6.18-85.el5 #1
RIP: 0010:[<ffffffff8001717f>] [<ffffffff8001717f>] cache_grow+0x1e/0x395
RSP: 0018:ffff81012f275a28 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 00000000000080d0 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffff810037cea1c0
RBP: ffff810002f9e2e0 R08: ffff810037cc5800 R09: ffff810037cc7000
R10: 0000000000000000 R11: ffffffff88220643 R12: ffff810037cea1c0
R13: ffff810002f9e2c0 R14: 0000000000000000 R15: ffff810037cea1c0
FS: 00002aaaab01b6e0(0000) GS:ffffffff8039d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000555555762588 CR3: 000000012f286000 CR4: 00000000000006e0
Process modprobe (pid: 989, threadinfo ffff81012f274000, task ffff81012fe407a0)
Stack: ffff81012f874c80 0000000000100000 ffff81007cb72910 ffffffff8011f5cd
 00000000000000d0 00000000ffffffff ffff810002f9e2e0 ffff810037cc5800
 ffff810002f9e2c0 000000000000003c ffff810037cea1c0 ffffffff8005be13
Call Trace:
 [<ffffffff8011f5cd>] inode_has_perm+0x56/0x63
 [<ffffffff8005be13>] cache_alloc_refill+0x136/0x186
 [<ffffffff800d351e>] kmem_cache_alloc_node+0x98/0xb2
 [<ffffffff800c9783>] __vmalloc_area_node+0x62/0x153
 [<ffffffff800c9ac9>] vmalloc_user+0x15/0x50
 [<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb
 [<ffffffff8819f96c>] :ib_core:show_port_pkey+0x0/0x41
 [<ffffffff8825f266>] :ib_mad:ib_mad_thread_completion_handler+0x0/0x43
 [<ffffffff8819edb9>] :ib_core:ib_create_cq+0x27/0x55
 [<ffffffff8825eaf0>] :ib_mad:ib_mad_init_device+0x104/0x575
 [<ffffffff881a051b>] :ib_core:ib_register_device+0x365/0x402
 [<ffffffff8821f067>] :ib_ipath:ipath_get_counters+0x19c/0x1d6
 [<ffffffff8821fada>] :ib_ipath:ipath_register_ib_device+0x4e5/0x685
 [<ffffffff8820b8d5>] :ib_ipath:ipath_init_one+0xe18/0xe7f
 [<ffffffff8014f272>] pci_device_probe+0x100/0x180
 [<ffffffff801aef3f>] driver_probe_device+0x52/0xaa
 [<ffffffff801af06e>] __driver_attach+0x65/0xb6
 [<ffffffff801af009>] __driver_attach+0x0/0xb6
 [<ffffffff801ae980>] bus_for_each_dev+0x43/0x6e
 [<ffffffff801ae5c6>] bus_add_driver+0x7e/0x130
 [<ffffffff8014f44a>] __pci_register_driver+0x4b/0x6c
 [<ffffffff8824705c>] :ib_ipath:infinipath_init+0x5c/0xe1
 [<ffffffff800a3cef>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83

Code: 0f 0b 68 0f 2e 29 80 c2 5a 0a f6 c7 20 0f 85 53 03 00 00 89
RIP [<ffffffff8001717f>] cache_grow+0x1e/0x395
 RSP <ffff81012f275a28>
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
kernel-2.6.18-85.el5.x86_64, RHEL5.2-Server-20080313.1 distro.

How reproducible:
Very.

Steps to Reproduce:
1. Install the RHEL5.2-Server-20080313.1 distro on a box with a QLogic HTX card.
2. Reboot.
3.

Actual results:

Expected results:

Additional info:
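For anyone triaging a trace like this one, here is a minimal sketch (not part of the original report) that pulls the module and symbol out of each call-trace frame, so the path from ib_ipath through ib_core and ib_mad into the allocator is easier to follow. The regular expression, the function name parse_frames, and the SAMPLE excerpt are illustrative assumptions; the frames in SAMPLE are copied from the trace in this report.

```python
import re

# Matches 2.6.18-style frames such as
#   [<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb
# Frames without a ":module:" prefix belong to the core kernel image.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(?::(\w+):)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(trace_text):
    """Return (module_or_None, symbol) pairs in the order they appear."""
    return [(m.group(2), m.group(3)) for m in FRAME_RE.finditer(trace_text)]


# Hypothetical excerpt: four frames copied from the call trace in this report.
SAMPLE = """
[<ffffffff800c9ac9>] vmalloc_user+0x15/0x50
[<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb
[<ffffffff8819edb9>] :ib_core:ib_create_cq+0x27/0x55
[<ffffffff8825eaf0>] :ib_mad:ib_mad_init_device+0x104/0x575
"""

if __name__ == "__main__":
    for module, symbol in parse_frames(SAMPLE):
        # Label core-kernel frames explicitly so modules stand out.
        print(f"{module or 'kernel':10s} {symbol}")
```

Because the pattern does not depend on line breaks, it should also work on a trace that has been pasted as a single line.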
The fix isn't a kernel patch, it's a change to the initscripts. It will be present in the openib-1.3-2.el5 package.
OK, so I have been trying all this with the latest packages from the RHBA-2008:8175-11 advisory and the 2.6.18-88.el5 kernel; however, I can't get the QLogic PCIe cards to work at all. The kernel can see and recognize the cards:

22:00.0 InfiniBand: PathScale, Inc InfiniPath PE-800 (rev 01)
        Subsystem: PathScale, Inc InfiniPath PE-800
3a:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
        Subsystem: Mellanox Technologies MT23108 InfiniHost
51:00.0 InfiniBand: PathScale, Inc InfiniPath PE-800 (rev 02)
        Subsystem: PathScale, Inc InfiniPath PE-800
60:14.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 03)
        Subsystem: PathScale, Inc InfiniPath HT-400

But I can't get them up:

Apr 2 22:40:29 ibm-ridgeback kernel: ib_ipath 0000:22:00.0: IB link is not ACTIVE
Apr 2 22:40:30 ibm-ridgeback kernel: ib_ipath 0000:51:00.0: IB link is not ACTIVE

I'll attach the output of dmidecode to this bug. If you'd like to poke around, the box is ibm-ridgeback.rhts.boston.redhat.com.
Created attachment 300164 [details] dmidecode output
This isn't an InfiniBand problem. It is related to the MMCONF PCI changes made in the RHEL 5.2 kernel; these cards are either an accidental or an intentional victim of those changes. In particular, we see these messages from the kernel:

Linux version 2.6.18-88.el5 (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:18 EDT 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200 rhgb quiet
...
ACPI: bus type pci registered
PCI: Using MMCONFIG at f0000000
PCI: No mmconfig possible on device 0:18
PCI: No mmconfig possible on device 0:19
PCI: No mmconfig possible on device 0:1a
PCI: No mmconfig possible on device 0:1b
PCI: Buses that can't use MMCONFIG will use type 1 PCI conf access.
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:00:01.0
PCI: Ignoring BAR0-3 of IDE controller 0000:00:08.1
PCI: If a device isn't working, try "pci=nommconf".
...
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
...
PCI: MSI quirk detected. MSI deactivated.
PCI: Setting latency timer of device 0000:00:0a.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0a.0:pcie00]
Allocate Port Service[0000:00:0a.0:pcie01]
PCI: Setting latency timer of device 0000:00:0b.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0b.0:pcie00]
Allocate Port Service[0000:00:0b.0:pcie01]
PCI: Setting latency timer of device 0000:00:0c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0c.0:pcie00]
Allocate Port Service[0000:00:0c.0:pcie01]
PCI: Setting latency timer of device 0000:00:0d.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0d.0:pcie00]
Allocate Port Service[0000:00:0d.0:pcie01]
PCI: Setting latency timer of device 0000:00:0e.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0e.0:pcie00]
Allocate Port Service[0000:00:0e.0:pcie01]
PCI: Setting latency timer of device 0000:40:0f.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:0f.0:pcie00]
Allocate Port Service[0000:40:0f.0:pcie01]
PCI: Setting latency timer of device 0000:40:10.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:10.0:pcie00]
Allocate Port Service[0000:40:10.0:pcie01]
PCI: Setting latency timer of device 0000:40:11.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:11.0:pcie00]
Allocate Port Service[0000:40:11.0:pcie01]
PCI: Setting latency timer of device 0000:40:12.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:12.0:pcie00]
Allocate Port Service[0000:40:12.0:pcie01]
PCI: Setting latency timer of device 0000:40:13.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:13.0:pcie00]
Allocate Port Service[0000:40:13.0:pcie01]
...
PCI: Setting latency timer of device 0000:22:00.0 to 64
ib_ipath 0000:22:00.0: infinipath0: pci_enable_msi failed: -22, interrupts may not work
ib_ipath 0000:22:00.0: infinipath0: irq is 0, BIOS error? Interrupts won't work
ib_ipath 0000:22:00.0: No interrupts detected, not usable.
...
ib_ipath 0000:51:00.0: infinipath1: pci_enable_msi failed: -22, interrupts may not work
ib_ipath 0000:51:00.0: infinipath1: irq is 0, BIOS error? Interrupts won't work
ib_ipath 0000:51:00.0: No interrupts detected, not usable.
...
ib_ipath 0000:22:00.0: IB link is not ACTIVE
ib_ipath 0000:51:00.0: IB link is not ACTIVE

So, the long and short of it is that on this particular hardware, MSI interrupts worked on these two cards with RHEL 5.1 kernels, and now they don't. It would seem that the changes to the MSI interrupt handlers in the kernel are to blame.
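To check other machines for the same symptom, the following is a rough sketch (an editorial illustration, not something used in this report) that scans kernel log lines for the pci_enable_msi failure message and reports the affected PCI addresses. The function name msi_failures and the SAMPLE_LOG excerpt are assumptions made for the example; the log lines are copied from the messages above. The value -22 corresponds to -EINVAL.

```python
import errno
import re

# Matches lines such as
#   ib_ipath 0000:22:00.0: infinipath0: pci_enable_msi failed: -22, ...
# An optional syslog prefix before the driver name is tolerated
# because the pattern is searched for, not anchored.
MSI_FAIL_RE = re.compile(
    r"(?P<driver>\w+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r".*pci_enable_msi failed: (?P<err>-?\d+)"
)


def msi_failures(log_lines):
    """Map PCI address (domain:bus:dev.fn) to the reported error code."""
    failures = {}
    for line in log_lines:
        match = MSI_FAIL_RE.search(line)
        if match:
            failures[match.group("bdf")] = int(match.group("err"))
    return failures


# Hypothetical excerpt built from the messages quoted in this comment.
SAMPLE_LOG = """\
PCI: MSI quirk detected. MSI deactivated.
ib_ipath 0000:22:00.0: infinipath0: pci_enable_msi failed: -22, interrupts may not work
ib_ipath 0000:22:00.0: No interrupts detected, not usable.
ib_ipath 0000:51:00.0: infinipath1: pci_enable_msi failed: -22, interrupts may not work
ib_ipath 0000:22:00.0: IB link is not ACTIVE
"""

if __name__ == "__main__":
    for bdf, err in msi_failures(SAMPLE_LOG.splitlines()).items():
        # errno.errorcode maps positive errno values to their symbolic names.
        name = errno.errorcode.get(-err, "unknown")
        print(f"{bdf}: pci_enable_msi returned {err} ({name})")
```

In practice the input would come from saved dmesg output or /var/log/messages rather than an inline string.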
Changing component to kernel. One of Andy Gospodarek's PCI patches resolved the issue entirely. I'll attach that patch to this report and also post to rhkernel-list.
Created attachment 300290 [details]
Quirk patch

This patch (from Andy Gospodarek) has been confirmed to solve the problem on the target machine.
Yeah, the kernel with Andy's patch, 2.6.18-88.el5.ht1000_quirk, is indeed working. We should have another spin for 5.2 to get this patch included.
*** Bug 439110 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-89.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
(In reply to comment #13)
> in kernel-2.6.18-89.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

Yup, this kernel indeed works:

[root@ibm-ridgeback ~]# uname -a
Linux ibm-ridgeback.rhts.boston.redhat.com 2.6.18-89.el5 #1 SMP Tue Apr 8 16:04:14 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@ibm-ridgeback ~]# ibstat ipath1
CA 'ipath1'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Firmware version:
        Hardware version: 2
        Node GUID: 0x0011750000ffd9ce
        System image GUID: 0x001175000068709f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 3
                LMC: 0
                SM lid: 2
                Capability mask: 0x02010800
                Port GUID: 0x0011750000ffd9ce
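The same verification could be scripted when several machines need checking. Below is a minimal sketch, assuming ibstat output in the "Key: value" layout shown above; the helper names parse_ibstat_fields and port_is_up and the SAMPLE text are illustrative, and the sketch handles only a single-port adapter like the one in this report.

```python
def parse_ibstat_fields(text):
    """Collect 'Key: value' pairs from ibstat-style output into a dict.

    Lines without a colon (such as the "CA 'ipath1'" header) are skipped.
    Assumes a single port, so later keys do not overwrite earlier ones.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def port_is_up(fields):
    """True when the port reports State Active and Physical state LinkUp."""
    return (
        fields.get("State") == "Active"
        and fields.get("Physical state") == "LinkUp"
    )


# Hypothetical input: an excerpt of the ibstat output quoted in this comment.
SAMPLE = """\
CA 'ipath1'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 3
"""

if __name__ == "__main__":
    info = parse_ibstat_fields(SAMPLE)
    print("link up" if port_is_up(info) else "link down")
```

On a live system the text would be the captured output of `ibstat ipath1`, the command shown above.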
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html
*** Bug 241257 has been marked as a duplicate of this bug. ***