Bug 438776 - kernel panic with ib_ipath module with kernel 2.6.18-85.el5.x86_64
Summary: kernel panic with ib_ipath module with kernel 2.6.18-85.el5.x86_64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard:
Duplicates: 241257 439110
Depends On:
Blocks:
 
Reported: 2008-03-25 02:42 UTC by Gurhan Ozen
Modified: 2013-11-04 01:35 UTC (History)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 15:12:27 UTC
Target Upstream Version:
Embargoed:


Attachments
dmidecode output (37.42 KB, text/plain), 2008-04-03 03:37 UTC, Gurhan Ozen
Quirk patch (3.39 KB, patch), 2008-04-03 17:46 UTC, Doug Ledford


Links
System: Red Hat Product Errata
ID: RHBA-2008:0314
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5.2
Last Updated: 2008-05-20 18:43:34 UTC

Description Gurhan Ozen 2008-03-25 02:42:44 UTC
Description of problem:
  On a box with a QLogic HTX IB card, kernel 2.6.18-85.el5.x86_64 panics during
boot with the following backtrace:

Starting udev: ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/slab.c:2650
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/ib_ipath/ipath_diag0/dev
CPU 0 
Modules linked in: joydev ib_mthca i2c_piix4 ib_mad ide_cd ib_ipath k8_edac
shpchp i2c_core floppy ib_core edac_mc k8temp cdrom hwmon serio_rad
Pid: 989, comm: modprobe Not tainted 2.6.18-85.el5 #1
RIP: 0010:[<ffffffff8001717f>]  [<ffffffff8001717f>] cache_grow+0x1e/0x395
RSP: 0018:ffff81012f275a28  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 00000000000080d0 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffff810037cea1c0
RBP: ffff810002f9e2e0 R08: ffff810037cc5800 R09: ffff810037cc7000
R10: 0000000000000000 R11: ffffffff88220643 R12: ffff810037cea1c0
R13: ffff810002f9e2c0 R14: 0000000000000000 R15: ffff810037cea1c0
FS:  00002aaaab01b6e0(0000) GS:ffffffff8039d000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000555555762588 CR3: 000000012f286000 CR4: 00000000000006e0
Process modprobe (pid: 989, threadinfo ffff81012f274000, task ffff81012fe407a0)
Stack:  ffff81012f874c80 0000000000100000 ffff81007cb72910 ffffffff8011f5cd
 00000000000000d0 00000000ffffffff ffff810002f9e2e0 ffff810037cc5800
 ffff810002f9e2c0 000000000000003c ffff810037cea1c0 ffffffff8005be13
Call Trace:
 [<ffffffff8011f5cd>] inode_has_perm+0x56/0x63
 [<ffffffff8005be13>] cache_alloc_refill+0x136/0x186
 [<ffffffff800d351e>] kmem_cache_alloc_node+0x98/0xb2
 [<ffffffff800c9783>] __vmalloc_area_node+0x62/0x153
 [<ffffffff800c9ac9>] vmalloc_user+0x15/0x50
 [<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb
 [<ffffffff8819f96c>] :ib_core:show_port_pkey+0x0/0x41
 [<ffffffff8825f266>] :ib_mad:ib_mad_thread_completion_handler+0x0/0x43
 [<ffffffff8819edb9>] :ib_core:ib_create_cq+0x27/0x55
 [<ffffffff8825eaf0>] :ib_mad:ib_mad_init_device+0x104/0x575
 [<ffffffff881a051b>] :ib_core:ib_register_device+0x365/0x402
 [<ffffffff8821f067>] :ib_ipath:ipath_get_counters+0x19c/0x1d6
 [<ffffffff8821fada>] :ib_ipath:ipath_register_ib_device+0x4e5/0x685
 [<ffffffff8820b8d5>] :ib_ipath:ipath_init_one+0xe18/0xe7f
 [<ffffffff8014f272>] pci_device_probe+0x100/0x180
 [<ffffffff801aef3f>] driver_probe_device+0x52/0xaa
 [<ffffffff801af06e>] __driver_attach+0x65/0xb6
 [<ffffffff801af009>] __driver_attach+0x0/0xb6
 [<ffffffff801ae980>] bus_for_each_dev+0x43/0x6e
 [<ffffffff801ae5c6>] bus_add_driver+0x7e/0x130
 [<ffffffff8014f44a>] __pci_register_driver+0x4b/0x6c
 [<ffffffff8824705c>] :ib_ipath:infinipath_init+0x5c/0xe1
 [<ffffffff800a3cef>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 0f 0b 68 0f 2e 29 80 c2 5a 0a f6 c7 20 0f 85 53 03 00 00 89 
RIP  [<ffffffff8001717f>] cache_grow+0x1e/0x395
 RSP <ffff81012f275a28>
 <0>Kernel panic - not syncing: Fatal exception
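
For anyone reading the backtrace, a minimal Python sketch of splitting Call Trace frames into module and symbol may help show which frames belong to ib_ipath. The regular expression and the names FRAME_RE and parse_frames are illustrative assumptions, not part of any kernel tool; the sample is a few frames copied from the trace above.

```python
import re

# A frame looks like " [<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb"
# (module-qualified) or " [<ffffffff800c9ac9>] vmalloc_user+0x15/0x50" (core kernel).
# This pattern is an assumption based only on the lines quoted in this report.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(trace_text):
    """Return (module, symbol) pairs in trace order; module is None for core kernel frames."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("module"), match.group("symbol")))
    return frames


SAMPLE_TRACE = """\
 [<ffffffff800c9783>] __vmalloc_area_node+0x62/0x153
 [<ffffffff800c9ac9>] vmalloc_user+0x15/0x50
 [<ffffffff88207521>] :ib_ipath:ipath_create_cq+0x7c/0x1eb
 [<ffffffff8819edb9>] :ib_core:ib_create_cq+0x27/0x55
 [<ffffffff8825eaf0>] :ib_mad:ib_mad_init_device+0x104/0x575
"""

frames = parse_frames(SAMPLE_TRACE)
# Keep only the frames that come from the ib_ipath module.
ipath_symbols = [symbol for module, symbol in frames if module == "ib_ipath"]
```

Read bottom to top, the full trace shows module load (infinipath_init) reaching ipath_create_cq, which calls vmalloc_user before the BUG in cache_grow.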



Version-Release number of selected component (if applicable):
kernel-2.6.18-85.el5.x86_64, from the RHEL5.2-Server-20080313.1 distro.

How reproducible:
Very reproducible.

Steps to Reproduce:
1. Install the RHEL5.2-Server-20080313.1 distro on a box with a QLogic HTX card.
2. Reboot.
  
Actual results:


Expected results:


Additional info:

Comment 2 Doug Ledford 2008-03-31 20:02:54 UTC
The fix isn't a kernel patch, it's a change to the initscripts.  It will be
present in the openib-1.3-2.el5 package.

Comment 4 Gurhan Ozen 2008-04-03 03:20:41 UTC
OK, so I have been trying all this with the latest packages from the
RHBA-2008:8175-11 advisory and the 2.6.18-88.el5 kernel; however, I can't get the
QLogic PCIe cards to work at all. The kernel can see and recognize the cards:

22:00.0 InfiniBand: PathScale, Inc InfiniPath PE-800 (rev 01)
        Subsystem: PathScale, Inc InfiniPath PE-800
3a:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
        Subsystem: Mellanox Technologies MT23108 InfiniHost
51:00.0 InfiniBand: PathScale, Inc InfiniPath PE-800 (rev 02)
        Subsystem: PathScale, Inc InfiniPath PE-800
60:14.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 03)
        Subsystem: PathScale, Inc InfiniPath HT-400

But can't get them up:

Apr  2 22:40:29 ibm-ridgeback kernel: ib_ipath 0000:22:00.0: IB link is not ACTIVE
Apr  2 22:40:30 ibm-ridgeback kernel: ib_ipath 0000:51:00.0: IB link is not ACTIVE

I'll attach output of dmidecode to this bug. If you'd like to poke around, the
box is ibm-ridgeback.rhts.boston.redhat.com .

Comment 5 Gurhan Ozen 2008-04-03 03:37:40 UTC
Created attachment 300164 [details]
dmidecode output

Comment 6 Doug Ledford 2008-04-03 15:57:45 UTC
This isn't an InfiniBand problem.  This is related to the MMCONF PCI changes
made in the RHEL 5.2 kernel.  These cards are either an accidental or an
intentional victim of those changes.  In particular, we see these messages from
the kernel:

Linux version 2.6.18-88.el5 (brewbuilder.redhat.com) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:18 EDT 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200
rhgb quiet

...

ACPI: bus type pci registered
PCI: Using MMCONFIG at f0000000
PCI: No mmconfig possible on device 0:18
PCI: No mmconfig possible on device 0:19
PCI: No mmconfig possible on device 0:1a
PCI: No mmconfig possible on device 0:1b
PCI: Buses that can't use MMCONFIG will use type 1 PCI conf access.
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:00:01.0
PCI: Ignoring BAR0-3 of IDE controller 0000:00:08.1
PCI: If a device isn't working, try "pci=nommconf".

...

PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report

...

PCI: MSI quirk detected. MSI deactivated.
PCI: Setting latency timer of device 0000:00:0a.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0a.0:pcie00]
Allocate Port Service[0000:00:0a.0:pcie01]
PCI: Setting latency timer of device 0000:00:0b.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0b.0:pcie00]
Allocate Port Service[0000:00:0b.0:pcie01]
PCI: Setting latency timer of device 0000:00:0c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0c.0:pcie00]
Allocate Port Service[0000:00:0c.0:pcie01]
PCI: Setting latency timer of device 0000:00:0d.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0d.0:pcie00]
Allocate Port Service[0000:00:0d.0:pcie01]
PCI: Setting latency timer of device 0000:00:0e.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:0e.0:pcie00]
Allocate Port Service[0000:00:0e.0:pcie01]
PCI: Setting latency timer of device 0000:40:0f.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:0f.0:pcie00]
Allocate Port Service[0000:40:0f.0:pcie01]
PCI: Setting latency timer of device 0000:40:10.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:10.0:pcie00]
Allocate Port Service[0000:40:10.0:pcie01]
PCI: Setting latency timer of device 0000:40:11.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:11.0:pcie00]
Allocate Port Service[0000:40:11.0:pcie01]
PCI: Setting latency timer of device 0000:40:12.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:12.0:pcie00]
Allocate Port Service[0000:40:12.0:pcie01]
PCI: Setting latency timer of device 0000:40:13.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:40:13.0:pcie00]
Allocate Port Service[0000:40:13.0:pcie01]

...

PCI: Setting latency timer of device 0000:22:00.0 to 64
ib_ipath 0000:22:00.0: infinipath0: pci_enable_msi failed: -22, interrupts may
not work
ib_ipath 0000:22:00.0: infinipath0: irq is 0, BIOS error?  Interrupts won't work
ib_ipath 0000:22:00.0: No interrupts detected, not usable.

...

ib_ipath 0000:51:00.0: infinipath1: pci_enable_msi failed: -22, interrupts may
not work
ib_ipath 0000:51:00.0: infinipath1: irq is 0, BIOS error?  Interrupts won't work
ib_ipath 0000:51:00.0: No interrupts detected, not usable.

...

ib_ipath 0000:22:00.0: IB link is not ACTIVE
ib_ipath 0000:51:00.0: IB link is not ACTIVE


So, the long and short of it is that on this particular hardware, with RHEL 5.1
kernels, MSI interrupts worked on these two cards and now they don't.  It would
seem that the changes to the MSI interrupt handling in the kernel are to blame.
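
To make the reasoning from the log excerpt easier to follow, here is a minimal Python sketch that summarizes the MSI failures per PCI device. It assumes the ib_ipath messages appear exactly as quoted in this comment; the regular expressions and the name summarize_msi_failures are hypothetical and not part of any existing tool. The -22 in the log is a negated errno value, which the sketch maps to its symbolic name.

```python
import errno
import re

# Patterns are assumptions derived only from the log lines quoted in this report.
MSI_FAIL_RE = re.compile(
    r"ib_ipath (?P<pci>[0-9a-f:.]+): \w+: pci_enable_msi failed: -(?P<err>\d+)"
)
UNUSABLE_RE = re.compile(
    r"ib_ipath (?P<pci>[0-9a-f:.]+): No interrupts detected, not usable"
)


def summarize_msi_failures(log_text):
    """Map PCI address -> {'msi_error': errno name or None, 'unusable': bool}."""
    devices = {}
    for line in log_text.splitlines():
        fail = MSI_FAIL_RE.search(line)
        if fail:
            entry = devices.setdefault(
                fail.group("pci"), {"msi_error": None, "unusable": False}
            )
            code = int(fail.group("err"))
            # Translate the numeric errno (22) to its symbolic name (EINVAL).
            entry["msi_error"] = errno.errorcode.get(code, str(code))
        unusable = UNUSABLE_RE.search(line)
        if unusable:
            entry = devices.setdefault(
                unusable.group("pci"), {"msi_error": None, "unusable": False}
            )
            entry["unusable"] = True
    return devices


SAMPLE_LOG = """\
ib_ipath 0000:22:00.0: infinipath0: pci_enable_msi failed: -22, interrupts may
not work
ib_ipath 0000:22:00.0: infinipath0: irq is 0, BIOS error?  Interrupts won't work
ib_ipath 0000:22:00.0: No interrupts detected, not usable.
ib_ipath 0000:51:00.0: infinipath1: pci_enable_msi failed: -22, interrupts may
not work
ib_ipath 0000:51:00.0: infinipath1: irq is 0, BIOS error?  Interrupts won't work
ib_ipath 0000:51:00.0: No interrupts detected, not usable.
"""

summary = summarize_msi_failures(SAMPLE_LOG)
```

Both PCIe cards (0000:22:00.0 and 0000:51:00.0) fail the same way: MSI setup returns EINVAL, no interrupt is assigned, and the driver declares the device unusable.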

Comment 7 Doug Ledford 2008-04-03 17:45:00 UTC
Changing component to kernel.  One of Andy Gospodarek's PCI patches resolved the
issue entirely.  I'll attach that patch to this report and also post to
rhkernel-list.

Comment 8 Doug Ledford 2008-04-03 17:46:18 UTC
Created attachment 300290 [details]
Quirk patch

This patch (from Andy Gospodarek) has been confirmed to solve the problem on
the target machine.

Comment 9 Gurhan Ozen 2008-04-03 18:03:19 UTC
Yes, the kernel with Andy's patch, 2.6.18-88.el5.ht1000_quirk, is indeed
working. We should have another spin for 5.2 so that this patch is included.

Comment 11 Andy Gospodarek 2008-04-08 21:36:52 UTC
*** Bug 439110 has been marked as a duplicate of this bug. ***

Comment 13 Don Zickus 2008-04-09 18:44:36 UTC
in kernel-2.6.18-89.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 Gurhan Ozen 2008-04-09 19:44:17 UTC
(In reply to comment #13)
> in kernel-2.6.18-89.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

Yup, this kernel indeed works:

[root@ibm-ridgeback ~]# uname -a
Linux ibm-ridgeback.rhts.boston.redhat.com 2.6.18-89.el5 #1 SMP Tue Apr 8
16:04:14 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@ibm-ridgeback ~]# ibstat ipath1
CA 'ipath1'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Firmware version: 
        Hardware version: 2
        Node GUID: 0x0011750000ffd9ce
        System image GUID: 0x001175000068709f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 3
                LMC: 0
                SM lid: 2
                Capability mask: 0x02010800
                Port GUID: 0x0011750000ffd9ce
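
The check performed by eye on the ibstat output above can be sketched in Python as follows. This assumes the ibstat text layout exactly as pasted in this comment; parse_ibstat_ports and port_is_up are hypothetical helper names, and the field values in the sample are copied from the output above.

```python
import re


def parse_ibstat_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text output."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Port (\d+):", line)
        if header:
            # A "Port N:" line starts a new per-port block.
            current = int(header.group(1))
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def port_is_up(fields):
    """A port is treated as working when it is logically Active and physically LinkUp."""
    return fields.get("State") == "Active" and fields.get("Physical state") == "LinkUp"


SAMPLE_IBSTAT = """\
CA 'ipath1'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Firmware version: 
        Hardware version: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 3
                Port GUID: 0x0011750000ffd9ce
"""

ports = parse_ibstat_ports(SAMPLE_IBSTAT)
all_up = all(port_is_up(fields) for fields in ports.values())
```

With the 2.6.18-89.el5 output, the single port reports State: Active and Physical state: LinkUp, in contrast to the earlier "IB link is not ACTIVE" messages.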


Comment 16 errata-xmlrpc 2008-05-21 15:12:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Comment 17 Prarit Bhargava 2008-06-19 17:32:08 UTC
*** Bug 241257 has been marked as a duplicate of this bug. ***

