Bug 501578 - Kernel BUG at include/linux/netdevice.h:921 - :e1000e:e1000_intr_msi+0xd2/0xdc
Summary: Kernel BUG at include/linux/netdevice.h:921 - :e1000e:e1000_intr_msi+0xd2/0xdc
Keywords:
Status: CLOSED DUPLICATE of bug 511918
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-05-19 20:03 UTC by Marcus Alves Grando
Modified: 2014-06-29 23:01 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-08-11 17:33:20 UTC
Target Upstream Version:
Embargoed:


Attachments
netpoll-napi-fix.patch (747 bytes, patch)
2009-05-26 20:31 UTC, Andy Gospodarek

Description Marcus Alves Grando 2009-05-19 20:03:13 UTC
Hello guys,

I found a problem while testing ocfs2, but it's not related to ocfs2. It may be something in the e1000e driver. See below:

# uname -a
Linux 12r.tpn.terra.com 2.6.18-128.1.10.el5 #1 SMP Wed Apr 29 13:53:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

crash> log
...
Kernel BUG at include/linux/netdevice.h:921
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:01.0/irq
CPU 2 
Modules linked in: hangcheck_timer ocfs2(U) ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs netconsole mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler dell_rbu nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc bonding iptable_filter ip_tables x_tables dm_round_robin dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg ide_cd k8temp bnx2 e1000e hwmon i2c_piix4 k8_edac serio_raw pcspkr edac_mc i2c_core cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc sata_svw libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 8213, comm: syslog-ng Tainted: G      2.6.18-128.1.10.el5 #1
RIP: 0010:[<ffffffff88342b42>]  [<ffffffff88342b42>] :e1000e:e1000_clean+0x20e/0x2ae
RSP: 0000:ffff81010fcf7ec0  EFLAGS: 00010046
RAX: 0000000000000006 RBX: 0000000000000246 RCX: 0000000000000000
RDX: 0000000000004e20 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff81047d814500 R08: 00007fffec01f560 R09: ffffc200103a5fa0
R10: 0000000000000008 R11: ffffc200103a5f78 R12: ffff81047d814000
R13: 0000000000000003 R14: 0000000000000040 R15: 0000000000000001
FS:  00002ab1bea98bf0(0000) GS:ffff81010fc99440(0000) knlGS:00000000f5abab90
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003727615160 CR3: 00000001e99e9000 CR4: 00000000000006e0
Process syslog-ng (pid: 8213, threadinfo ffff8105fc33c000, task ffff81087dad7820)
Stack:  ffff81000900ea64 000000007d814000 ffff81047d814000 ffff81048e18ea00
 ffff81000900f2a0 ffff81000900f280 0000000100754969 ffffffff8000c5bc
 ffffffff88342862 0000012cfc33df58 0000000000000046 0000000000000001
Call Trace:
 <IRQ>  [<ffffffff8000c5bc>] net_rx_action+0xa4/0x1a4
 [<ffffffff88342862>] :e1000e:e1000_intr_msi+0xd2/0xdc
 [<ffffffff80011fbc>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff8006c962>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> 

Code: 0f 0b 68 5d 67 34 88 c2 99 03 49 8d bc 24 80 01 00 00 e8 5a 
RIP  [<ffffffff88342b42>] :e1000e:e1000_clean+0x20e/0x2ae
 RSP <ffff81010fcf7ec0>
crash> bt
PID: 8213   TASK: ffff81087dad7820  CPU: 2   COMMAND: "syslog-ng"
 #0 [ffff81010fcf7c20] crash_kexec at ffffffff800aaa8f
 #1 [ffff81010fcf7ce0] __die at ffffffff8006520f
 #2 [ffff81010fcf7d20] die at ffffffff8006bc17
 #3 [ffff81010fcf7d50] do_invalid_op at ffffffff8006c1d7
 #4 [ffff81010fcf7e10] error_exit at ffffffff8005dde9
    [exception RIP: e1000_clean+526]
    RIP: ffffffff88342b42  RSP: ffff81010fcf7ec0  RFLAGS: 00010046
    RAX: 0000000000000006  RBX: 0000000000000246  RCX: 0000000000000000
    RDX: 0000000000004e20  RSI: 0000000000000001  RDI: 0000000000000000
    RBP: ffff81047d814500   R8: 00007fffec01f560   R9: ffffc200103a5fa0
    R10: 0000000000000008  R11: ffffc200103a5f78  R12: ffff81047d814000
    R13: 0000000000000003  R14: 0000000000000040  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #5 [ffff81010fcf7ef8] net_rx_action at ffffffff8000c5bc
 #6 [ffff81010fcf7f38] __do_softirq at ffffffff80011fbc
 #7 [ffff81010fcf7f68] call_softirq at ffffffff8005e2fc
 #8 [ffff81010fcf7f80] do_softirq at ffffffff8006cada
 #9 [ffff81010fcf7f90] do_IRQ at ffffffff8006c962
--- <IRQ stack> ---
#10 [ffff8105fc33df58] ret_from_intr at ffffffff8005d615
    RIP: 0000003724d05eaf  RSP: 00007fffec01f4c0  RFLAGS: 00000217
    RAX: 0000000000000000  RBX: 00007fffec01f680  RCX: 0000003724c994c7
    RDX: 00007fffec02169f  RSI: 0000000000000000  RDI: 0000000000000011
    RBP: 0000000000000000   R8: 00007fffec01f560   R9: 00007fffec01f4c0
    R10: 0000000000000008  R11: 0000000000000212  R12: ffffffff8009500a
    R13: 00007fffec01f680  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffff25  CS: 0033  SS: 002b

If you need more info from crash, I can provide it.

Regards

Comment 1 Marcus Alves Grando 2009-05-19 20:06:55 UTC
Maybe it's the same problem as AS4 bz443034?

Regards

Comment 2 Andy Gospodarek 2009-05-19 21:18:18 UTC
The bug-halt being hit is include/linux/netdevice.h:921

 916 static inline void netif_rx_complete(struct net_device *dev)
 917 {
 918         unsigned long flags;
 919 
 920         local_irq_save(flags);
 921         BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
 922         list_del(&dev->poll_list);
 923         smp_mb__before_clear_bit();
 924         clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
 925         local_irq_restore(flags);
 926 }

This panic happens when we find ourselves polling a netdev that is not on the poll list (or, more specifically, one that does not have the __LINK_STATE_RX_SCHED bit set) at the point netif_rx_complete is called.

This should be quite rare (and I think it is only a problem when netconsole is loaded), but I think it can happen like this:

CPU0                                 CPU1
----                                 ----
do_IRQ                               netpoll_send_skb
  do_softirq                           netpoll_poll
    call_softirq                         poll_napi
      __do_softirq                         spin_trylock(poll_lock)
        net_rx_action                        dev->poll (e1000_clean)
          spin_lock(poll_lock)                 netif_rx_complete
                                           spin_unlock(poll_lock)
          dev->poll (e1000_clean)
            netif_rx_complete
              BUG!

Because this is on the receive path and is a bug-halt rather than a crash, I suspect bug 443034 is not related.

How easy is this to reproduce?  I've seen similar problems on older versions of e1000, but I was pretty sure this had been resolved by now.  Have you tried to reproduce this without using netconsole?

Comment 3 Marcus Alves Grando 2009-05-20 01:57:23 UTC
Yes, Andy. It's easily reproducible. I just start Oracle+ocfs2+netconsole on five nodes and reboot one of them.

Without netconsole I can't reproduce this.

# lspci 
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.1 IDE interface: Broadcom BCM5785 [HT1000] IDE
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:07.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:08.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:09.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:0a.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:0b.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:01.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:02.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:03.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
03:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
07:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
08:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev c0)
08:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
0a:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0a:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0c:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)
0c:00.1 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)

Comment 4 Marcus Alves Grando 2009-05-22 12:43:44 UTC
Andy,

Is there any news on this?

Best regards

Comment 5 Andy Gospodarek 2009-05-26 20:19:50 UTC
Marcus, I do not have a fix yet.  I have a few ideas, but it will take some time before I can get to this one.  I will post here when I have a patch or test kernels available.

Comment 6 Andy Gospodarek 2009-05-26 20:31:32 UTC
Created attachment 345515
netpoll-napi-fix.patch

This patch may help.  I still need to check one more spot to be sure quota is set correctly, but I think it is.

I have not tested this, but this should be what we need.

Comment 9 Marcus Alves Grando 2009-05-27 21:23:25 UTC
(In reply to comment #6)
> Created an attachment (id=345515)
> netpoll-napi-fix.patch
> 
> This patch may help.  I still need to check one more spot to be sure quota is
> set correctly, but I think it is.
> 
> I have not tested this, but this should be what we need.  

The first tests work fine. I think that's it. I'll run more tests and let you know if I find a problem.

Best regards.

Comment 10 Marcus Alves Grando 2009-06-12 14:18:09 UTC
Andy,

When can I expect a new kernel? We need this in order to run an OS supported by RH.

Best regards

Comment 11 Andy Gospodarek 2009-06-12 14:41:20 UTC
Marcus, I will try and add something to my test kernels this week.

I will check the schedule and see when I can get it into an official build.

Comment 12 Marcus Alves Grando 2009-08-03 19:35:23 UTC
Andy,

Did you manage to include this patch in the next kernel release?

Regards

Comment 13 Andy Gospodarek 2009-08-11 17:33:20 UTC
A patch to address this has been included in the latest RHEL5 development kernel, so this will be fixed when RHEL5.4 ships.

*** This bug has been marked as a duplicate of bug 511918 ***

