Bug 501578
Summary: Kernel BUG at include/linux/netdevice.h:921 - :e1000e:e1000_intr_msi+0xd2/0xdc
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: low
Target Milestone: rc
Reporter: Marcus Alves Grando <marcus>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: nhorman, peterm
Doc Type: Bug Fix
Last Closed: 2009-08-11 17:33:20 UTC

Description
Marcus Alves Grando
2009-05-19 20:03:13 UTC
Maybe it's the same problem as AS4 bz443034?

Regards

The bug-halt being hit is include/linux/netdevice.h:921

    916 static inline void netif_rx_complete(struct net_device *dev)
    917 {
    918         unsigned long flags;
    919
    920         local_irq_save(flags);
    921         BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
    922         list_del(&dev->poll_list);
    923         smp_mb__before_clear_bit();
    924         clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
    925         local_irq_restore(flags);
    926 }

This panic happens specifically when we find ourselves polling on a netdev that is not on the poll list (or, more specifically, one that doesn't have the __LINK_STATE_RX_SCHED bit set) when netif_rx_complete is called. It is quite rare that this can happen (and I think it is only a problem when netconsole is loaded), but I think it can happen like this:

    CPU0                        CPU1
    ----                        ----
    do_IRQ                      netpoll_send_skb
    do_softirq                  netpoll_poll
    call_softirq                poll_napi
    __do_softirq                spin_trylock(poll_lock)
    net_rx_action               dev->poll (e1000_clean)
    spin_lock(poll_lock)        netif_rx_complete
                                spin_unlock(poll_lock)
    dev->poll (e1000_clean)
    netif_rx_complete
    BUG!

Because this is on the receive path and is a bug-halt rather than a crash, I suspect bug 443034 is not related.

How easy is this to reproduce? I've seen similar problems on older versions of e1000, but I was pretty sure this had been resolved by now. Have you tried to reproduce this without using netconsole?

Yes Andy. It's easily reproducible. I just start Oracle+ocfs2+netconsole on five nodes and reboot one of them. Without netconsole I can't reproduce this.

# lspci
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.1 IDE interface: Broadcom BCM5785 [HT1000] IDE
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:07.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:08.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:09.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:0a.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:0b.0 PCI bridge: Broadcom HT2100 PCI-Express Bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:01.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:02.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
02:03.0 PCI bridge: PLX Technology, Inc. PEX 8518 16-lane, 5-port PCI Express Switch (rev ac)
03:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
07:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
08:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev c0)
08:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
0a:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0a:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0c:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)
0c:00.1 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)

Andy,

Something new about this?

Best regards

Marcus,

I do not have a fix yet. I have a few ideas, but it will take some time before I can get to this one. I will post here when I have a patch or test kernels available.

Created attachment 345515
netpoll-napi-fix.patch
This patch may help. I still need to check one more spot to be sure quota is set correctly, but I think it is.
I have not tested this, but this should be what we need.

(In reply to comment #6)
> Created an attachment (id=345515)
> netpoll-napi-fix.patch
>
> This patch may help. I still need to check one more spot to be sure quota is
> set correctly, but I think it is.
>
> I have not tested this, but this should be what we need.

The first tests work fine. I think that's it. I'll run more tests and let you know if I find a problem.

Best regards.

Andy,

When can I expect a new kernel? We need this in order to run an OS supported by RH.

Best regards

Marcus,

I will try and add something to my test kernels this week. I will check the schedule and see when I can get it into an official build.

Andy,

Did you try to include this patch in the next kernel release?

Regards

A patch to address this has been included in the latest RHEL5 development kernel, so this will be fixed when RHEL5.4 ships.

*** This bug has been marked as a duplicate of bug 511918 ***