We have a crash reported with the following Oops message:

virbr1: port 2(vnet0) entering forwarding state
vnet0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
 [<ffffffff8844d44a>] :bridge:br_nf_pre_routing_finish+0x20/0x301
PGD 1ee6a3b067 PUD 1ee6a79067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/lo/ifindex
CPU 0
Modules linked in: tun ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables autofs4 nfs fscache nfs_acl lockd sunrpc 8021q bridge bonding ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) hpilo e1000e tg3 bnx2 shpchp pcspkr serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G 2.6.18-194.8.1.el5 #1
RIP: 0010:[<ffffffff8844d44a>] [<ffffffff8844d44a>] :bridge:br_nf_pre_routing_finish+0x20/0x301
RSP: 0018:ffffffff80446c20 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff811ecd16b820 RCX: ffffffff80000000
RDX: ffff812024811ec0 RSI: ffff81202f900ac0 RDI: ffff811ee5beb280
RBP: 0000000000000000 R08: ffff811ee5beb280 R09: ffffffff8844d42a
R10: 0000000080000000 R11: ffffffff88579a89 R12: ffff81202a2dd000
R13: 0000000000000000 R14: ffff81202d58e000 R15: ffffffff804fa8c0
FS: 0000000000000000(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 000000202552b000 CR4: 00000000000026e0
Process swapper (pid: 0, threadinfo ffffffff803fa000, task ffffffff80308b60)
Stack: ffffffff80446cd0 ffff811ee5beb280 ffffffff8844d42a 0000000000000000
 ffff81202d58e000 ffffffff800568a0 ffffffff8844d42a 0000000080000000
 0000000229bd0380 ffff81202f4e8000 ffffffff804fa640 ffff81202f900ac0
Call Trace:
 <IRQ> [<ffffffff8844d42a>] :bridge:br_nf_pre_routing_finish+0x0/0x301
 [<ffffffff800568a0>] nf_hook_slow+0x58/0xbc
 [<ffffffff8844d42a>] :bridge:br_nf_pre_routing_finish+0x0/0x301
 [<ffffffff8844e316>] :bridge:br_nf_pre_routing+0x604/0x622
 [<ffffffff80034064>] nf_iterate+0x41/0x7d
 [<ffffffff8844992a>] :bridge:br_handle_frame_finish+0x0/0xf8
 [<ffffffff800568a0>] nf_hook_slow+0x58/0xbc
 [<ffffffff8844992a>] :bridge:br_handle_frame_finish+0x0/0xf8
 [<ffffffff88449b90>] :bridge:br_handle_frame+0x16e/0x1a2
 [<ffffffff800208b8>] netif_receive_skb+0x383/0x49f
 [<ffffffff88214f7c>] :tg3:tg3_poll+0x826/0xd96
 [<ffffffff8001320d>] rb_insert_color+0xb2/0xda
 [<ffffffff8000c88c>] net_rx_action+0xac/0x1e0
 [<ffffffff88211598>] :tg3:tg3_msi+0x59/0x62
 [<ffffffff800123b4>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb8e>] do_softirq+0x2c/0x85
 [<ffffffff8006ca16>] do_IRQ+0xec/0xf5
 [<ffffffff80056f62>] mwait_idle+0x0/0x4a
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff881e4a01>] :kvm_intel:vmx_interrupt_allowed+0x0/0x11
 [<ffffffff80056f98>] mwait_idle+0x36/0x4a
 [<ffffffff80049150>] cpu_idle+0x95/0xb8
 [<ffffffff80405807>] start_kernel+0x220/0x225
 [<ffffffff8040522f>] _sinittext+0x22f/0x236
Code: f6 45 20 01 74 16 8a 87 9d 00 00 00 83 e0 f8 83 c8 03 88 87
RIP [<ffffffff8844d44a>] :bridge:br_nf_pre_routing_finish+0x20/0x301
RSP <ffffffff80446c20>
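A note on reading the Oops: the faulting address can be cross-checked against the Code: bytes. The sketch below is only an illustration, and it assumes the Code: line starts at the faulting RIP (which is how this kernel generation appears to print it). It hand-decodes just the first instruction, using the register values from the dump; `decode_test_imm8` is a hypothetical helper written for this note, not a general disassembler.

```python
# Values copied from the Oops above.
CODE = bytes.fromhex("f6 45 20 01 74 16 8a 87 9d 00 00 00".replace(" ", ""))
CR2 = 0x20
REGS = {"rax": 0x0, "rbp": 0x0, "rdi": 0xFFFF811EE5BEB280}

# Register numbering for the ModRM r/m field when no REX prefix is present.
REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_test_imm8(code):
    """Decode `f6 /0 ib` with a disp8 operand: test byte [reg+disp8], imm8."""
    opcode, modrm = code[0], code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # Only the simple form is handled: no SIB byte (rm == 4), 8-bit displacement.
    if opcode != 0xF6 or reg != 0 or mod != 1 or rm == 4:
        raise ValueError("not the simple 'test byte [reg+disp8], imm8' form")
    disp = code[2] - 256 if code[2] > 127 else code[2]  # disp8 is signed
    return REG_NAMES[rm], disp, code[3]


base, disp, imm = decode_test_imm8(CODE)
fault_addr = (REGS[base] + disp) & 0xFFFFFFFFFFFFFFFF
print(f"test byte [{base}+{disp:#x}], {imm:#x} -> touches {fault_addr:#x}")
print("matches CR2:", fault_addr == CR2)
```

With RBP equal to zero, the computed address equals CR2, which is consistent with a NULL structure pointer being read at offset 0x20. A later comment in this report identifies that pointer as nf_bridge.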
I think this is fixed by upstream commit e94c67436efa22af7d8b7d19c885863246042543. I'll work on a backport in the morning. To answer your question: turning off that sysctl causes any iptables rules attached to a bridge interface to be skipped with an NF_ACCEPT verdict. In other words, by setting that value to 0 you bypass all iptables rules for frames arriving into the system via br0. I'll backport your patch in the morning and have something for you to test soon.
I'm sorry, I take that last comment back. I misread; that doesn't fix this.
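For anyone who wants to check the sysctl state discussed above, here is a minimal sketch. The thread never names the sysctl, so the path used here is an assumption: `/proc/sys/net/bridge/bridge-nf-call-iptables` is the usual bridge-netfilter knob for iptables, but it may not be the one that was meant. `parse_flag` and `read_flag` are hypothetical helper names.

```python
from pathlib import Path

# ASSUMPTION: the thread does not name the sysctl. This is the usual
# bridge-netfilter knob for iptables and only a guess at what was meant.
BRIDGE_NF_CALL_IPTABLES = Path("/proc/sys/net/bridge/bridge-nf-call-iptables")


def parse_flag(raw):
    """Return True if the sysctl text means bridged frames traverse iptables."""
    return int(raw.strip()) != 0


def read_flag(path=BRIDGE_NF_CALL_IPTABLES):
    """Return the flag as a bool, or None if the file cannot be read or parsed."""
    try:
        return parse_flag(Path(path).read_text())
    except (OSError, ValueError):
        return None


state = read_flag()
if state is None:
    print("sysctl not available (bridge module not loaded?)")
elif state:
    print("bridged frames are passed to iptables")
else:
    print("bridged frames bypass iptables (accepted without rule evaluation)")
```

Setting the value to 0 trades the crash for the loss of iptables filtering on bridged frames, as described above, so it is a workaround rather than a fix.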
Hey, while I'm looking at this, I note that you have the e1000e and bnx2 modules loaded on this system. Can you tell me if GRO or LRO is enabled on any of those interfaces? If it is, please turn it off via the available module options for those modules and see if the problem still reproduces.
OK, I think this might be an upstream bug. It appears that a bridge fragment has been reassembled with a non-bridge fragment, causing the bridge code to see a packet with nf_bridge == NULL. Again, this just goes to show how the current bridge netfilter is broken by design. I will take this to netdev.
Yes, one fragment came in on eth6 and one on eth7. Please confirm this by giving us the network topology of the box.
Also, here's a debug kernel build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2620776
If the LRO change doesn't solve the problem, you can run this. It should cause the kernel to panic if any context attempts to free or NULL the nf_bridge pointer of an skb while we're traversing the iptables rules in that path. br_nf_pre_routing calls skb_share_check, which should ensure that we are the only user of this skb. My guess is that something in the iptables code is returning accept on that skb while at the same time using it for something else, resulting in a NULL nf_bridge. This patch should point more directly to the culprit. Please try the LRO settings first, and if the problem continues, move on to this build. Thanks!
Herbert, you seem to have figured this out way ahead of me. I'll just pass this over to you. Thanks!
commit 8fa9ff6849bb86c59cc2ea9faadf3cb2d5223497
Author: Patrick McHardy <kaber>
Date: Tue Dec 15 16:59:59 2009 +0100

    netfilter: fix crashes in bridge netfilter caused by fragment jumps

should fix the problem. We'll need to backport it.
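To make the fragment mix-up described earlier in this report concrete, here is a toy model. It is not kernel code. It assumes a datagram split into two fragments, and it assumes the reassembled packet keeps the metadata of the offset-0 fragment while continuing on the code path of the fragment that completed the queue. As I read the upstream commit, the fix keeps fragments coming from bridge netfilter in their own reassembly queues; the `separate_bridge_queues` flag stands in for that idea. All names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fragment:
    ip_id: int
    offset: int
    more_fragments: bool
    path: str                  # "bridge" or "routed": code path the fragment arrived on
    nf_bridge: Optional[str]   # stand-in for skb->nf_bridge; None off the bridge path


def reassemble(fragments, separate_bridge_queues):
    """Yield (path, nf_bridge) for each datagram that completes.

    Toy rule: the result keeps the metadata of the offset-0 fragment but
    continues on the code path of the fragment that completed the queue.
    """
    queues = {}
    for frag in fragments:
        key = (frag.ip_id,)
        if separate_bridge_queues:
            key += (frag.path,)  # keep bridge and non-bridge fragments apart
        queue = queues.setdefault(key, [])
        queue.append(frag)
        head = next((f for f in queue if f.offset == 0), None)
        has_tail = any(not f.more_fragments for f in queue)
        if head is not None and has_tail:
            del queues[key]
            yield frag.path, head.nf_bridge


def bridge_hook_would_crash(path, nf_bridge):
    # The bridge hook reads through nf_bridge and does not expect None.
    return path == "bridge" and nf_bridge is None


# Made-up traffic: the first fragment arrives on a non-bridge path,
# the last fragment of the same datagram arrives via the bridge.
arrivals = [
    Fragment(ip_id=7, offset=0, more_fragments=True, path="routed", nf_bridge=None),
    Fragment(ip_id=7, offset=1480, more_fragments=False, path="bridge", nf_bridge="info"),
]

shared = list(reassemble(arrivals, separate_bridge_queues=False))
split = list(reassemble(arrivals, separate_bridge_queues=True))
print("shared queues:", shared, [bridge_hook_would_crash(*p) for p in shared])
print("split queues: ", split, [bridge_hook_would_crash(*p) for p in split])
```

With shared queues the model hands the bridge path a packet whose nf_bridge is None, which is the situation seen in the Oops. With separate queues the two fragments never meet, so nothing is reassembled from the mixed pair.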
Created attachment 437645: first proposed patch
Created attachment 438988: second proposed patch
I have what seems to be the same kernel oops on an x86_64 VM server running on a Dell PowerEdge R710. The machine seems to oops every couple of days since I upgraded to the 5.5 kernels. I am also a CentOS Developer and I have rebuilt the 2.6.18-194.11.1.el5 x86_64 kernel with the patch from comment #30. I will provide feedback in a couple of days as to whether or not I continue to get the kernel oops.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The fix is in kernel-2.6.18-219.el5.
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcome.
I have not tested kernel-2.6.18-219.el5, but I did want to report that I have had no kernel oops since adding the patch in comment #30 to 2.6.18-194.11.1.el5 on our Dell PowerEdge R710.
I have encountered the same problem on an IBM BladeCenter HS22. It happened a few times even when the system was seemingly not doing anything in particular, so we can reproduce this very easily. With the workaround from the Red Hat Knowledgebase at https://access.redhat.com/kb/docs/DOC-44616 applied, we have not seen a kernel panic yet. More information when it is available :-)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html