Description of problem:
A NAT server running kernel-2.6.25.3-18.fc9.x86_64 panics frequently under high CPU usage.

Hardware:
Dell 1950 server, Intel 5130 @ 2.00GHz, 2 CPUs (4 cores total), 4G RAM, 4 Broadcom 5708 copper NICs (2 onboard, 2 add-in)

Software:
Fedora 9 x86_64, Linux NAT server with kernel-2.6.25.3-18.fc9.x86_64

The system boots with the following errors in dmesg:

======BEGIN of startup dmesg bug======
IRQ handler type mismatch for IRQ 0
current handler: timer
Pid: 1, comm: swapper Not tainted 2.6.25.3-18.fc9.x86_64 #1

Call Trace:
 [<ffffffff81072608>] setup_irq+0x1f0/0x20d
 [<ffffffff8114167d>] ? aer_irq+0x0/0x113
 [<ffffffff810726ea>] request_irq+0xc5/0xee
 [<ffffffff8128630d>] aer_probe+0xc2/0x13c
 [<ffffffff8113fc17>] pcie_port_probe_service+0x3a/0x75
 [<ffffffff811aba99>] driver_probe_device+0xc0/0x16e
 [<ffffffff811abbda>] __driver_attach+0x93/0xd3
 [<ffffffff811abb47>] ? __driver_attach+0x0/0xd3
 [<ffffffff811ab2b6>] bus_for_each_dev+0x4f/0x89
 [<ffffffff8112d476>] ? kobject_get+0x1a/0x22
 [<ffffffff811ab8e4>] driver_attach+0x1c/0x1e
 [<ffffffff811aab2d>] bus_add_driver+0xb7/0x200
 [<ffffffff811abda3>] driver_register+0x5e/0xde
 [<ffffffff8113fb83>] pcie_port_service_register+0x47/0x49
 [<ffffffff8147e5c7>] aer_service_init+0x1e/0x20
 [<ffffffff81463506>] kernel_init+0x1f8/0x368
 [<ffffffff8100ccf8>] child_rip+0xa/0x12
 [<ffffffff8146330e>] ? kernel_init+0x0/0x368
 [<ffffffff8100ccee>] ? child_rip+0x0/0x12

aer_probe: Request ISR fails on PCIE device[0000:00:05.0:pcie01]
aer: probe of 0000:00:05.0:pcie01 failed with error -16
Load service driver aer on pcie device 0000:00:06.0:pcie01
IRQ handler type mismatch for IRQ 0
current handler: timer
Pid: 1, comm: swapper Not tainted 2.6.25.3-18.fc9.x86_64 #1

Call Trace:
 [<ffffffff81072608>] setup_irq+0x1f0/0x20d
 [<ffffffff8114167d>] ? aer_irq+0x0/0x113
 [<ffffffff810726ea>] request_irq+0xc5/0xee
 [<ffffffff8128630d>] aer_probe+0xc2/0x13c
 [<ffffffff8113fc17>] pcie_port_probe_service+0x3a/0x75
 [<ffffffff811aba99>] driver_probe_device+0xc0/0x16e
 [<ffffffff811abbda>] __driver_attach+0x93/0xd3
 [<ffffffff811abb47>] ? __driver_attach+0x0/0xd3
 [<ffffffff811ab2b6>] bus_for_each_dev+0x4f/0x89
 [<ffffffff8112d476>] ? kobject_get+0x1a/0x22
 [<ffffffff811ab8e4>] driver_attach+0x1c/0x1e
 [<ffffffff811aab2d>] bus_add_driver+0xb7/0x200
 [<ffffffff811abda3>] driver_register+0x5e/0xde
 [<ffffffff8113fb83>] pcie_port_service_register+0x47/0x49
 [<ffffffff8147e5c7>] aer_service_init+0x1e/0x20
 [<ffffffff81463506>] kernel_init+0x1f8/0x368
 [<ffffffff8100ccf8>] child_rip+0xa/0x12
 [<ffffffff8146330e>] ? kernel_init+0x0/0x368
 [<ffffffff8100ccee>] ? child_rip+0x0/0x12

aer_probe: Request ISR fails on PCIE device[0000:00:07.0:pcie01]
aer: probe of 0000:00:07.0:pcie01 failed with error -16
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
=======END of startup dmesg bug=============

Under heavy network traffic the machine sometimes panics with the following (the top of the panic dump scrolled off the screen):

=======Begin of dump on NAT running system===========
 [<ffffffff8100d06c>] call_softirq+0x1c/0x28
 [<ffffffff8100e210>] do_softirq+0x34/0x72
 [<ffffffff81039304>] irq_exit+0x3f/0x80
 [<ffffffff8100e4dd>] do_IRQ+0x145/0x167
 [<ffffffff8100a053>] ? mwait_idle+0x0/0x45
 [<ffffffff8100afea>] ? default_idle+0x0/0x5f
 [<ffffffff8100c3f1>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff8100a093>] ? mwait_idle+0x40/0x45
 [<ffffffff8100af28>] ? enter_idle+0x22/0x24
 [<ffffffff8100afa2>] ? cpu_idle+0x78/0xc0
 [<ffffffff81282726>] ? rest_init+0x5a/0x5c

Code: 20 85 c0 0f 85 49 03 00 00 48 8b 45 c8 48 8b 00 48 89 45 c8 48 8b 45 c8 48 85 c0 74 32 48 8b 45 c8 48 8b 10 0f 18 0a 48 8b 50 20 <8a> 42 3e 3a 45 c6 75 d6 e9 43 ff ff ff 48 8d bd 70 ff ff ff 4c
RIP [<ffffffff8821d8dd>] :nf_nat:nf_nat_setup_info+0x206/0x570
 RSP <ffffffff814c17f0>
CR2: 000000000000003e
BUG: scheduling while atomic: swapper/0/0x10000200
CPU 0:
Modules linked in: nf_nat_ftp xt_conntrack nf_conntrack_ftp fuse sunrpc xt_MARK iptable_mangle iptable_nat nf_nat ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 loop dm_multipath dcdbas pcspkr serio_raw iTCO_wdt iTCO_vendor_support ses enclosure bnx2 sg button i5000_edac edac_core sr_mod cdrom ata_piix libata dm_snapshot dm_zero dm_mirror dm_mod shpchp megaraid_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 0, comm: swapper Tainted: G D 2.6.25.3-18.fc9.x86_64 #1
RIP: 0010:[<ffffffff8100a093>] [<ffffffff8100a093>] mwait_idle+0x40/0x45
RSP: 0018:ffffffff81459f28 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff81459f28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff813c10a0
RBP: ffff810001008c30 R08: 0000000000000000 R09: 0000000000000000
R10: ffff81000100c780 R11: ffffffff8148f3e0 R12: ffffffff8104aa80
R13: ffffffff81459ea8 R14: ffffffff81459eb8 R15: 00000000206f65cc
FS: 0000000000000000(0000) GS:ffffffff813f6000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000000005003b
CR2: 000000000000003e CR3: 0000000137928000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8100af28>] ? enter_idle+0x22/0x24
 [<ffffffff8100afa2>] ? cpu_idle+0x78/0xc0
 [<ffffffff81282726>] ? rest_init+0x5a/0x5c

---[ end trace 6ccfd7ee7354d3ad ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
=======END of dump on NAT running system===========

Workaround:
After running kernel-2.6.24.7-92.fc8.x86_64 for two days, the error has not occurred.

Version-Release number of selected component (if applicable):
kernel-2.6.25.3-18.fc9

How reproducible:
Unknown. Heavy network traffic or a DDoS may reproduce it.

Steps to Reproduce:
1. The startup trace appears on every boot.

Actual results:
System stack dump.

Expected results:
No stack dump during either start-up or NAT operation.

Additional info:
I followed the symbol in kernel-debuginfo. The system halts in /net/ipv4/netfilter/nf_nat_core.c, in same_src (inlined into find_appropriate_src). Diffing nf_nat_core.c between 2.6.24 and 2.6.25, I found this change in find_appropriate_src:

===========
--- linux-2.6.24/net/ipv4/netfilter/nf_nat_core.c 2008-01-25 06:58:37.000000000 +0800
+++ linux-2.6.25.4/net/ipv4/netfilter/nf_nat_core.c 2008-05-15 23:00:12.000000000 +0800
@@ -150,8 +154,8 @@ find_appropriate_src(const struct nf_con
 	struct nf_conn *ct;
 	struct hlist_node *n;
-	read_lock_bh(&nf_nat_lock);
-	hlist_for_each_entry(nat, n, &bysource[h], bysource) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(nat, n, &bysource[h], bysource) {
 		ct = nat->ct;
 		if (same_src(ct, tuple)) {
@@ -160,12 +164,12 @@ find_appropriate_src(const struct nf_con
 			result->dst = tuple->dst;
 			if (in_range(result, range)) {
-				read_unlock_bh(&nf_nat_lock);
+				rcu_read_unlock();
 				return 1;
 			}
 		}
 	}
-	read_unlock_bh(&nf_nat_lock);
+	rcu_read_unlock();
 	return 0;
 }
===========

In 2.6.24 the code disables softirqs (read_lock_bh), but in 2.6.25 it only disables preemption (rcu_read_lock). According to the kernel documentation, rcu_read_lock_bh can prevent some DDoS attacks.
I can confirm this on an HP DL180 G5 with a Xeon E5420, a budget HP server with an Intel chipset. I can attach the dmesg if needed.
The first two traces are an entirely unrelated problem with PCI AER (Advanced Error Reporting). They should be harmless.
same_src() is called with a bad value for ct in line 160.
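The Code: bytes in the oops appear to back this up. The faulting instruction <8a> 42 3e decodes to mov 0x3e(%rdx),%al, the instruction before it (48 8b 50 20) is mov 0x20(%rax),%rdx, and CR2 is 0x3e, so %rdx must have been 0: a NULL pointer was loaded from offset 0x20 of the list entry and a byte was then read at offset 0x3e from it. Below is a minimal userspace sketch of that arithmetic. struct fake_nat, struct fake_conn and fault_address() are hypothetical stand-ins; only the two offsets are taken from the oops, and the padding does not reflect the real kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the conntrack entry. Only the offset of the
 * byte being read (0x3e) comes from the oops; the padding is made up. */
struct fake_conn {
    unsigned char pad[0x3e];
    unsigned char protonum;   /* byte read by "mov 0x3e(%rdx),%al" */
};

/* Hypothetical stand-in for the NAT list entry. The ct pointer sits at
 * offset 0x20, matching "mov 0x20(%rax),%rdx". */
struct fake_nat {
    unsigned char pad[0x20];
    struct fake_conn *ct;
};

/* Address the CPU would touch when reading ct->protonum.
 * Computed arithmetically, so nothing is dereferenced. */
static uintptr_t fault_address(const struct fake_conn *ct)
{
    return (uintptr_t)ct + offsetof(struct fake_conn, protonum);
}
```

With ct == NULL, fault_address() gives 0x3e, the CR2 value in the dump, which is consistent with nat->ct being NULL at the moment same_src() read from it.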
Created attachment 308612: Proposed patch from Patrick McHardy
Is this fixed in the latest update kernel?
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
All fixed. Thanks all.