481076 – kernel BUG at net/ipv4/netfilter/ip_nat_core.c:308

Summary: kernel BUG at net/ipv4/netfilter/ip_nat_core.c:308

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Herbert Xu
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-01-22 01:48 UTC by Shad L. Lords
Modified:	2009-09-02 08:57 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 08:57:41 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
[NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks (2.36 KB, patch) 2009-02-10 05:51 UTC, Herbert Xu
[NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks (3.28 KB, patch) 2009-02-10 05:59 UTC, Herbert Xu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description Shad L. Lords 2009-01-22 01:48:17 UTC

Description of problem:

About 30% of the time that I reboot a cluster node, the node crashes and reboots. Sometimes it goes through the crash/reboot cycle 2-3 times before it stabilizes.

Version-Release number of selected component (if applicable):

[root@xen32-4 ~]# rpm -q kernel-xen iptables cman xen
kernel-xen-2.6.18-92.1.18.el5
kernel-xen-2.6.18-92.1.22.el5
iptables-1.3.5-4.el5
cman-2.0.84-2.el5_2.3
xen-3.0.3-64.el5_2.9
[root@xen32-4 ~]# uname -r
2.6.18-92.1.22.el5xen


How reproducible:

About 30% of the time

Steps to Reproduce:
1. Set up a cluster (currently 5 nodes: 3 x i386, 2 x x86_64)
2. Reboot one of the i386 nodes
  
Actual results:

Node crashes upon joining the cluster

Expected results:

Node reboots without issues

Additional info:

Here is the last part of the serial console before it crashes:

Starting anacron: [  OK  ]
Starting libvirtd daemon: [  OK  ]
virbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
Starting yum-updatesd: [  OK  ]
Starting HAL daemon: ------------[ cut here ]------------
kernel BUG at net/ipv4/netfilter/ip_nat_core.c:308!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:0c.0/class
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat ipt_REJECT autofs4 ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler gfs(U) lock_dlm gfs2 dlm configfs bridge netloop netbk sunrpc dm_round_robin ip_conntrack_netbios_ns xt_tcpudp xt_state ip_conntrack nfnetlink xt_multiport iptable_filter ip_tables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath raid0 video sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport floppy st ata_piix e7xxx_edac edac_mc sg libata serio_raw i2c_i801 i2c_core pcspkr e1000 dm_snapshot dm_zero dm_mirror dm_mod aic79xx scsi_transport_spi sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    2
EIP:    0061:[<ee47b2f9>]    Tainted: G      VLI
EFLAGS: 00010202   (2.6.18-92.1.22.el5xen #1)
EIP is at ip_nat_setup_info+0x3f/0x47a [ip_nat]
eax: 00000001   ebx: c035f060   ecx: 00000000   edx: c071fe01
esi: 00000000   edi: c071fed4   ebp: c035f060   esp: c071fdc8
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, ti=c071f000 task=c7746000 task.ti=c12a5000)
Stack: ed0e007b c067007b c071fe24 00000108 00000001 ed7c2000 c035f06c ed7c2000
       c04291e3 00000000 00000000 c035f06c c035f060 00000000 00001d4c ee321cb1
       00000001 ed5ed380 c035f060 00000000 c071fed4 00000002 ee4150a6 00000001
Call Trace:
 [<c04291e3>] __mod_timer+0x99/0xa3
 [<ee321cb1>] __ip_ct_refresh_acct+0xf6/0x129 [ip_conntrack]
 [<ee4150a6>] alloc_null_binding_confirmed+0x4c/0x51 [iptable_nat]
 [<ee4153dd>] ip_nat_fn+0x131/0x185 [iptable_nat]
 [<ee415481>] ip_nat_in+0x1c/0x7b [iptable_nat]
5c9034>] ip_rcv_finish+0x0/0x280
 [<c05c43c4>] nf_iterate+0x30/0x61
 [<c05c9034>] ip_rcv_fini44ea>] nf_hook_slow+0x3a/0x90
 [<c05c9034>] ip_rcv_finish+0x0/0x280
 [<c05c970b>] ip_rcv+0x2034>] ip_rcv_finish+0x0/0x280
 [<c05af046>] netif_receive_skb+0x2cd/0x341
 [<ee39d666>] br_pas[bridge]
 [<ee39d70e>] br_handle_frame_finish+0xa6/0xcf [bridge]
 [<ee39d87d>] br_handle_fram]
 [<c05aefc7>] netif_receive_skb+0x24e/0x341
 [<ee1591c3>] e1000_clean_rx_irq+0x399/0x473 [ee1000_clean+0x6b/0x22e [e1000]
 [<c05b0a51>] net_rx_action+0x96/0x185
 [<c04260c6>] __do_soft06edf>] do_softirq+0x56/0xaf
 [<c0406e80>] do_IRQ+0xa5/0xae
 [<c0549faf>] evtchn_do_upcall+0x5d9>] hypervisor_callback+0x3d/0x48
 [<c0408632>] raw_safe_halt+0x8c/0xaf
 [<c040321a>] xen_i03339>] cpu_idle+0x91/0xab
 =======================
Code: 95 c2 89 44 24 0c 31 c0 83 f9 01 0f 10 75 08 8b 45 08 c1 e8 07 eb 06 8b 45 08 c1 e8 08 83 e0 01 85 c0 74 08 <0f> 0b 34 01 4e c6 4700 8d 44 24 38 e8 cc 7a
EIP: [<ee47b2f9>] ip_nat_setup_info+0x3f/0x47a [ip_nat] SS:ESP 0069:cnic - not syncing: Fatal exception in interrupt
 (XEN) Domain 0 crashed: rebooting machine in

Comment 1 Thomas Woerner 2009-01-23 15:53:10 UTC

This is a kernel problem, reassigning.

Comment 2 Herbert Xu 2009-02-09 06:30:16 UTC

Have you got bridge netfilter turned on (/proc/sys/net/bridge/bridge-nf-*)?

Comment 3 Shad L. Lords 2009-02-09 13:45:04 UTC

It appears that I do.  

[root@xen64-6 ~]# ll /proc/sys/net/bridge/bridge-nf-*
-rw-r--r-- 1 root root 0 Feb  9 06:43 /proc/sys/net/bridge/bridge-nf-call-arptables
-rw-r--r-- 1 root root 0 Feb  9 06:43 /proc/sys/net/bridge/bridge-nf-call-ip6tables
-rw-r--r-- 1 root root 0 Feb  9 06:43 /proc/sys/net/bridge/bridge-nf-call-iptables
-rw-r--r-- 1 root root 0 Feb  9 06:43 /proc/sys/net/bridge/bridge-nf-filter-vlan-tagged
[root@xen64-6 ~]# cat /proc/sys/net/bridge/bridge-nf-*
0
0
0
1
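
The four values above have to be matched to the four files by position. A minimal C sketch that prints each bridge netfilter knob next to its value is given here; the helper names `read_sysctl_flag` and `print_bridge_nf_flags` are made up for illustration, and only the standard `glob()` and stdio calls are used.

```c
#include <assert.h>
#include <glob.h>
#include <stdio.h>

/* Hypothetical helper: returns the integer stored in a sysctl-style file,
 * or -1 if the file cannot be opened or does not start with a number. */
static int read_sysctl_flag(const char *path)
{
    FILE *f;
    int value = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: prints every bridge-nf-* entry with its value and
 * returns how many entries were found (0 if the bridge module is absent). */
static int print_bridge_nf_flags(void)
{
    glob_t g;
    int found = 0;

    if (glob("/proc/sys/net/bridge/bridge-nf-*", 0, NULL, &g) != 0)
        return 0;
    for (size_t i = 0; i < g.gl_pathc; i++) {
        printf("%s = %d\n", g.gl_pathv[i], read_sysctl_flag(g.gl_pathv[i]));
        found++;
    }
    globfree(&g);
    return found;
}
```

Calling `print_bridge_nf_flags()` from `main` on the affected node would list one `path = value` line per knob, which avoids guessing which of the printed numbers belongs to bridge-nf-call-iptables.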

Comment 4 Herbert Xu 2009-02-10 05:19:05 UTC

Hmm, you reported the problem under 32-bit.  Have you seen this crash on xen64-6 as well?

Comment 5 Herbert Xu 2009-02-10 05:30:43 UTC

Never mind, I've found a problem in RHEL5 that can cause this.

Comment 6 Herbert Xu 2009-02-10 05:51:30 UTC

Created attachment 331395
[NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks

commit 8c87238b726e543f8af4bdb4296020a328df4744
Author: Patrick McHardy <kaber>
Date:   Mon Apr 14 11:15:51 2008 +0200

    [NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks

    Adding extensions to confirmed conntracks is not allowed to avoid races
    on reallocation. Don't setup NAT for confirmed conntracks in case NAT
    module is loaded late.

    This has one side-effect, the connections existing before the NAT module
    was loaded won't enter the bysource hash. The only case where this actually
    makes a difference is in case of SNAT to a multirange where the IP before
    NAT is also part of the range. Since old connections don't enter the
    bysource hash the first new connection from the IP will have a new address
    selected. This shouldn't matter at all.

    Signed-off-by: Patrick McHardy <kaber>
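
For readers who want the idea of the commit message in code form, here is a simplified userspace model in C. It is only a sketch under stated assumptions: the struct, the flag names and `nat_hook` are hypothetical and are not the kernel's actual data structures or the attached patch. It shows the rule the message describes: a conntrack that is already confirmed when the NAT code first sees it (because the NAT module was loaded late) is passed through untouched instead of having NAT state added to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status bits, loosely modelled on conntrack status flags. */
enum ct_status {
    CT_CONFIRMED = 1u << 0, /* entry is already live in the connection table */
    CT_NAT_DONE  = 1u << 1  /* NAT state has been set up for this entry */
};

/* Hypothetical, heavily reduced stand-in for a tracked connection. */
struct conn {
    unsigned int status;
    bool has_nat_ext;
};

enum nat_verdict { NAT_ACCEPT_UNTOUCHED, NAT_BINDING_ADDED };

/* Sketch of the rule from the commit message: never add NAT state to a
 * connection that was confirmed before NAT got a chance to see it. */
static enum nat_verdict nat_hook(struct conn *ct)
{
    assert(ct != NULL);

    /* NAT was already set up on an earlier packet: nothing to add. */
    if (ct->status & CT_NAT_DONE)
        return NAT_ACCEPT_UNTOUCHED;

    /* Confirmed without NAT state: the connection predates the NAT
     * module, so leave it alone rather than extend a live entry. */
    if (ct->status & CT_CONFIRMED)
        return NAT_ACCEPT_UNTOUCHED;

    /* New, unconfirmed connection: safe to attach NAT state. */
    ct->has_nat_ext = true;
    ct->status |= CT_NAT_DONE;
    return NAT_BINDING_ADDED;
}
```

In this model a connection created before the NAT module was loaded never gets `has_nat_ext` set, which corresponds to the side-effect noted in the commit message that such connections stay out of the bysource hash.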

Comment 7 Herbert Xu 2009-02-10 05:59:21 UTC

Created attachment 331396
[NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks

Comment 8 Shad L. Lords 2009-02-10 22:07:09 UTC

Not sure if this is the same thing but this is what I'm seeing from a 64-bit box.  It doesn't crash during the bootup process like the 32-bit box does.  It actually gets to a login prompt and will sit there for 5-10 seconds before crashing.

list_del corruption. next->prev should be ffff88001e3e7848, but was ffffc20000096e70
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP
last sysfs file: /block/hda/removable
CPU 0
Modules linked in: blktap blkbk ipt_MASQUERADE iptable_nat ip_nat ipt_REJECT autofs4 ipmi_devintf ipmi_si ipmi_msghandler gfs(U) lock_dlm gfs2(U) dlm configfs bridge netloop netbk sunrpc dm_round_robin sd_mod sg ip_conntrack_netbios_ns xt_tcpudp xt_state ip_conntrack nfnetlink xt_multiport iptable_filter ip_tables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi scsi_mod dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport i2c_amd756 i2c_amd8111 k8temp tg3 k8_edac amd_rng shpchp pcspkr i2c_core hwmon edac_mc serio_raw dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-92.1.22.el5xen #1
RIP: e030:[<ffffffff8033a54f>]  [<ffffffff8033a54f>] list_del+0x48/0x71
RSP: e02b:ffffffff8062fe80  EFLAGS: 00010286
RAX: 0000000000000058 RBX: ffff88001e3e7848 RCX: ffffffff804dd7a8
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffffffff80680800 R08: ffffffff804dd7a8 R09: 0000000000004045
R10: 0000000000000010 R11: ffffffff8034dded R12: 0000000000000100
R13: ffffffff8833a07f R14: fffffffffffffffe R15: 0000000000000000
FS:  00002b40db8b2250(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process swapper (pid: 0, threadinfo ffffffff805f0000, task ffffffff804d8b00)
Stack:  ffff88001e3e7788  ffffffff885720cc  ffff88001e3e7788  ffffffff8833b10c
 ffff88001e3e7788  ffffffff80292b1d  ffffffff8062feb0  ffffffff8062feb0
 00000100805f1ea8  0000000000000001
Call Trace:
<IRQ>  [<ffffffff885720cc>] :ip_nat:ip_nat_cleanup_conntrack+0x26/0x35
 [<ffffffff8833b10c>] :ip_conntrack:destroy_conntrack+0x60/0xdc
 [<ffffffff80292b1d>] run_timer_softirq+0x13f/0x1c6
 [<ffffffff80212802>] __do_softirq+0x62/0xde
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026dcd2>] do_softirq+0x31/0x98
 [<ffffffff8026db4d>] do_IRQ+0xec/0xf5
 [<ffffffff803a0c69>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026c683>] xen_idle+0x38/0x4a
 [<ffffffff8024aa8e>] cpu_idle+0x97/0xba
 [<ffffffff805fab09>] start_kernel+0x21f/0x224
 [<ffffffff805fa1e5>] _sinittext+0x1e5/0x1eb


Code: 0f 0b 68 f5 ff 48 80 c2 46 00 48 8b 13 48 8b 43 08 48 89 42
RIP  [<ffffffff8033a54f>] list_del+0x48/0x71
 RSP <ffffffff8062fe80>
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Comment 9 RHEL Program Management 2009-02-11 10:10:04 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Herbert Xu 2009-02-12 11:19:39 UTC

Shad, can you try the patch given here and see if it helps either the 32-bit case or the 64-bit one? Thanks!

Comment 11 Shad L. Lords 2009-02-12 14:38:57 UTC

I'm not able to try the patch myself. However, if you can build a kernel and put it somewhere, I'd be able to try it.

Comment 12 RHEL Program Management 2009-02-16 15:42:34 UTC

Updating PM score.

Comment 13 Don Zickus 2009-03-04 20:01:18 UTC

in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 16 errata-xmlrpc 2009-09-02 08:57:41 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
