Bug 516985 - When bonding is used and IPV6 is enabled the message of 'kernel: bond0: duplicate address detected!' is output
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Herbert Xu
QA Contact: Network QE
Depends On: 236750
Blocks: 525215 533192 557926
Reported: 2009-08-12 07:50 UTC by Chris Ward
Modified: 2018-11-14 18:07 UTC (History)
Doc Type: Bug Fix
Clone Of: 236750
Clones: 614240
Last Closed: 2011-01-13 20:52:35 UTC


Attachments
sosreport of other labs failing system (591.13 KB, application/x-bzip2), 2009-10-02 13:34 UTC, rob_thomas
bonding-ipv6-fixup.patch (2.22 KB, patch), 2010-02-03 18:14 UTC, Andy Gospodarek
CPU stuck for 10s! log (176.69 KB, patch), 2010-03-03 21:22 UTC, rob_thomas
patches for ipv6 bonding issue. (3.80 KB, application/x-gzip), 2010-04-15 15:53 UTC, rob_thomas
Trace log (9.95 KB, application/octet-stream), 2010-05-12 19:47 UTC, Shyam Iyer
516985-test0.patch (9.00 KB, patch), 2010-05-12 21:11 UTC, Andy Gospodarek
panic trace with upstream backport (7.42 KB, application/octet-stream), 2010-05-13 20:40 UTC, Shyam Iyer
Use states in DAD logic to prevent the IPv6 lockups (6.96 KB, patch), 2010-07-14 16:35 UTC, Shyam Iyer


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Comment 1 Chris Ward 2009-08-12 07:52:01 UTC
Mennyh, 

please provide us with additional details. This issue has been re-opened.

Comment 2 Menny Hamburger 2009-08-12 08:04:41 UTC
We got this together with this gc bug - I do not think they are related.
http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg23418.html

2009 Aug 11 01:10:13 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:12:12 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:12:12 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:12:42 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:14:58 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:14:58 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:15:41 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:15:41 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:16:29 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:19:30 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:19:30 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:19:30 node0 INFO: kernel: bond0: duplicate address detected!
2009 Aug 11 01:19:40 node0 NOTICE: kernel: nfs: server localhost not
responding, still trying
2009 Aug 11 01:19:40 node0 MAJOR: kernel: BUG: soft lockup - CPU#4 stuck for
10s! [swapper:0]
2009 Aug 11 01:19:40 node0 WARNING: kernel:
2009 Aug 11 01:19:40 node0 WARNING: kernel: Pid: 0, comm:              swapper
2009 Aug 11 01:19:40 node0 WARNING: kernel: EIP: 0060:[<c060d588>] CPU: 4
2009 Aug 11 01:19:40 node0 WARNING: kernel: EIP is at dst_destroy+0x8/0xd0
2009 Aug 11 01:19:40 node0 WARNING: kernel: EFLAGS: 00000246    Tainted: G     
 (2.6.18-128sys #1)
2009 Aug 11 01:19:40 node0 WARNING: kernel: EAX: f5283780 EBX: 00000000 ECX:
00000001 EDX: f5283780
2009 Aug 11 01:19:40 node0 WARNING: kernel: ESI: f5283780 EDI: c07a9fc4 EBP:
c060d6b0 DS: 007b ES: 007b
2009 Aug 11 01:19:40 node0 WARNING: kernel: CR0: 8005003b CR2: b60fdffc CR3:
0079f000 CR4: 000006f0
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c060d762>] dst_run_gc+0xb2/0x110
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0433961>]
run_timer_softirq+0x111/0x190
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c043b908>]
__rcu_process_callbacks+0x108/0x1a0
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c042f530>]
__do_softirq+0x80/0x150
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0407dbd>] do_softirq+0x6d/0xc0
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0405e07>]
apic_timer_interrupt+0x1f/0x24
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0564ffe>]
acpi_safe_halt+0x14/0x20
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c056519e>]
acpi_processor_idle+0x13e/0x364
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0565064>]
acpi_processor_idle+0x4/0x364
2009 Aug 11 01:19:40 node0 WARNING: kernel: [<c0403f04>] cpu_idle+0x74/0xd0
2009 Aug 11 01:19:40 node0 WARNING: kernel: =======================
2009 Aug 11 01:19:41 node0 MAJOR: kernel: BUG: soft lockup - CPU#2 stuck for
10s! [swapper:0]
2009 Aug 11 01:19:41 node0 WARNING: kernel:
2009 Aug 11 01:19:41 node0 WARNING: kernel: Pid: 0, comm:              swapper
2009 Aug 11 01:19:41 node0 WARNING: kernel: EIP: 0060:[<c067817b>] CPU: 2
2009 Aug 11 01:19:41 node0 WARNING: kernel: EIP is at
__read_lock_failed+0x3/0x18
2009 Aug 11 01:19:41 node0 WARNING: kernel: EFLAGS: 00000297    Tainted: G     
 (2.6.18-128sys #1)
2009 Aug 11 01:19:41 node0 WARNING: kernel: EAX: f902370c EBX: f902370c ECX:
00000005 EDX: c07a7e60
2009 Aug 11 01:19:41 node0 WARNING: kernel: ESI: c07a7e3c EDI: 00000005 EBP:
c07a7e58 DS: 007b ES: 007b
2009 Aug 11 01:19:41 node0 WARNING: kernel: CR0: 8005003b CR2: 00446a30 CR3:
0079f000 CR4: 000006f0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c067a905>]
_read_lock_bh+0x15/0x20
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff2007>]
ip6_pol_route_input+0x47/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900d994>]
fib6_rule_action+0x84/0xf0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900d910>]
fib6_rule_action+0x0/0xf0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0617a24>]
fib_rules_lookup+0x64/0x90
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900dbdf>]
fib6_rule_lookup+0x2f/0x80 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff04ba>]
ip6_route_input+0xea/0x100 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8fe772d>] ipv6_rcv+0x38d/0x3e0
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8fe73a0>] ipv6_rcv+0x0/0x3e0
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0609dc4>]
netif_receive_skb+0x2c4/0x450
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f890a687>]
igb_clean_rx_irq_adv+0x4d7/0x690 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f890c100>]
igb_clean_rx_ring_msix+0x40/0x1f0 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8915783>]
__kc_adapter_clean+0x23/0x40 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c060bbf0>]
net_rx_action+0xc0/0x1e0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c042f530>]
__do_softirq+0x80/0x150
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0407dbd>] do_softirq+0x6d/0xc0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0406faa>] do_nmi+0xaa/0x290
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0459d30>] __do_IRQ+0x0/0x110
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0407e9c>] do_IRQ+0x8c/0x100
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0405d76>]
common_interrupt+0x1a/0x20
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0564ffe>]
acpi_safe_halt+0x14/0x20
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c056519e>]
acpi_processor_idle+0x13e/0x364
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0403f04>] cpu_idle+0x74/0xd0
2009 Aug 11 01:19:41 node0 WARNING: kernel: =======================
2009 Aug 11 01:19:41 node0 MAJOR: kernel: BUG: soft lockup - CPU#1 stuck for
10s! [swapper:0]
2009 Aug 11 01:19:41 node0 WARNING: kernel:
2009 Aug 11 01:19:41 node0 WARNING: kernel: Pid: 0, comm:              swapper
2009 Aug 11 01:19:41 node0 WARNING: kernel: EIP: 0060:[<c067a9bf>] CPU: 1
2009 Aug 11 01:19:41 node0 WARNING: kernel: EIP is at _spin_lock_bh+0xf/0x20
2009 Aug 11 01:19:41 node0 WARNING: kernel: EFLAGS: 00000286    Tainted: G     
 (2.6.18-128sys #1)
2009 Aug 11 01:19:41 node0 WARNING: kernel: EAX: c07a6000 EBX: c0710578 ECX:
01000001 EDX: f41ea680
2009 Aug 11 01:19:41 node0 WARNING: kernel: ESI: 00000000 EDI: f5283514 EBP:
f44d92a0 DS: 007b ES: 007b
2009 Aug 11 01:19:41 node0 WARNING: kernel: CR0: 8005003b CR2: 098e92e8 CR3:
0079f000 CR4: 000006f0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c060d65d>] __dst_free+0xd/0x60
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff4058>] fib6_add+0x518/0x610
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c060ec1c>] neigh_lookup+0xbc/0xd0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff0b26>] ip6_ins_rt+0x46/0x70
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff20c9>]
ip6_pol_route_input+0x109/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900d994>]
fib6_rule_action+0x84/0xf0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900d910>]
fib6_rule_action+0x0/0xf0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0617a24>]
fib_rules_lookup+0x64/0x90
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f900dbdf>]
fib6_rule_lookup+0x2f/0x80 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8ff04ba>]
ip6_route_input+0xea/0x100 [ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8fe772d>] ipv6_rcv+0x38d/0x3e0
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8fe73a0>] ipv6_rcv+0x0/0x3e0
[ipv6]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0609dc4>]
netif_receive_skb+0x2c4/0x450
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f890a687>]
igb_clean_rx_irq_adv+0x4d7/0x690 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8907776>]
igb_set_itr+0x106/0x160 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0441166>]
hrtimer_run_queues+0x76/0x1a0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f890c100>]
igb_clean_rx_ring_msix+0x40/0x1f0 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<f8915783>]
__kc_adapter_clean+0x23/0x40 [igb]
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c060bbf0>]
net_rx_action+0xc0/0x1e0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c042f530>]
__do_softirq+0x80/0x150
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0407dbd>] do_softirq+0x6d/0xc0
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0406faa>] do_nmi+0xaa/0x290
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0459d30>] __do_IRQ+0x0/0x110
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0407e9c>] do_IRQ+0x8c/0x100
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0405d76>]
common_interrupt+0x1a/0x20
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c056524a>]
acpi_processor_idle+0x1ea/0x364
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c05653bf>]
acpi_processor_idle+0x35f/0x364
2009 Aug 11 01:19:41 node0 WARNING: kernel: [<c0403f04>] cpu_idle+0x74/0xd0
2009 Aug 11 01:19:41 node0 WARNING: kernel: =======================
2009 Aug 11 01:19:42 node0 MAJOR: kernel: BUG: soft lockup - CPU#7 stuck for
10s! [swapper:0]
2009 Aug 11 01:19:42 node0 WARNING: kernel:
2009 Aug 11 01:19:42 node0 WARNING: kernel: Pid: 0, comm:              swapper
2009 Aug 11 01:19:42 node0 WARNING: kernel: EIP: 0060:[<c067817d>] CPU: 7
2009 Aug 11 01:19:42 node0 WARNING: kernel: EIP is at
__read_lock_failed+0x5/0x18
2009 Aug 11 01:19:42 node0 WARNING: kernel: EFLAGS: 00000297    Tainted: G     
 (2.6.18-128sys #1)
2009 Aug 11 01:19:42 node0 WARNING: kernel: EAX: f902370c EBX: f902370c ECX:
00000005 EDX: c07ace14
2009 Aug 11 01:19:42 node0 WARNING: kernel: ESI: c07acdf0 EDI: 00000005 EBP:
c07ace0c DS: 007b ES: 007b
2009 Aug 11 01:19:42 node0 WARNING: kernel: CR0: 8005003b CR2: 00d710a0 CR3:
0079f000 CR4: 000006f0
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c067a905>]
_read_lock_bh+0x15/0x20
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8ff2007>]
ip6_pol_route_input+0x47/0x1c0 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f900d994>]
fib6_rule_action+0x84/0xf0 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f900d910>]
fib6_rule_action+0x0/0xf0 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0617a24>]
fib_rules_lookup+0x64/0x90
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f900dbdf>]
fib6_rule_lookup+0x2f/0x80 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8ff1fc0>]
ip6_pol_route_input+0x0/0x1c0 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8ff04ba>]
ip6_route_input+0xea/0x100 [ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8fe772d>] ipv6_rcv+0x38d/0x3e0
[ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c064db83>] arp_rcv+0xa3/0x130
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f8fe73a0>] ipv6_rcv+0x0/0x3e0
[ipv6]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0609dc4>]
netif_receive_skb+0x2c4/0x450
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0605523>] __alloc_skb+0x53/0x110
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<f899c8d6>] bnx2_poll+0x546/0x1160
[bnx2]
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0424a4c>]
__build_sched_domains+0x26c/0xdd0
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0407e9c>] do_IRQ+0x8c/0x100
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c060bbf0>]
net_rx_action+0xc0/0x1e0
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c042f530>]
__do_softirq+0x80/0x150
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0407dbd>] do_softirq+0x6d/0xc0
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0405e07>]
apic_timer_interrupt+0x1f/0x24
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c056524a>]
acpi_processor_idle+0x1ea/0x364
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c05653c0>]
acpi_processor_idle+0x360/0x364
2009 Aug 11 01:19:42 node0 WARNING: kernel: [<c0403f04>] cpu_idle+0x74/0xd0 

Since we do have some patches over bonding, I will try to see if they are the reason for this. 
What was the patch on bonding that required this change?

Comment 3 rob_thomas 2009-09-21 18:46:29 UTC
Duplicated with RHEL5U4.

Sep 20 13:34:46 host kernel: bond1: IPv6 duplicate address detected!
Sep 20 13:35:46 host last message repeated 5 times
Sep 20 13:37:16 host last message repeated 12 times
Sep 20 13:38:18 host last message repeated 6 times
Sep 20 13:39:28 host last message repeated 5 times
Sep 20 13:39:28 host last message repeated 2 times
Sep 20 13:39:38 host kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Sep 20 13:39:38 host kernel:
Sep 20 13:39:38 host kernel: Pid: 0, comm:              swapper
Sep 20 13:39:38 host kernel: EIP: 0060:[<c046edf0>] CPU: 0
Sep 20 13:39:38 host kernel: EIP is at kmem_cache_free+0x70/0x74
Sep 20 13:39:38 host kernel:  EFLAGS: 00000246    Not tainted  (2.6.18-164.el5PAE #1)
Sep 20 13:39:38 host kernel: EAX: 00000078 EBX: f5b77000 ECX: f7c37080 EDX: c16557e0
Sep 20 13:39:38 host kernel: ESI: 00000246 EDI: f2abf280 EBP: 0000000a DS: 007b ES: 007b
Sep 20 13:39:38 host kernel: CR0: 8005003b CR2: 00c92594 CR3: 00737000 CR4: 000006f0
Sep 20 13:39:38 host kernel:  [<c05bda7e>] dst_destroy+0x86/0xb2
Sep 20 13:39:38 host kernel:  [<c05bdb09>] dst_run_gc+0x0/0xee
Sep 20 13:39:38 host kernel:  [<c05bdb57>] dst_run_gc+0x4e/0xee
Sep 20 13:39:38 host kernel:  [<c042c7cc>] run_timer_softirq+0xfb/0x151
Sep 20 13:39:38 host kernel:  [<c04292fb>] __do_softirq+0x87/0x114
Sep 20 13:39:38 host kernel:  [<c04073bb>] do_softirq+0x52/0x9c
Sep 20 13:39:38 host kernel:  [<c04059d7>] apic_timer_interrupt+0x1f/0x24
Sep 20 13:39:38 host kernel:  [<c052aa90>] acpi_safe_halt+0x14/0x20
Sep 20 13:39:38 host kernel:  [<c052ac30>] acpi_processor_idle+0x13e/0x386
Sep 20 13:39:38 host kernel:  [<c0403ca8>] cpu_idle+0x9f/0xb9
Sep 20 13:39:38 host kernel:  [<c06fd9f0>] start_kernel+0x37b/0x383
Sep 20 13:39:38 host kernel:  =======================

Comment 4 rob_thomas 2009-09-28 16:38:31 UTC
This bug appears to be identical to bug 489895 except for the duplicate address messages.

https://bugzilla.redhat.com/show_bug.cgi?id=489895

Comment 5 Andy Gospodarek 2009-09-30 00:17:02 UTC
Rob, do you have any idea what the uptime is like for a system when this
happens and whether there are a lot of entries in the route table?

I see how a long run of dst_run_gc (one that takes more than 10s) could lock
out the other threads (since the dst_lock will be held by dst_run_gc and other
threads will need it), so I just want to understand your failure environment a
bit better.

Thanks!

Comment 6 rob_thomas 2009-09-30 13:46:48 UTC
Hi Andy,

The lab that is currently seeing the error experiences the failure within 3 hours.  He is not running any IPv6 traffic.  The server is just configured and sitting on the network.  The router is assigning a prefix which gives the interfaces a global address.  He has also confirmed that if the bond is removed, he does not see a kernel panic (tested over 24 hours).

I talked to the team that reported bug 489895 and have gotten hold of their program, which induces the error within minutes instead of the hours/days originally reported in that bug.  However, I have been unable to duplicate this issue with either U3 or U4, even with accelerated traffic from the test program.

The test program that the other team created uses captured packets that cause the error and blasts them out to ff:ff:ff:ff:ff:ff (the Ethernet broadcast address).

I have already asked whether I am able to share this program with you, and you are welcome to it.  I just want to reproduce the error myself first so that I am in a better position to help.

Rob

Comment 7 Andy Gospodarek 2009-09-30 14:23:38 UTC
3 hours...wow.  Can you send me a sys/sosreport?

I'd like to try to set up an identical system in our lab to see if I can reproduce it.

Comment 8 Shyam kumar Iyer 2009-09-30 19:31:51 UTC
Andy,

We have discussed this issue internally. It is exactly the same as bug 489895 and, contrary to comment #4, the duplicate address warnings are common to both bugs.

Just so that you are aware, we have already tried the patch from comment #36 of bug 489895.

Some of my observations from the older bug that could be revisited here:

"
Shouldn't we disable the interface or ipv6 if a duplicate address is detected?

Excerpts from RFC 4862

5.4.5.  When Duplicate Address Detection Fails

   A tentative address that is determined to be a duplicate as described
   above MUST NOT be assigned to an interface, and the node SHOULD log a
   system management error.

<Shyam> We do this by throwing up the duplicate address warning. So we are
green here. </Shyam>

   If the address is a link-local address formed from an interface
   identifier based on the hardware address, which is supposed to be
   uniquely assigned (e.g., EUI-64 for an Ethernet interface), IP
   operation on the interface SHOULD be disabled.  By disabling IP
   operation, the node will then:

   -  not send any IP packets from the interface,
<Shyam>This is point no. 1. We don't do this</Shyam>

   -  silently drop any IP packets received on the interface, and
<Shyam>This is point no. 2. We don't do this either.</Shyam>
   -  not forward any IP packets to the interface (when acting as a
      router or processing a packet with a Routing header).



Points 1 and 2 are not done, I guess, because we do neither of the
following:

1) disable the interface
2) disable ipv6.

So, I guess we did not solve the duplicate address problem with just the
IFF_SLAVE patch.

We need to take care of switches which could get confused in a setup not
configured for bonding (ports are not channel grouped/trunked in this setup).
These switches could keep the network confused and we will have duplicate
addresses all around.

I believe this creates a race condition in the ref_cnts of the dst entries,
which in turn causes the crash.


So, I think the best thing to do then is to consider the options discussed in
this thread:
http://www.mail-archive.com/netdev@vger.kernel.org/msg58612.html  
"

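As a reading aid, the decision in the quoted section 5.4.5 can be written as a small function. This is only a sketch of the RFC text as quoted above, not of the kernel's DAD code; struct dad_addr, the DAD_* flags and dad_failure_actions() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an address that just failed DAD; not a kernel type. */
struct dad_addr {
    bool link_local;       /* the address is a link-local address */
    bool iid_from_hwaddr;  /* interface ID derived from the hardware address (e.g. EUI-64) */
};

/* Actions named in the quoted text, as bit flags (illustrative names). */
enum dad_action {
    DAD_DO_NOT_ASSIGN = 1 << 0,  /* MUST NOT assign the tentative address */
    DAD_LOG_ERROR     = 1 << 1,  /* SHOULD log a system management error */
    DAD_DISABLE_IP    = 1 << 2   /* SHOULD disable IP operation on the interface */
};

/* Returns the set of actions the quoted section asks for after a DAD failure. */
unsigned int dad_failure_actions(const struct dad_addr *addr)
{
    unsigned int actions = DAD_DO_NOT_ASSIGN | DAD_LOG_ERROR;

    assert(addr != NULL);
    /* Only a hardware-derived link-local duplicate calls for disabling IP. */
    if (addr->link_local && addr->iid_from_hwaddr)
        actions |= DAD_DISABLE_IP;
    return actions;
}
```

For a bond whose link-local address is derived from a slave's MAC address, both fields would be true, so the sketch returns all three flags. That third flag corresponds to the points Shyam marks as not being done today.
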
Comment 9 rob_thomas 2009-10-02 13:34:21 UTC
Created attachment 363482 [details]
sosreport of other labs failing system

Sosreport of the other lab's failing system.  I have also been able to use the accelerated traffic test program to induce failure in my lab on a RHEL5U3 install, but not RHEL5U4.  I have asked the other lab that reported this error to perform a fresh install of RHEL5U4 to see if the error still exists.

Rob

Comment 10 Andy Gospodarek 2009-10-16 21:32:10 UTC
OK, so there are a few things to think about as I look at this:

The bonding mode used is round-robin (mode 0), so every frame that comes into the box will be passed up the stack rather than some of the broadcast and multicast frames getting dropped as would happen in some of the other modes.

Not only will every frame make its way into the box, but every frame will be received as if it came into the box on bond0.  In a configuration with 4 interfaces in the bond, the ipv6 code will actually appear to receive 4 frames each time one is received.

Comment #6 also indicates that this was not seen when bonding was disabled.  That run happened over a 24-hour period and most failures with bonding seemed to happen in around 3 hours.  I consider those to be fair statements, but I am curious whether a non-bonding configuration was ever tested with the application mentioned in comment #6.

My suspicion about this deadlock is that it is timing related and the method by which frames are received by interfaces when they are included in a bond is making this appear more quickly.  We also may be able to more reliably reproduce this if more interfaces are added to the bond.

Rob, can you use your test program on a non-bonding configuration and see if you can reproduce the problem?  If you are able to reproduce it that way, I think it would help us narrow down whether this is an ipv6 interaction with a virtual device like a bonded interface or if this is purely a stack problem.

Thanks!
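
To put numbers on the frame multiplication described above, here is a small model. It assumes the switch floods each broadcast or multicast frame to every slave port, and that active-backup is one of the "other modes" in which frames arriving on inactive slaves are dropped. The enum and function names are made up for illustration and do not come from the bonding driver.

```c
#include <assert.h>

/* Hypothetical stand-ins for two bonding modes; not the driver's constants. */
enum model_bond_mode {
    MODEL_MODE_ROUND_ROBIN,   /* mode 0: every slave passes frames up the stack */
    MODEL_MODE_ACTIVE_BACKUP  /* assumed to deliver only from the active slave */
};

/*
 * Number of copies of one flooded broadcast/multicast frame that the IPv6
 * stack sees as arriving on the bond device.
 */
unsigned int copies_seen_on_bond(enum model_bond_mode mode,
                                 unsigned int slave_count)
{
    assert(slave_count > 0);
    if (mode == MODEL_MODE_ROUND_ROBIN)
        return slave_count;  /* each slave hands its copy to bond0 */
    return 1;                /* only the active slave's copy survives */
}
```

With 4 slaves in round-robin mode the model returns 4, matching the "4 frames each time one is received" case; with 8 slaves it returns 8, so under this model the same traffic puts more load on the stack as slaves are added.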

Comment 11 rob_thomas 2009-11-11 20:06:11 UTC
I do not get the error with non-bonded interfaces.  The other lab has seen the error with RHEL5U4 and bonding-rr mode.

Nov 10 22:31:25 server kernel: BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
Nov 10 22:31:25 server kernel:
Nov 10 22:31:25 server kernel: Pid: 0, comm:              swapper
Nov 10 22:31:25 server kernel: EIP: 0060:[<c05bd9f8>] CPU: 0
Nov 10 22:31:25 server kernel: EIP is at dst_destroy+0x0/0xb2
Nov 10 22:31:25 server kernel:  EFLAGS: 00000246    Not tainted  (2.6.18-164.el5PAE #1)
Nov 10 22:31:25 server kernel: EAX: f460b680 EBX: 00000000 ECX: 00000001 EDX: f460b680
Nov 10 22:31:25 server kernel: ESI: c07e21e4 EDI: c05bdb09 EBP: 0000000a DS: 007b ES: 007b
Nov 10 22:31:25 server kernel: CR0: 8005003b CR2: b7ef1d2c CR3: 376a2f80 CR4: 000006f0
Nov 10 22:31:25 server kernel:  [<c05bdb57>] dst_run_gc+0x4e/0xee
Nov 10 22:31:25 server kernel:  [<c042c7cc>] run_timer_softirq+0xfb/0x151
Nov 10 22:31:25 server kernel:  [<c04292fb>] __do_softirq+0x87/0x114
Nov 10 22:31:25 server kernel:  [<c04073bb>] do_softirq+0x52/0x9c
Nov 10 22:31:25 server kernel:  [<c04059d7>] apic_timer_interrupt+0x1f/0x24
Nov 10 22:31:25 server kernel:  [<c052aa90>] acpi_safe_halt+0x14/0x20
Nov 10 22:31:25 server kernel:  [<c052ac30>] acpi_processor_idle+0x13e/0x386
Nov 10 22:31:25 server kernel:  [<c0403ca8>] cpu_idle+0x9f/0xb9
Nov 10 22:31:25 server kernel:  [<c06fd9f0>] start_kernel+0x37b/0x383

Comment 12 Andy Gospodarek 2010-02-03 18:09:11 UTC
I traced this back and discovered that ip6_dst_lookup_tail would crash on line 834 shown here (the line number may differ depending on the RHEL release) when (*dst)->neighbour was NULL.

 824 
 825 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 826                 /*
 827                  * Here if the dst entry we've looked up
 828                  * has a neighbour entry that is in the INCOMPLETE
 829                  * state and the src address from the flow is
 830                  * marked as OPTIMISTIC, we release the found
 831                  * dst entry and replace it instead with the
 832                  * dst entry of the nexthop router
 833                  */
 834                 if (!((*dst)->neighbour->nud_state & NUD_VALID)) {
 835                         struct inet6_ifaddr *ifp;
 836                         struct flowi fl_gw;
 837                         int redirect;
 838 
 839                         ifp = ipv6_get_ifaddr(&fl->fl6_src, (*dst)->dev, 1);

It seemed that adding a check before that dereference would probably be adequate.  After playing with return codes and other values, just adding a check to the 'if' on line 834 seemed to be the best option.
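
The guard described above can be sketched in user space as follows. This is a minimal model and not the backported patch: the model_* types, the MODEL_NUD_* values and needs_optimistic_fallback() are hypothetical stand-ins for the kernel structures and for the condition on line 834.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for neighbour states; the kernel's NUD_VALID
 * covers more states than the single one modelled here. */
#define MODEL_NUD_INCOMPLETE 0x01
#define MODEL_NUD_REACHABLE  0x02
#define MODEL_NUD_VALID      MODEL_NUD_REACHABLE

/* Only the fields used by the check are modelled. */
struct model_neighbour { unsigned char nud_state; };
struct model_dst       { struct model_neighbour *neighbour; };

/*
 * Returns 1 when the optimistic-DAD fallback branch should run: a neighbour
 * entry exists and is not in a valid state. A NULL neighbour returns 0
 * instead of being dereferenced, which is the case that crashed.
 */
int needs_optimistic_fallback(const struct model_dst *dst)
{
    assert(dst != NULL);
    return dst->neighbour != NULL &&
           !(dst->neighbour->nud_state & MODEL_NUD_VALID);
}
```

The short-circuit `&&` is what matters here: the state test is only evaluated once the pointer is known to be non-NULL, so a dst entry without a neighbour simply skips the branch.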

Apparently upstream agreed as I found this upstream commit:

commit e550dfb0c2c31b6363aa463a035fc9f8dcaa3c9b
Author: Neil Horman <nhorman>
Date:   Tue Sep 9 13:51:35 2008 -0700

    ipv6: Fix OOPS in ip6_dst_lookup_tail().

I backported this and I am testing it now.  I will provide the backported patch when it runs the test I have been given for around 1 hour without crashing.  The -185 kernel would normally crash in <5 minutes.

Here is the bonding configuration on my host that can normally reproduce the problem quickly:

# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:36:0a:d4

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:36:0a:d6

# ifconfig bond0 
bond0     Link encap:Ethernet  HWaddr 00:10:18:36:0A:D4  
          inet6 addr: fe80::210:18ff:fe36:ad4/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:82 (82.0 b)  TX bytes:176 (176.0 b)


NOTE: This test does fill dmesg with the following messages:

ICMPv6 RA: ndisc_router_discovery() failed to add default route.
printk: 369 messages suppressed.
ICMPv6 RA: ndisc_router_discovery() failed to add default route.
printk: 493 messages suppressed.
ICMPv6 RA: ndisc_router_discovery() failed to add default route.
printk: 567 messages suppressed.
ICMPv6 RA: ndisc_router_discovery() failed to add default route.
printk: 543 messages suppressed.
ICMPv6 RA: ndisc_router_discovery() failed to add default route.
printk: 425 messages suppressed.
ICMPv6 RA: ndisc_router_discovery() failed to add default route.

but this is a synthetic test, so I have no problem with these being in the log.

Comment 13 Andy Gospodarek 2010-02-03 18:14:44 UTC
Created attachment 388583 [details]
bonding-ipv6-fixup.patch

This patch should resolve this issue.  Please test it and report back any feedback you have.  This will apply on both 2.6.18-164 and 2.6.18-186.

Comment 14 Andy Gospodarek 2010-02-03 18:20:01 UTC
This patch will resolve panics that look like this:

Unable to handle kernel NULL pointer dereference at 0000000000000024 RIP:
 [<ffffffff8027485e>] __xfrm_lookup+0x6c/0x4a8
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /class/misc/autofs/dev
CPU 0
Modules linked in: autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) bonding(U) 8021q(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) n)
Pid: 2929, comm: sshd Tainted: G      2.6.18-prep #1
RIP: 0010:[<ffffffff8027485e>]  [<ffffffff8027485e>] __xfrm_lookup+0x6c/0x4a8
RSP: 0018:ffff81002e65fd38  EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff81002ec0ee28 RCX: 0000000000000001
RDX: ffff81002ec0eb40 RSI: ffff81002e65fe08 RDI: ffff81002e65fe80
RBP: ffff81002ec0eb40 R08: ffffffff80310d28 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000000
R13: 0000000000000000 R14: ffff81002e65fe08 R15: 0000000000000000
FS:  00002ba63d5b5ea0(0000) GS:ffffffff803c9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000024 CR3: 000000002f664000 CR4: 00000000000006e0
Process sshd (pid: 2929, threadinfo ffff81002e65e000, task ffff81003f540040)
Stack:  000000012ec0ee48 ffff81002ec0eb40 ffff81002e65fe80 ffff810000000001
 00020011885b6000 0000000000000002 0000000000000000 000000008023bd0c
 0000000000020011 0000000000000000 ffff810000000000 000000000000001c
Call Trace:
 [<ffffffff8857bf85>] :ipv6:ip6_datagram_connect+0x348/0x507
 [<ffffffff80237472>] net_random+0x18/0x1b
 [<ffffffff80065b29>] _spin_lock_bh+0x9/0x14
 [<ffffffff80031101>] release_sock+0x13/0xaa
 [<ffffffff80226c65>] sys_connect+0x7e/0xae
 [<ffffffff800b8354>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

but I am not seeing any of the soft-lockups when running v6blast with the latest RHEL5.5 kernel.

Please test the patch attached in comment #13 against the latest kernel here:

http://people.redhat.com/jwilson/el5/

and let me know if you can still reproduce the soft-lockups.

Comment 16 Andy Gospodarek 2010-02-09 02:29:52 UTC
Rob, any update for me?  I would like to get this patch in comment #13 included if it resolves your issue and we are down to the wire on this one.

Comment 17 rob_thomas 2010-02-10 17:25:43 UTC
Hi Andy,

I'm currently attempting to get our sister lab to test this as their lab is the only lab that has seen this problem.  As of U4, I could never make it fail.

Comment 20 Eugene Teo (Security Response) 2010-02-11 05:05:27 UTC

*** This bug has been marked as a duplicate of bug 563781 ***

Comment 21 rob_thomas 2010-02-16 16:24:06 UTC
Hi Andy,

We were able to make U5 Beta fail with v6blast and 8 bonded NICs.  However, the parameters have changed when invoking v6blast:

v6blast -i eth0 -r 30 -l 1800

Instead of 300 packets per second, 30 packets per second invokes the CPU soft lockup.

Rob

Comment 22 Andy Gospodarek 2010-02-16 18:43:37 UTC
Rob, thanks for letting me know it is still a problem.  I'm going to re-open this for now.

Did you ever test with the patch in comment #13 applied on top of 5.4 or the latest 5.5 kernel?

Comment 23 rob_thomas 2010-02-16 19:13:54 UTC
Both kernels were tested.  U4 patched and the kernel from comment #13.

Rob

Comment 24 rob_thomas 2010-02-16 19:14:48 UTC
Sorry, kernel from comment #14.

Rob

Comment 25 Andy Gospodarek 2010-02-16 19:21:00 UTC
(In reply to comment #23)
> Both kernels were tested.  U4 patched and the kernel from comment #13.
> 
> Rob    

OK, good to know.

Comment 30 rob_thomas 2010-02-26 21:52:15 UTC
Hi Andy.  Were you able to reproduce?

Comment 31 Andy Gospodarek 2010-03-02 03:22:43 UTC
Rob, I was not, but I will revert my hardware back to the exact setup I used before and give it another try.

Comment 32 rob_thomas 2010-03-02 15:55:50 UTC
Add as many NICs as you can to the bond; that seems to be the key differentiator.

Rob

Comment 33 Andy Gospodarek 2010-03-02 22:38:15 UTC
I've got 3 NICs in a bond and though I get many of these messages:

bond0: IPv6 duplicate address detected!

I'm not getting any soft-lockups.  Rob can you paste the soft-lockup messages?

Comment 34 rob_thomas 2010-03-02 22:51:43 UTC
Hi Andy,

It appears that this issue may be limited to Intel quad port NICs.  I cannot reproduce with Snapshot 2 and Broadcom.  I'm in the process of reconfiguring to Intel quad port NICs and will update late tomorrow afternoon with the results.

Rob

Comment 35 Andy Gospodarek 2010-03-03 02:20:41 UTC
Rob, that is an interesting data-point.

Was this noticed on e1000e or igb-based NICs?  I've got dual-port flavors of cards that are supported by those drivers, so I can test them a bit. 

Also if this fails on a specific dual-port LOM on a Dell system let me know and I can see if we have one in our lab.

Comment 36 rob_thomas 2010-03-03 20:53:44 UTC
I've duplicated the error on all Broadcom; it just takes hours versus minutes for the Intel quad-port NICs.  The driver is igb-based with the 82575 chipset.  This was with Snapshot 2.  I'm in the process of hooking up a serial port to capture the output because the error doesn't make it to messages.

Comment 37 rob_thomas 2010-03-03 21:22:08 UTC
Created attachment 397665 [details]
CPU stuck for 10s! log

These are the printks when using Intel NICs with bonded IPv6.

Comment 41 Andy Gospodarek 2010-04-14 22:04:57 UTC
Rob, have you ever run one of the debug kernels to see if it kicks out any interesting information regarding these locks?

Comment 42 rob_thomas 2010-04-15 15:53:53 UTC
Created attachment 406842 [details]
patches for ipv6 bonding issue.

Comment 43 rob_thomas 2010-04-15 16:00:25 UTC
Hi,

I've attached some patches for this issue.  Basically it appears dst is trying to update and remove the same route due to duplicate concurrent router updates being received from the bond interface.  The patch for dst.c is ported from 2.6.33.  The patch first appeared in 2.6.24.

http://www.mail-archive.com/netdev@vger.kernel.org/msg47489.html

I also placed a spin_lock_bh around ip6_route_input to serialize the route updates.  IPv4 does a rcu_read_lock on ip_route_input.

Enabling cache debug causes a kernel panic, though I have not spent any time on this.

ROb

Comment 44 Andy Gospodarek 2010-04-30 18:01:18 UTC
This patch seems to be the one referenced:

commit 86bba269d08f0c545ae76c90b56727f65d62d57f
Author: Eric Dumazet <dada1>
Date:   Wed Sep 12 14:29:01 2007 +0200

    [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue

Comment 45 Andy Gospodarek 2010-04-30 19:55:23 UTC
Rob and Shyam, I have 3 questions:

1.  When running the patches included in comment #42, does this problem appear to be fixed or not?  (If so, but enabling CACHE_DEBUG >= 2 is still an issue, we can work around that.)

2.  If these patches resolve the issue, was the patch named 'ipv6_route_c.patch' required for success?

3.  I noticed that the patches in comment #42 include at least the following upstream commits (which is basically all that is upstream that we can take and are not whitespace fixes):

commit 2fc1b5dd99f66d93ffc23fd8df82d384c1a354c8
Author: Eric Dumazet <eric.dumazet>
Date:   Mon Feb 8 15:00:39 2010 -0800

    dst: call cond_resched() in dst_gc_task()

commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
Author: Eric Dumazet <dada1>
Date:   Fri Nov 14 00:53:54 2008 -0800

    net: speedup dst_release()

commit f262b59becc3f557da6460232abac13706402849
Author: Benjamin Thery <benjamin.thery>
Date:   Fri Sep 12 16:16:37 2008 -0700

    net: fix scheduling of dst_gc_task by __dst_free

commit 8d3308687f7f1eaa1bb5d202d14752d5f90068eb
Author: Ilpo Järvinen <ilpo.jarvinen>
Date:   Thu Mar 27 17:53:31 2008 -0700

    [NET]: uninline dst_release

commit 64b7d96167977850f4a24e52dd0a76b03c6542cf
Author: Eric Dumazet <dada1>
Date:   Tue Dec 11 02:00:30 2007 -0800

    [NET]: dst_ifdown() cleanup

commit 86bba269d08f0c545ae76c90b56727f65d62d57f
Author: Eric Dumazet <dada1>
Date:   Wed Sep 12 14:29:01 2007 +0200

    [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue

Were all of those required?  I have ideas of which ones would not be needed, but I'm curious if you tried them all first or what the process was.

Thanks!

Comment 46 rob_thomas 2010-04-30 20:12:31 UTC
1.  Yes
2.  Yes.  

Just putting a spin_lock_bh around ip6_route_input was not enough, nor was moving dst to a workqueue; other panics or soft lockups would occur.

I did not piecemeal patch by patch.  I noticed that 2.6.33 was working with the synthetic test (v6blast).  So I poked around and noticed the patch where dst was changed to use workqueues and just used dst from 2.6.33.

I ran v6blast for 7 days here against the patches I attached.  V6blast is not very reliable, as I have been fooled into a false sense of security in the past.  I tested this in a real environment last week overnight and the patches were successful.  The real environment could cause the soft lockup in minutes, but always failed overnight.

Comment 47 Andy Gospodarek 2010-04-30 20:37:34 UTC
Thanks, Rob.  Based on your success with 2.6.33 and the fact that there is no locking/serialization around fib6_rule_lookup in ip6_route_input upstream today, would you be willing to drop the changes from ipv6_route_c.patch and give that a try in your real environment?  As much as I've tried I still cannot reproduce this with v6blast (which you've also understandably stated isn't always as trustworthy as your real environment).  I would like to know that the lock is not needed, but if it is, I would like to push it upstream and copy this to our other distros.

For the record, I would be happy to take all of the changes I listed in comment #45 (which should match your changes), but I suspect that this:

commit 2fc1b5dd99f66d93ffc23fd8df82d384c1a354c8
Author: Eric Dumazet <eric.dumazet>
Date:   Mon Feb 8 15:00:39 2010 -0800

    dst: call cond_resched() in dst_gc_task()

    Kernel bugzilla #15239

    On some workloads, it is quite possible to get a huge dst list to
    process in dst_gc_task(), and trigger soft lockup detection.

    Fix is to call cond_resched(), as we run in process context.

    Reported-by: Pawel Staszewski <pstaszewski>
    Tested-by: Pawel Staszewski <pstaszewski>
    Signed-off-by: Eric Dumazet <eric.dumazet>
    Signed-off-by: David S. Miller <davem>

diff --git a/net/core/dst.c b/net/core/dst.c
index 57bc4d5..cb1b348 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -17,6 +17,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <net/net_namespace.h>
+#include <linux/sched.h>

 #include <net/dst.h>

@@ -79,6 +80,7 @@ loop:
        while ((dst = next) != NULL) {
                next = dst->next;
                prefetch(&next->next);
+               cond_resched();
                if (likely(atomic_read(&dst->__refcnt))) {
                        last->next = dst;
                        last = dst;



is the patch that resolves the soft-lockup.

Thanks for your work on this.
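
For readers unfamiliar with the pattern in the diff above, here is a minimal user-space sketch of it, not kernel code. It assumes a hypothetical `struct entry` list standing in for the dst garbage list and uses POSIX `sched_yield()` as a loose analog of `cond_resched()`; the names `gc_sweep` and `push_entry` are invented for illustration.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dst entry: a list node with a refcount. */
struct entry {
    struct entry *next;
    int refcnt;
};

/*
 * Walk the list, free entries whose refcount is zero, keep the rest in
 * their original order. sched_yield() plays the role that cond_resched()
 * plays in dst_gc_task(): it lets other runnable work proceed during a
 * very long walk. Returns the surviving list; *freed counts freed nodes.
 */
struct entry *gc_sweep(struct entry *head, size_t *freed)
{
    struct entry *kept = NULL;
    struct entry **tail = &kept;

    assert(freed != NULL);
    *freed = 0;
    while (head != NULL) {
        struct entry *next = head->next;

        sched_yield();              /* analog of cond_resched() */
        if (head->refcnt > 0) {     /* still referenced: keep it */
            head->next = NULL;
            *tail = head;
            tail = &head->next;
        } else {                    /* unreferenced: destroy it */
            free(head);
            (*freed)++;
        }
        head = next;
    }
    return kept;
}

/* Illustrative helper: push a new entry on the front of a list. */
struct entry *push_entry(struct entry *head, int refcnt)
{
    struct entry *e = malloc(sizeof(*e));

    assert(e != NULL);
    e->refcnt = refcnt;
    e->next = head;
    return e;
}
```

The sketch yields on every iteration only to mirror the diff. Yielding is safe here for the same reason the commit message gives: the loop runs in process context, not in softirq context.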

Comment 48 rob_thomas 2010-05-03 18:44:02 UTC
The changes in ip6_route.c, i.e., the spin_lock_bh, were needed, as otherwise another kernel panic would happen elsewhere in the ip6 code.  I did not debug this either.

cond_resched() did not make a difference for this issue.  cond_resched() was removed between 2.6.24 and 2.6.33.  I have no hard facts for cond_resched() to be included or excluded.  Personally, I would keep cond_resched() in since it was originally put there to resolve another type of soft lockup.

Comment 49 Andy Gospodarek 2010-05-03 19:20:14 UTC
Thanks, Rob.  I still see the cond_resched() in the latest net-next-2.6 tree, so it still must serve its intended purpose.  I know this will sound frustrating (and I appreciate all that you have done to debug this), but I must understand the different panic you describe when not using the spin_[un]lock_bh as shown in ipv6_route_c.patch.  There must be another fix for this upstream if you did not see any problems when running 2.6.33, so I would rather incorporate that fix as it is consistent with upstream.  Maybe we should see if Shyam could help reproduce this in-house?

Comment 50 rob_thomas 2010-05-03 20:03:12 UTC
Yeah, get Shyam on it.

Comment 51 Shyam Iyer 2010-05-04 16:27:43 UTC
Just opened a ticket to create isolated network.

Comment 52 Shyam Iyer 2010-05-04 21:29:22 UTC
As I get the setup working...

Just a piece of info: upstream has reported similar problems around dst_cache locking, so I don't believe upstream is completely free of this problem.

Reference link:

http://kerneltrap.org/mailarchive/linux-netdev/2010/4/20/6275107

Comment 53 Andy Gospodarek 2010-05-05 21:24:31 UTC
Shyam, I'm fine to take an alternative solution that is upstream.  I would just rather not take the patch without understanding if it is broken upstream and how we can fix it there too.

Comment 54 Shyam Iyer 2010-05-12 19:42:02 UTC
Andy,

I got the call trace to reproduce in-house ...

Please check dell-per710-01.lab.bos.redhat.com.

Only console/drac access is available as it is on an isolated network.

I used the following option to synthesize it ..

#v6blast -i eth0 -r 100 -l 1800

v6blast is a flooding tool that floods router advertisements..

Thanks,
Shyam

Comment 55 Shyam Iyer 2010-05-12 19:47:49 UTC
Created attachment 413540 [details]
Trace log

Comment 56 Andy Gospodarek 2010-05-12 20:33:47 UTC
Based on upstream activity, Herbert may find this one interesting.

Comment 57 Andy Gospodarek 2010-05-12 21:11:01 UTC
Created attachment 413565 [details]
516985-test0.patch

Shyam, here is a patch against 2.6.18-194 that should have all of the current upstream patches in net/core/dst.c that we could take.  I suspect you will be able to reproduce the problem, but please let me know.  Thanks!

Comment 58 Shyam Iyer 2010-05-13 20:40:43 UTC
Created attachment 413892 [details]
panic trace with upstream backport

So, the upstream backport had a kernel panic as opposed to the deadlock that I observed with the 194 kernel.

Comment 59 rob_thomas 2010-05-13 22:17:53 UTC
If you place the spin_lock_bh in ip6_route.c the panic will go away.

--- linux-2.6.18.i686.rob/net/ipv6/route.c	2010-03-18 14:25:52.000000000 -0500
+++ linux-2.6.18.i686/net/ipv6/route.c	2010-04-12 15:42:30.000000000 -0500
@@ -730,6 +730,9 @@
 	return rt;
 }
 
+// Fix for IPv6 softlockup -- ROb 31MAR10
+DEFINE_SPINLOCK(route_update_lock);
+
 void ip6_route_input(struct sk_buff *skb)
 {
 	struct ipv6hdr *iph = skb->nh.ipv6h;
@@ -752,7 +755,13 @@
 	if (rt6_need_strict(&iph->daddr))
 		flags |= RT6_LOOKUP_F_IFACE;
 
+// Place spin lock around route updates to serialize.
+// This is to battle duplicate router update packets
+// arriving from bonding interfaces.
+// Part of fix for IPv6 soft lockup - ROb - 31MAR10
+	spin_lock_bh(&route_update_lock);
 	skb->dst = fib6_rule_lookup(&fl, flags, ip6_pol_route_input);
+	spin_unlock_bh(&route_update_lock);
 }
 
 static struct rt6_info *ip6_pol_route_output(struct fib6_table *table,
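
As a rough user-space analog of the serialization in the diff above (an illustration only, not the kernel change itself), the sketch below guards a shared counter with a POSIX mutex so that concurrent callers cannot interleave their updates. The counter, `route_input`, `worker`, and `run_workers` are hypothetical names; `pthread_mutex_lock` stands in loosely for `spin_lock_bh`, which in the kernel additionally disables bottom halves.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state standing in for the route cache. */
static long cache_updates;
static pthread_mutex_t route_update_lock = PTHREAD_MUTEX_INITIALIZER;

/* Analog of the locked lookup: only one caller updates at a time. */
static void route_input(void)
{
    pthread_mutex_lock(&route_update_lock);
    cache_updates++;    /* without the lock, this increment would race */
    pthread_mutex_unlock(&route_update_lock);
}

/* Each worker performs *arg serialized updates. */
static void *worker(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++)
        route_input();
    return NULL;
}

/* Run nthreads workers, each doing per_thread updates; return the total. */
long run_workers(int nthreads, long per_thread)
{
    pthread_t tids[16];

    assert(nthreads > 0 && nthreads <= 16);
    cache_updates = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &per_thread);

        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return cache_updates;
}
```

Build with `-pthread`. With the lock in place, `run_workers(4, 10000)` returns exactly 40000. The cost is the same one discussed in this thread: every lookup now contends on a single global lock.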

Comment 60 Shyam Iyer 2010-05-13 22:28:19 UTC
I was actually compiling with this change to see if that works; I will let you know the results shortly.

Comment 61 Shyam Iyer 2010-05-14 16:55:14 UTC
So far so good... I have been running this test for the past couple of hours and no kernel panic/trace yet.

Comment 62 Herbert Xu 2010-05-17 09:20:42 UTC
Thanks for notifying me, Andy.  I am working on this upstream.  The synchronisation will be added at both the addrconf layer (to prevent races between administrative actions and remotely triggered actions), and at the ndisc layer.

I expect to resolve the original problem this week, followed by a general audit of ndisc.c.
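
To illustrate the kind of race described above, here is a minimal user-space sketch, not the upstream patch. It assumes a hypothetical per-address state protected by a lock, so that when an administrative delete and a remotely triggered DAD failure hit the same address, only one of them performs the teardown. All type, state, and function names are invented for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical address states; names are illustrative only. */
enum addr_state { ADDR_DAD, ADDR_UP, ADDR_DEAD };

struct addr {
    pthread_mutex_t state_lock;   /* protects state */
    enum addr_state state;
    int teardown_count;           /* how many times teardown actually ran */
};

void addr_init(struct addr *a)
{
    assert(a != NULL);
    pthread_mutex_init(&a->state_lock, NULL);
    a->state = ADDR_DAD;
    a->teardown_count = 0;
}

/*
 * Move the address to ADDR_DEAD at most once. Returns true only for the
 * caller that made the transition, so only that caller runs teardown.
 */
static bool addr_claim_teardown(struct addr *a)
{
    bool won = false;

    pthread_mutex_lock(&a->state_lock);
    if (a->state != ADDR_DEAD) {
        a->state = ADDR_DEAD;
        won = true;
    }
    pthread_mutex_unlock(&a->state_lock);
    return won;
}

/* Remotely triggered path: a duplicate address was detected. */
bool addr_dad_failure(struct addr *a)
{
    if (!addr_claim_teardown(a))
        return false;             /* someone else already tore it down */
    a->teardown_count++;
    return true;
}

/* Administrative path: the operator deletes the address. */
bool addr_admin_delete(struct addr *a)
{
    if (!addr_claim_teardown(a))
        return false;
    a->teardown_count++;
    return true;
}
```

Whichever path claims the transition first does the teardown; the other sees `ADDR_DEAD` and returns without touching the address again.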

Comment 63 Andy Gospodarek 2010-05-17 16:21:30 UTC
Thank you too, Herbert.  Most of the ipv6 code is outside my wheel-house, so I (as well as those at Dell) appreciate you looking at this.

I'm going to go ahead and assign this to you since it will likely be your fix that ultimately resolves this.

Comment 64 Shyam Iyer 2010-05-20 17:46:11 UTC
Herbert,

Thanks for posting the patches upstream.

I am going to run them here in this isolated network to simulate the..

Thanks,
Shyam

Comment 65 Shyam Iyer 2010-07-13 22:13:29 UTC
Herbert,

I backported the upstream patch set and it fixed the problem here.

Were you also working on the backport of the issue?
If not, I will be happy to take the issue, as I was just going to brew build it for review.

Thanks,
Shyam

Comment 66 Herbert Xu 2010-07-14 00:25:18 UTC
Shyam, feel free to post them for review.  Thanks!

Comment 67 Shyam Iyer 2010-07-14 16:35:22 UTC
Created attachment 431837 [details]
Use states in DAD logic to prevent the IPv6 lockups

Comment 69 Andy Gospodarek 2010-07-14 19:30:25 UTC
Shyam, let me know if you cannot see comment #68.

Comment 70 Shyam Iyer 2010-07-14 19:40:02 UTC
Yes.. I can't see comment #68

Comment 71 RHEL Program Management 2010-07-15 18:59:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 73 Jarod Wilson 2010-08-02 21:47:50 UTC
in kernel-2.6.18-210.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 75 Chris Ward 2010-11-09 13:32:52 UTC
~~ Attention Customers and Partners - RHEL 5.6 Public Beta is now available on RHN ~~

A fix for this 'OtherQA' BZ should be present and testable in the release. 

If this Bugzilla is verified as resolved, please update the Verified field above with an appropriate value and include a summary of the testing executed and the results obtained.

If you encounter any issues or have questions while testing, please describe them and set this bug into NEED_INFO. 

If you encounter new defects or have additional patches to request for inclusion, promptly escalate the new issues through your support representative.

Finally, future Beta kernels can be found here:
 http://people.redhat.com/jwilson/el5/

Note: Bugs with the 'OtherQA' keyword require Third-Party testing to confirm the request has been properly addressed. See: https://bugzilla.redhat.com/describekeywords.cgi#OtherQA

Comment 77 Raghavendra Biligiri 2010-12-16 10:20:28 UTC
Patch mentioned in comment#67 has been included in RHEL5.6-Snapshot4.

Comment 79 errata-xmlrpc 2011-01-13 20:52:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

