Bug 216895

Summary:

BUG: bringing up balanced-alb mode bond network interface

Product:

Red Hat Enterprise Linux 4

Reporter:

jordan hargrave <jordan_hargrave>

Component:

kernel

Assignee:

Andy Gospodarek <agospoda>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

urgent

Docs Contact:

Priority:

medium

Version:

4.4

CC:

jbaron, jfeeney, jordan_hargrave, linville, peterm, wwlinuxengineering

Target Milestone:

---

Keywords:

Regression

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

RHBA-2007-0304

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2007-05-08 04:14:55 UTC

Bug Depends On:

210577

Bug Blocks:

200936

Attachments:

Description	Flags
diff output	none
bonding hang rhel4.5 beta1 kernel.	none
bonding driver workqueue based.	none
proposed patch	none
bonding module's modinfo output	none
tg3 module's modinfo output	none

Description jordan hargrave 2006-11-22 16:13:52 UTC

+++ This bug was initially created as a clone of Bug #210577 +++

Description of problem:
When bringing up a bond interface in balanced-alb mode, I'm greeted with a ton
of BUG: spew. See attached file for full text dump.

Notes:
1) The same bond comes up fine in active-backup mode.
2) Happens with both a 2x 3c59x/1x e100 setup and a 3x e1000 setup
3) The bond actually does come up and function, despite the spew

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2725.el5
initscripts-8.44-1[*]

[*] same results with and without patch from bug 202443 applied

How reproducible:
Configure a bond interface in balanced-alb mode and fire it up.

# modprobe.conf bits
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
alias eth0 e1000
alias eth1 e1000
alias eth2 e1000

# Intel Corporation 82541PI Gigabit Ethernet Controller
DEVICE=eth0
ONBOOT=yes
HWADDR=00:0e:0c:b3:a0:79
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

# Intel Corporation 82541PI Gigabit Ethernet Controller
DEVICE=eth1
ONBOOT=yes
HWADDR=00:0e:0c:b3:a0:64
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

# Intel Corporation 82541PI Gigabit Ethernet Controller
DEVICE=eth2
ONBOOT=yes
HWADDR=00:0e:0c:b3:a0:70
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

# Bonding device
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=dhcp
TYPE=Bonding
USERCTL=no

-- Additional comment from jwilson on 2006-10-12 17:53 EST --
Created an attachment (id=138385)
Spew from dmesg after bond0 bring-up


-- Additional comment from agospoda on 2006-10-13 14:38 EST --
It seems that since alloc_skb() is called with the flag GFP_KERNEL in
rtmsg_ifinfo() we get this error.

-- Additional comment from agospoda on 2006-10-13 14:58 EST --
Looks like this is basically a problem upstream as well.  alloc_skb isn't called
directly from rtmsg_ifinfo -- there is an additional call to nlmsg_new thrown in
for fun -- but it still has the GFP_KERNEL flag.  I haven't verified it yet with
testing but it seems like the same should happen.

-- Additional comment from agospoda on 2006-10-13 17:46 EST --
So having the wrong flag in alloc_skb is problematic, but it's not the real
issue.  The complaint seems to be that we are making calls that might sleep in
the current (atomic) context.  More to come next week....

-- Additional comment from agospoda on 2006-10-17 12:34 EST --
The fact that in_atomic returns true:

BUG: sleeping function called from invalid context at mm/slab.c:2948
in_atomic():1, irqs_disabled():0

Seems to be the real culprit.  Here's the definition for it:

# define in_atomic()    ((preempt_count() & ~PREEMPT_ACTIVE) != 0)

I put a git kernel from last week on the box (2.6.19-rc1) and it seems to have
the same issue when using a RHEL5 config.  I'm not sure I understand the
significance of AND'ing the current preemption count with ~PREEMPT_ACTIVE, but
I'll poke around some more and see what I find.  It does seem like the intent is
to require that all operations are atomic since preemption is disabled in
softirq code.

I've also noticed that we could consider setting CONFIG_PREEMPT_BKL=n and see if
that has an effect.  Will test later today or tomorrow.
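
For anyone else puzzling over the mask, here is a minimal userspace sketch of what the quoted macro computes. The constant values are assumptions chosen for illustration (the real ones are architecture- and version-specific), and model_in_atomic() is a hypothetical name that takes the count as a parameter instead of reading it from the current task. The idea, as far as I understand it, is that PREEMPT_ACTIVE only marks a preemption in progress, so it is masked out, while any preempt-disable, softirq, or hardirq nesting left in the count makes the context atomic.

```c
#include <stdbool.h>

/* Illustrative constants only: the real values live in the kernel headers
 * and vary by architecture and kernel version. */
#define MODEL_PREEMPT_ACTIVE 0x10000000u /* marker bit: preemption in progress */
#define MODEL_SOFTIRQ_OFFSET 0x00000100u /* one level of softirq nesting */

/* Same expression as the quoted macro, with the count passed in explicitly. */
bool model_in_atomic(unsigned int preempt_count)
{
    return (preempt_count & ~MODEL_PREEMPT_ACTIVE) != 0;
}
```

With these assumed values, a count that holds only the PREEMPT_ACTIVE bit is not atomic, while a count with one softirq level (the bond_mii_monitor timer case in the traces) is.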



-- Additional comment from linville on 2006-10-23 15:15 EST --
*** Bug 207443 has been marked as a duplicate of this bug. ***

-- Additional comment from agospoda on 2006-10-24 14:40 EST --
There are really 2 parts to this problem.

1 - Code in rtnetlink.c needs to create messages with the GFP_ATOMIC flag rather
than GFP_KERNEL.  This is easy and is already done.

2 - ASSERT_RTNL needs to be converted to make atomic operations available.  This
will be slightly more time consuming and is not done yet.  I have verified with
some testing on upstream kernels last week that this problem still exists and
that preventing ASSERT_RTNL calls from being made unless absolutely necessary
also resolves it.  Recoding ASSERT_RTNL is a better long-term solution.
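
Part 1 can be illustrated with a toy userspace model of the debug check behind the "sleeping function called from invalid context" message. This is only a sketch: the flag values and the name model_alloc_would_warn() are made up for illustration, and the only property carried over from the kernel is that GFP_KERNEL permits the allocator to wait (sleep) while GFP_ATOMIC does not. Part 2 (ASSERT_RTNL) is not covered here.

```c
#include <stdbool.h>

/* Hypothetical flag bits for the model; not the kernel's real values. */
#define MODEL_GFP_WAIT   0x10u /* allocator may sleep */
#define MODEL_GFP_HIGH   0x20u /* may dip into emergency reserves */
#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH)
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | 0x40u | 0x80u)

/* True when the request could sleep while the caller is in atomic context,
 * which is the situation the BUG: message complains about. */
bool model_alloc_would_warn(unsigned int gfp_flags, bool in_atomic)
{
    return in_atomic && (gfp_flags & MODEL_GFP_WAIT) != 0;
}
```

In this model, a GFP_KERNEL request from the softirq path warns, and the same request with GFP_ATOMIC does not.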


-- Additional comment from pm-rhel on 2006-10-27 14:00 EST --
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

-- Additional comment from agospoda on 2006-10-31 10:37 EST --
Created an attachment (id=139866)
bz210577-upstream.patch

If I apply the attached patch, the output changes so that the complaints are no
longer related to allocations of type GFP_KERNEL; they now reference the fact
that rtnl_lock is taken in a bh/soft_irq context.  Here's the new log.

BUG: sleeping function called from invalid context at kernel/mutex.c:86
in_atomic():1, irqs_disabled():0
 [<c02c5574>] mutex_lock+0x15/0x23
 [<c0102d42>] common_interrupt+0x1a/0x20
 [<c0272bd2>] netdev_run_todo+0x10/0x1e7
 [<c02a7507>] inetdev_event+0x29/0x37e
 [<c01b932b>] _raw_spin_lock+0xb2/0xca
 [<c02c616d>] _spin_unlock_bh+0x5/0xd
 [<c0285b96>] rt_run_flush+0x65/0x8c
 [<c02c7693>] notifier_call_chain+0x19/0x29
 [<c0270b64>] dev_set_mac_address+0x46/0x4b
 [<f8bac1d7>] alb_set_slave_mac_addr+0x5a/0x7f [bonding]
 [<f8ba60a9>] bond_update_speed_duplex+0x2f/0xd7 [bonding]
 [<f8bac5bf>] alb_swap_mac_addr+0x88/0x134 [bonding]
 [<f8ba7b04>] bond_change_active_slave+0x185/0x29d [bonding]
 [<f8ba7eee>] bond_select_active_slave+0xa5/0xd5 [bonding]
 [<f8ba998a>] bond_mii_monitor+0x376/0x3be [bonding]
 [<f8ba9614>] bond_mii_monitor+0x0/0x3be [bonding]
 [<c011a45c>] run_timer_softirq+0xea/0x13c
 [<c0117cd7>] __do_softirq+0x35/0x75
 [<c0117d39>] do_softirq+0x22/0x26
 [<c0103e47>] do_IRQ+0x66/0x77
 [<c0102d42>] common_interrupt+0x1a/0x20
 [<c01ec211>] acpi_processor_idle+0x1a8/0x33b
 [<c0101a6f>] cpu_idle+0x39/0x4e
 [<c037a61c>] start_kernel+0x285/0x287
 =======================
RTNL: assertion failed at net/ipv4/devinet.c (1054)
 [<c02a7520>] inetdev_event+0x42/0x37e
 [<c01b932b>] _raw_spin_lock+0xb2/0xca
 [<c02c616d>] _spin_unlock_bh+0x5/0xd
 [<c0285b96>] rt_run_flush+0x65/0x8c
 [<c02c7693>] notifier_call_chain+0x19/0x29
 [<c0270b64>] dev_set_mac_address+0x46/0x4b
 [<f8bac1d7>] alb_set_slave_mac_addr+0x5a/0x7f [bonding]
 [<f8ba60a9>] bond_update_speed_duplex+0x2f/0xd7 [bonding]
 [<f8bac5bf>] alb_swap_mac_addr+0x88/0x134 [bonding]
 [<f8ba7b04>] bond_change_active_slave+0x185/0x29d [bonding]
 [<f8ba7eee>] bond_select_active_slave+0xa5/0xd5 [bonding]
 [<f8ba998a>] bond_mii_monitor+0x376/0x3be [bonding]
 [<f8ba9614>] bond_mii_monitor+0x0/0x3be [bonding]
 [<c011a45c>] run_timer_softirq+0xea/0x13c
 [<c0117cd7>] __do_softirq+0x35/0x75
 [<c0117d39>] do_softirq+0x22/0x26
 [<c0103e47>] do_IRQ+0x66/0x77
 [<c0102d42>] common_interrupt+0x1a/0x20
 [<c01ec211>] acpi_processor_idle+0x1a8/0x33b
 [<c0101a6f>] cpu_idle+0x39/0x4e
 [<c037a61c>] start_kernel+0x285/0x287

I plan to take the attached patch upstream today for comments on it and the
rest of this problem.  These messages will go away with the GOLD (and maybe
Beta2) build since we will compile without:

CONFIG_DEBUG_SPINLOCK_SLEEP=y

in the default config, but I'd like to eliminate these problems anyway.

-- Additional comment from amit_bhutani on 2006-10-31 15:21 EST --
Adding Jordan for his $0.02

-- Additional comment from jordan_hargrave on 2006-11-02 10:21 EST --
The monitoring threads should be moved to workqueues instead of timers.
However, all spin/read lock functions should then be switched to the _bh forms;
otherwise the network bottom halves can run to transmit a packet while a lock is
held, causing a system hang.
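
A minimal sketch of the deferral idea, in plain userspace C rather than kernel code: the timer side only records that work is needed, and a separate step that runs where blocking is allowed carries it out. All names here (defer_work, run_pending_work, model_failover) are hypothetical, the queue is single-threaded with no locking, and this is not the kernel workqueue API.

```c
#include <stddef.h>

#define MAX_PENDING 8

typedef void (*work_fn)(void *arg);

struct work_item {
    work_fn fn;
    void *arg;
};

static struct work_item pending[MAX_PENDING];
static size_t pending_count;

/* "Timer" side (atomic context in the real driver): never blocks,
 * it only records the work to be done later. */
int defer_work(work_fn fn, void *arg)
{
    if (pending_count == MAX_PENDING)
        return -1; /* queue full; the caller decides how to react */
    pending[pending_count].fn = fn;
    pending[pending_count].arg = arg;
    pending_count++;
    return 0;
}

/* "Worker" side (process context in the real driver), where a sleeping
 * operation such as a MAC address change would be safe. */
size_t run_pending_work(void)
{
    size_t ran = 0;

    for (size_t i = 0; i < pending_count; i++) {
        pending[i].fn(pending[i].arg);
        ran++;
    }
    pending_count = 0;
    return ran;
}

/* Stand-in for a failover step that may sleep; it just counts calls. */
void model_failover(void *arg)
{
    int *failovers = arg;

    (*failovers)++;
}
```

Queuing model_failover from the timer side leaves the counter untouched until run_pending_work() is called, which is the separation the workqueue conversion is meant to provide.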


-- Additional comment from amit_bhutani on 2006-11-03 00:07 EST --
Bumping the severity up based on comment #11.

-- Additional comment from sbenjamin on 2006-11-03 10:27 EST --
Need DEV/QE ACKS on this high priority bug to pull it into rhel5.
Andy, please review Dell (Jordan's) comments. 

Thanks.

-- Additional comment from agospoda on 2006-11-13 17:26 EST --
I have converted mii monitoring to use work_queues on upstream kernels and hope
to have it fully tested and ready for submission later this week.  I want to do
some profiling first and make sure we will not see a significant drop in the
time it takes to respond to link-down events by moving to a work_queue.  

I will post more updates to the ticket as well as links to rhel5 test kernels
when I've got one to test.

-- Additional comment from agospoda on 2006-11-17 17:01 EST --
I've made some more progress, but I want to do some more testing because I've
started to see some problems with the porting I did for the sysfs portion of the
bonding code.  

Just so it's clear -- everyone should know that the messages won't show up in
anything after beta1 since beta2 has several of the CONFIG_DEBUG spinlock
options turned off.  That is NOT an attempt to trivialize this problem by any
means, but the cosmetic effect of it will not appear in beta2 and beyond.

Comment 1 jordan hargrave 2006-11-22 16:15:51 UTC

Note.. same issue occurs with RHEL4.4 kernel as with RHEL5 kernel.  This issue
is for tracking RHEL4.4

Comment 2 Charles Rose 2006-11-30 15:12:31 UTC

This is a High severity issue for Dell. Request Urgent action on this one.

Comment 3 Andy Gospodarek 2006-11-30 21:59:28 UTC

Can you please be more specific about what you are seeing on rhel4.4?  There are
actually 2 errors in BZ 210577.

Ones like these:

BUG: sleeping function called from invalid context at mm/slab.c:2948
in_atomic():1, irqs_disabled():0
[<c04051ed>] show_trace_log_lvl+0x58/0x16a
[<c04057fa>] show_trace+0xd/0x10
[<c0405913>] dump_stack+0x19/0x1b
[<c041db63>] __might_sleep+0x8d/0x95
[<c0470446>] kmem_cache_alloc+0x28/0xb5
[<c05b5015>] __alloc_skb+0x2c/0xfa
[<c05c04c7>] rtmsg_ifinfo+0x21/0x6e
[<c05c054c>] rtnetlink_event+0x38/0x3c
[<c0615a2d>] notifier_call_chain+0x20/0x31
[<c0430a84>] raw_notifier_call_chain+0x8/0xa
[<c05b8011>] dev_set_mac_address+0x48/0x4e
[<f8d06f8c>] alb_set_slave_mac_addr+0x5d/0x83 [bonding]
[<f8d07725>] bond_alb_handle_active_change+0xb3/0xc9 [bonding]
[<f8d02b0f>] bond_change_active_slave+0x1a7/0x298 [bonding]
[<f8d0357b>] bond_select_active_slave+0x99/0xce [bonding]
[<f8d04886>] bond_mii_monitor+0x364/0x3ab [bonding]
[<c042da4b>] run_timer_softirq+0x108/0x167
[<c04290b3>] __do_softirq+0x78/0xf2
[<c0406683>] do_softirq+0x5a/0xbe
[<c0428f5c>] irq_exit+0x3d/0x3f
[<c04179cf>] smp_apic_timer_interrupt+0x73/0x78
[<c0404b12>] apic_timer_interrupt+0x2a/0x30

And ones like these:

RTNL: assertion failed at net/core/fib_rules.c (388)
[<c04051ed>] show_trace_log_lvl+0x58/0x16a
[<c04057fa>] show_trace+0xd/0x10
[<c0405913>] dump_stack+0x19/0x1b
[<c05c553d>] fib_rules_event+0x34/0xeb
[<c0615a2d>] notifier_call_chain+0x20/0x31
[<c0430a84>] raw_notifier_call_chain+0x8/0xa
[<c05b8011>] dev_set_mac_address+0x48/0x4e
[<f8d06f8c>] alb_set_slave_mac_addr+0x5d/0x83 [bonding]
[<f8d07725>] bond_alb_handle_active_change+0xb3/0xc9 [bonding]
[<f8d02b0f>] bond_change_active_slave+0x1a7/0x298 [bonding]
[<f8d0357b>] bond_select_active_slave+0x99/0xce [bonding]
[<f8d04886>] bond_mii_monitor+0x364/0x3ab [bonding]
[<c042da4b>] run_timer_softirq+0x108/0x167
[<c04290b3>] __do_softirq+0x78/0xf2
[<c0406683>] do_softirq+0x5a/0xbe
[<c0428f5c>] irq_exit+0x3d/0x3f
[<c04179cf>] smp_apic_timer_interrupt+0x73/0x78
[<c0404b12>] apic_timer_interrupt+0x2a/0x30

I suspect it is the latter, but I'd like to be sure.

Comment 4 jordan hargrave 2006-12-06 02:46:08 UTC

We are seeing both messages; they are caused when dev_set_mac_address is called
from the bond_mii_monitor thread.

Comment 5 Andy Gospodarek 2006-12-07 22:02:02 UTC

You are seeing this message?

BUG: sleeping function called from invalid context at mm/slab.c:2948
in_atomic():1, irqs_disabled():0

Comment 6 Andy Gospodarek 2006-12-07 22:57:50 UTC

Please just post the exact messages you are seeing?  I don't doubt that you
might be seeing something similar to the rhel5 issue but the bonding code for
rhel4 isn't the same as the code for rhel5 so there should be a different
backtrace.  Thanks.

Comment 7 Charles Rose 2007-01-24 14:22:59 UTC

We will post the messages shortly.

Comment 8 Shyam kumar Iyer 2007-02-17 11:44:01 UTC

Andy,
     I was able to reproduce the hang. It is not the rtnl lock though. It's
simply because we are scheduling in atomic context (sounds familiar, right?).
The culprit this time is new code that is not in the native R4.4 tg3 driver
but is in the version 3.70 tg3 dkms driver that I am testing the R4.4
bonding driver with.
     The new code is the full chip reset, which is done if ASF is enabled so
that the driver's MAC address is not overwritten by the ASF MAC address.
     The new code is present in R5, so we can easily reproduce it there, but it
is not present in the native R4.4 code, so we are not able to reproduce the hang
with the native tg3 driver.

The trace for your reference----



ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team
bad: scheduling while atomic!

Call Trace:<IRQ> <ffffffff80359a9f>{schedule+72} <ffffffff8013adad>
{printk+141} 
       <ffffffff8035b62f>{schedule_timeout+410} <ffffffff8014545f>
{process_timeout+0} 
       <ffffffffa00dafba>{:tg3:tg3_restart_hw+12} <ffffffffa00db5c6>
{:tg3:tg3_set_mac_addr+208} 
       <ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68} 
       <ffffffffa00d1000>{:tg3:tg3_write32+0} <ffffffffa00c5aba>
{:bonding:alb_swap_mac_addr+161} 
       <ffffffffa00bef81>{:bonding:bond_change_active_slave+520} 
       <ffffffffa00bf81d>{:bonding:bond_mii_monitor+944} <ffffffff80144d1d>
{run_timer_softirq+591} 
       <ffffffff80140164>{__do_softirq+76} <ffffffff801401eb>{do_softirq+49} 
       <ffffffff80113d8f>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e886>{mwait_idle+85} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff8054b6f8>{start_kernel+637} <ffffffff8054b1ab>{_sinittext+427} 
       
bad: scheduling from the idle thread!

Call Trace:<IRQ> <ffffffff80359ae9>{schedule+146} <ffffffff8013adad>
{printk+141} 
       <ffffffff8035b62f>{schedule_timeout+410} <ffffffff8014545f>
{process_timeout+0} 
       <ffffffffa00dafba>{:tg3:tg3_restart_hw+12} <ffffffffa00db5c6>
{:tg3:tg3_set_mac_addr+208} 
       <ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68} 
       <ffffffffa00d1000>{:tg3:tg3_write32+0} <ffffffffa00c5aba>
{:bonding:alb_swap_mac_addr+161} 
       <ffffffffa00bef81>{:bonding:bond_change_active_slave+520} 
       <ffffffffa00bf81d>{:bonding:bond_mii_monitor+944} <ffffffff80144d1d>
{run_timer_softirq+591} 
       <ffffffff80140164>{__do_softirq+76} <ffffffff801401eb>{do_softirq+49} 
       <ffffffff80113d8f>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e886>{mwait_idle+85} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff8054b6f8>{start_kernel+637} <ffffffff8054b1ab>{_sinittext+427} 
       
bad: scheduling while atomic!

Call Trace:<IRQ> <ffffffff80359a9f>{schedule+72} <ffffffff8035b62f>
{schedule_timeout+410} 
       <ffffffff8014545f>{process_timeout+0} <ffffffffa00dafba>
{:tg3:tg3_restart_hw+12} 
       <ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208} <ffffffffa00c59f5>
{:bonding:alb_set_slave_mac_addr+68} 
       <ffffffffa00d1000>{:tg3:tg3_write32+0} <ffffffffa00c5aba>
{:bonding:alb_swap_mac_addr+161} 
       <ffffffffa00bef81>{:bonding:bond_change_active_slave+520} 
       <ffffffffa00bf81d>{:bonding:bond_mii_monitor+944} <ffffffff80144d1d>
{run_timer_softirq+591} 
       <ffffffff80140164>{__do_softirq+76} <ffffffff801401eb>{do_softirq+49} 
       <ffffffff80113d8f>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e886>{mwait_idle+85} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff8054b6f8>{start_kernel+637} <ffffffff8054b1ab>{_sinittext+427} 
       
bad: scheduling from the idle thread!

Call Trace:<IRQ> <ffffffff80359ae9>{schedule+146} <ffffffff8035b62f>
{schedule_timeout+410} 
       <ffffffff8014545f>{process_timeout+0} <ffffffffa00dafba>
{:tg3:tg3_restart_hw+12} 
       <ffffffffa00db5c6>{:tg3:tg3_set_mac_addr+208} <ffffffffa00c59f5>
{:bonding:alb_set_slave_mac_addr+68} 
       <ffffffffa00d1000>{:tg3:tg3_write32+0} <ffffffffa00c5aba>
{:bonding:alb_swap_mac_addr+161} 
       <ffffffffa00bef81>{:bonding:bond_change_active_slave+520} 
       <ffffffffa00bf81d>{:bonding:bond_mii_monitor+944} <ffffffff80144d1d>
{run_timer_softirq+591} 
       <ffffffff80140164>{__do_softirq+76} <ffffffff801401eb>{do_softirq+49} 
       <ffffffff80113d8f>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e886>{mwait_idle+85} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff8054b6f8>{start_kernel+637} <ffffffff8054b1ab>{_sinittext+427} 
       
Unable to handle kernel paging request at 0000000000200200 RIP: 
<ffffffff801341a4>{dequeue_task+18}
PML4 31ec1067 PGD 31736067 PMD 0 
Oops: 0002 [1] 
CPU 0 
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) 
i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) 
dm_multipath(U) button(U) battery(U) ac(U) usb_storage(U) uhci_hcd(U) ehci_hcd
(U) tg3(U) bonding(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) ext3(U) jbd(U) 
dm_mod(U) ata_piix(U) libata(U) aacraid(U) sd_mod(U) scsi_mod(U)
Pid: 0, comm: swapper Not tainted 2.6.9-prep
RIP: 0010:[<ffffffff801341a4>] <ffffffff801341a4>{dequeue_task+18}
RSP: 0018:ffffffff804adc20  EFLAGS: 00010097
RAX: 0000000000100100 RBX: ffffffff80427880 RCX: ffffffff804278b8
RDX: 0000000000200200 RSI: 0000000000000000 RDI: ffffffff80427880
RBP: ffffffff804adc20 R08: 0000000000000005 R09: ffffffff8039241e
R10: 0000ffff80454060 R11: 0000ffff80454060 R12: 00000000ffff7e18
R13: 0000000000000000 R14: ffffffff804bca20 R15: 0000005895d994b3
FS:  0000000000000000(0000) GS:ffffffff80544d00(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80548000, task ffffffff80427880)
Stack: ffffffff804adc38 ffffffff801343ec 00000000ffff7e18 ffffffff804adcb8 
       ffffffff80359cd1 00000000080303fe 0000000000000246 ffffffff80427880 
       000000003b9aca00 ffffffff80427880 
Call Trace:<IRQ> <ffffffff801343ec>{deactivate_task+37} <ffffffff80359cd1>
{schedule+634} 
       <ffffffff8035b62f>{schedule_timeout+410} <ffffffff8014545f>
{process_timeout+0} 
       <ffffffffa00dafba>{:tg3:tg3_restart_hw+12} <ffffffffa00db5c6>
{:tg3:tg3_set_mac_addr+208} 
       <ffffffffa00c59f5>{:bonding:alb_set_slave_mac_addr+68} 
       <ffffffffa00d1000>{:tg3:tg3_write32+0} <ffffffffa00c5aba>
{:bonding:alb_swap_mac_addr+161} 
       <ffffffffa00bef81>{:bonding:bond_change_active_slave+520} 
       <ffffffffa00bf81d>{:bonding:bond_mii_monitor+944} <ffffffff80144d1d>
{run_timer_softirq+591} 
       <ffffffff80140164>{__do_softirq+76} <ffffffff801401eb>{do_softirq+49} 
       <ffffffff80113d8f>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e886>{mwait_idle+85} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff8054b6f8>{start_kernel+637} <ffffffff8054b1ab>{_sinittext+427} 
       

Code: 48 89 02 48 89 50 08 48 c7 41 08 00 02 20 00 8b 4f 2c 48 c7 
RIP <ffffffff801341a4>{dequeue_task+18} RSP <ffffffff804adc20>
CR2: 0000000000200200
 <0>Kernel panic - not syncing: Oops

Comment 9 Shyam kumar Iyer 2007-02-19 06:06:34 UTC

Created attachment 148312 [details]
diff output

This is the diff output of the 3.71 tg3 driver from broadcom and the native tg3
driver in R4.4

Comment 10 Andy Gospodarek 2007-02-19 13:55:09 UTC

Shyam,

Thanks for posting that info.  The tg3 problem is known upstream and a bugzilla
has been opened there:

http://bugzilla.kernel.org/show_bug.cgi?id=7974

Comment 11 John Feeney 2007-02-23 22:36:11 UTC

Per conversation with Charles Rose at Dell on 2/22/2007, Dell wants
this bugzilla to be a blocker of RHEL4-5 because it is a regression.
In addition, when bringing up a bonded interface in balanced-alb mode, 
the user sees "a ton of BUG spew." This is seen as a significant call 
generator for Dell. This bug is analogous to bz210577 which was 
deemed a blocker as well as a Day 0 Errata for the release it was 
associated with which should indicate the level of attention it 
has received.

The patch is most likely in the kernel. I am working on getting
the other appropriate flags set.

Comment 12 Andy Gospodarek 2007-02-23 23:31:12 UTC

Does this happen with the current 4.5 beta kernel? 

I don't see how this is a regression if it happens with 4.4 + dkms tg3 driver,
but doesn't happen with the 4.4 tg3 driver.

Comment 14 Shyam kumar Iyer 2007-03-09 12:41:39 UTC

Just as I feared after looking at the tg3 code in the rhel4.5 beta kernel, I do
get the hang in the rhel4.5 bond setup now.
So, as I said in comment 8, it is the ASF code added in the newer
tg3 drivers that is aggravating the bonding design problem. Attaching the
serial trace.

Comment 15 Shyam kumar Iyer 2007-03-09 12:44:03 UTC

Created attachment 149677 [details]
bonding hang rhel4.5 beta1 kernel.

kernel version = 2.6.18-8.el5

Comment 16 Shyam kumar Iyer 2007-03-09 12:47:42 UTC

Sorry the kernel version is 2.6.9-48.EL.x86_64 in comment no.15.

Comment 17 Charles Rose 2007-03-14 10:35:53 UTC

Shyam has provided the required info.

Comment 18 Andy Gospodarek 2007-03-15 16:11:56 UTC

Agreed.  

Good progress is being made upstream on this issue so I hope to have something
that we can use for both rhel4 and rhel5 soon.  

The best way for you to help is to continue to test kernels for rhel5 so we can
find as many problems as possible.

Comment 19 Shyam kumar Iyer 2007-03-16 14:07:37 UTC

Yes, but Dell is shipping a lot of servers with rhel4, and with the current
bug being introduced, this needs equal attention as well.

Comment 20 Andy Gospodarek 2007-03-16 14:33:01 UTC

(In reply to comment #19)
> Yes, but Dell is shipping a lot of servers with rhel4, and with the current
> bug being introduced, this needs equal attention as well.

What version of the tg3 driver are you now giving to your customers to use on
rhel4.4 and earlier systems that creates this problem?

Comment 21 Shyam kumar Iyer 2007-03-16 14:50:16 UTC

Version 3.71

Comment 22 Shyam kumar Iyer 2007-03-22 11:40:36 UTC

In response to comment no.12 since this was not seen on r4.4 kernel and is 
being seen in r4.5 (native drivers only) this is a regression.

Comment 23 Samuel Benjamin 2007-03-23 02:08:52 UTC

This problem can be recreated easily with Broadcom NICs. In this RHEL4 issue,
the server experiences a kernel panic if the link goes down. This is far more
serious than what happens in RHEL5.

Hence Dell is seeking an exception for this case and has requested us to
initiate this for 4.5.

Comment 24 Samuel Benjamin 2007-03-26 15:28:45 UTC

Per communication with engineering, the fix for this problem is still under
discussion, and as soon as we have a working fix, we will provide it to Dell for
testing. When such a fix is confirmed to resolve the problem, engineering will
evaluate the best possible mechanism to release it (hotfix, kernel errata, etc.).

-------- Original Message --------
Subject: 	Bug 216895: BUG: bringing up balanced-alb mode bond network interface
Date: 	Thu, 22 Mar 2007 17:14:10 +0530
From: 	<Charles_Rose>
To: 	<sbenjamin>, <ltroan>, <jfeeney>,
<rhentosh>
CC: 	<Shyam_Iyer>, <Jordan_Hargrave>

Sammy,

This issue is a blocker for Dell. We cannot have bonding broken in a way
that can cause a kernel panic on a live OS. Request you to get this
issue the right attention inside Red Hat. I agree that this is being
worked on in the community, but we would like to see a fix in RHEL 4.5.

Thanks,
Charles Rose
Linux Engineering
Dell Product Group
Bangalore Development Center, INDIA

Comment 25 Andy Gospodarek 2007-03-29 00:51:09 UTC

I'm not a big fan of this patch, but I would like you to test kernels here:

http://people.redhat.com/agospoda/#rhel4

What I've done is back out the fix that resets any chips that have ASF enabled
only when changing the MAC address.  This is not the ideal solution, but it is
one that is possible if we can't get everything squared away with the bonding
driver in time.

I don't like the idea of causing problems with chips that possibly have ASF
enabled, but I'm not sure how many systems have this.

Please verify this patch as soon as possible.

Comment 26 John Feeney 2007-04-02 20:42:55 UTC

Since Shyam Iyer has been leading the charge on this recently, I am changing the
needinfo to him. Please respond with the test results as quickly as possible.
The deadline is approaching fast. Thank you.

Comment 27 Shyam kumar Iyer 2007-04-03 13:08:16 UTC

Created attachment 151555 [details]
bonding driver workqueue based.

Comment 28 Shyam kumar Iyer 2007-04-03 13:12:17 UTC

Andy,

I have not had a chance to test the tg3 driver modification yet. The above
attachment is an update-agnostic bonding driver using a workqueue.
Tomorrow I will update you on two things:
1) How the tg3 patch is working.
2) The exact patch for the bonding driver for rhel4.5.

Comment 29 Andy Gospodarek 2007-04-03 14:45:12 UTC

Shyam,

You can plan to ship this code if you like, but I can guarantee that you will
see deadlocks from some customers.  Taking the rtnl_lock after read-locking the
bond-lock in mii_monitor will be problematic since there are other times (like
when calling bond_set_multicast_list) where rtnl is already held when you enter
a bonding function and you will wait to write-lock the bond-lock.  

I'm also not sure you can use _bh locks whenever doing a failover (in the case
of miimon at least) because you cannot sleep in softirq context (the current
problem that we have with the tg3 driver panicking the box).

I have heard that other distros are shipping a workqueue conversion patch much
like this (but without the extra rtnl-locks in the monitoring functions) and
reports are already coming in that their users are seeing deadlocks, so we need
to be careful.

-andy

Comment 30 Shyam kumar Iyer 2007-04-04 03:07:02 UTC

>I'm also not sure you can use _bh locks whenever doing a failover (in the case
>of miimon at least) because you cannot sleep in softirq context (the current
>problem that we have with the tg3 driver panicking the box).

Well, I can safely remove the bh write locks during failover. I have confirmed
that in my testing. They were a last-minute inclusion because of a lockup on
the read_lock(&bond->curr_slave_lock).

So does it now boil down to one problem? (rtnl_lock and multicast_list)

I will also try reproducing this issue. Did you get a chance to get a trace of 
rtnl_lock and the bond_set_multicast_list lockup?

Comment 31 Shyam kumar Iyer 2007-04-04 04:10:33 UTC

Andy,
Looking back at the code, I see what you are saying about the lockup. I
made a subtle change which will eliminate the deadlock.

->bond_mii_mon calls read_lock(&bond->lock).
->bond_set_multicast is called with rtnl_lock held and tries a write lock - 
fails.
->bond_mii_mon calls rtnl_lock for failover() - fails. 

Deadlock.

Now, my solution.
In mii_mon.
1) Call rtnl_lock first.
2) Then call read_lock.

Only if bond_set_multicast_list gets the rtnl_lock will it get the
write_lock(&bond->lock).

At most, rtnl_lock assertions will fail sometimes, which is better than
deadlocks.

Comment 32 Shyam kumar Iyer 2007-04-04 06:38:45 UTC

Some thing like this.

diff -Naru bonding-3.2.1dell/bond_main.c bonding-3.2.2dell/bond_main.c
--- bonding-3.2.1dell/bond_main.c	2007-04-03 16:03:06.000000000 +0530
+++ bonding-3.2.2dell/bond_main.c	2007-04-04 11:59:34.972060432 +0530
@@ -2572,7 +2572,11 @@
 	int do_failover = 0;
 	int delta_in_ticks;
 	int i;
-	
+
+	 /* Grab the rtnl lock here so that if you have the read_lock(&bond->lock)
+	  * and anyone else like the bond_set_multicast don't deadlock while
+	  * holding the rtnl_lock and trying to get the write_lock(&bond->lock) */
+	rtnl_lock();
 	read_lock(&bond->lock); 
 
 	delta_in_ticks = (bond->params.miimon * HZ) / 1000;
@@ -2591,10 +2595,9 @@
 	 * program could monitor the link itself if needed.
 	 */
 
-	/* balance-rr lockup without bh lock. The bond_xmit_roundrobin fails to get a write lock
-	 * if read lock is held and bond_mii_mon gets scheduled. Remember we are in process
-	 * context		
-	 */
+	/* balance-rr locksup without this bh lock. The bond_xmit_roundrobin fails to get
+	 * a write lock if read lock is held and bond_mii_mon gets scheduled. Remember we
+	 * are in process context */
 	read_lock_bh(&bond->curr_slave_lock); 
 	oldcurrent = bond->curr_active_slave;
 	read_unlock_bh(&bond->curr_slave_lock);
@@ -2788,12 +2791,9 @@
 	} /* end of for */
 
 	if (do_failover) {
-		rtnl_lock();
-		write_lock_bh(&bond->curr_slave_lock); /* balance-rr lockup problem */
+		write_lock(&bond->curr_slave_lock); /* Bh locks not necessary here*/
 		bond_select_active_slave(bond);	
-		write_unlock_bh(&bond->curr_slave_lock); 
-		rtnl_unlock();
-
+		write_unlock(&bond->curr_slave_lock); 
 		if (oldcurrent && !bond->curr_active_slave) {
 			printk(KERN_INFO DRV_NAME
 			       ": %s: now running without any active "
@@ -2812,6 +2812,7 @@
 	}
 out:
 	read_unlock(&bond->lock); 
+	rtnl_unlock();
 }
 
 static void bond_arp_send_all(struct bonding *bond, struct slave *slave)
diff -Naru bonding-3.2.1dell/dkms.conf bonding-3.2.2dell/dkms.conf
--- bonding-3.2.1dell/dkms.conf	2007-04-03 16:16:08.000000000 +0530
+++ bonding-3.2.2dell/dkms.conf	2007-04-04 07:59:09.000000000 +0530
@@ -1,4 +1,4 @@
-PACKAGE_VERSION="3.2.1dell"
+PACKAGE_VERSION="3.2.2dell"
 
 # Items below here should not have to change with each driver version
 PACKAGE_NAME="bonding"

Comment 33 Andy Gospodarek 2007-04-04 17:52:12 UTC

Shyam,

Taking the rtnl_lock during monitoring is WAAAAAAY too expensive.  It's only
needed when you are actually doing a failover.  You are correct, though, that the
rtnl lock needs to be taken before bond->lock, so the current proposal is to
do something like this:

if failover is needed
drop bond read-lock
take rtnl-lock
take bond read/write-lock 
do necessary work (maybe even slave-lock)
drop rtnl
drop bond-lock possibly later

Comment 34 Andy Gospodarek 2007-04-04 17:58:31 UTC

Created attachment 151687 [details]
proposed patch

This is a rhel4 backport of the following proposed upstream patch.

Comment 37 RHEL Program Management 2007-04-04 19:46:16 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 38 Jay Turner 2007-04-04 20:20:10 UTC

QE ack for 4.5.  Dell is really hot on this bug and since the patch came through
we can get Dell to do the majority of the validation.

Comment 40 Andy Gospodarek 2007-04-04 21:35:21 UTC

Test kernels available here:

http://people.redhat.com/agospoda/#rhel4

Comment 41 Shyam kumar Iyer 2007-04-05 09:54:04 UTC

Andy,
    The test kernel caused a kernel panic. Please see the trace below.


Setting network parameters:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface bond0:  Unable to handle kernel NULL pointer dereference 
at 00000000000002e8 RIP: 
<ffffffff802f965b>{netif_receive_skb+25}
PML4 12a401067 PGD 12aac1067 PMD 0 
Oops: 0000 [1] 
CPU 0 
Modules linked in: ds yenta_socket pcmcia_core joydev button battery ac 
uhci_hcd ehci_hcd hw_random tg3 e100 mii bonding(U) floppy ata_piix libata sg 
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod 
scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-52.EL.gtest.15
RIP: 0010:[<ffffffff802f965b>] <ffffffff802f965b>{netif_receive_skb+25}
RSP: 0018:ffffffff804b24b8  EFLAGS: 00010296
RAX: 0000000000008e88 RBX: 000001012a5b7500 RCX: 0000000000008e88
RDX: 0000000000000000 RSI: 000001012e4f50a8 RDI: 000001012a5b7500
RBP: 000001012e4f5440 R08: 000001012abd0c80 R09: 000001012fd39870
R10: 000001012a5b7500 R11: 00000000000000e4 R12: 0000000000000001
R13: 0000000000000000 R14: 000000000000003c R15: 000001012abe0000
FS:  0000000000000000(0000) GS:ffffffff8054f200(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000002e8 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80552000, task ffffffff8042af80)
Stack: 0000000000000220 000001012a5b7500 000001012e4f5440 000001012a5b7500 
       000001012e4f5440 000001012a5b7500 0000000000000000 ffffffffa0111ae1 
       0000000000000002 000001012e4f5534 
Call Trace:<IRQ> <ffffffffa0111ae1>{:tg3:tg3_poll+2448} <ffffffff802f9c9c>
{net_rx_action+305} 
       <ffffffff80140494>{__do_softirq+76} <ffffffff8014051b>{do_softirq+49} 
       <ffffffff80113e5b>{do_IRQ+664} <ffffffff8011105b>{ret_from_intr+0} 
        <EOI> <ffffffff8010e86d>{mwait_idle+60} <ffffffff8010e817>
{cpu_idle+26} 
       <ffffffff805556f8>{start_kernel+637} <ffffffff805551ab>{_sinittext+427} 
       

Code: 48 83 ba e8 02 00 00 00 0f 84 53 01 00 00 31 c0 f6 82 8c 00 
RIP <ffffffff802f965b>{netif_receive_skb+25} RSP <ffffffff804b24b8>
CR2: 00000000000002e8
 <0>Kernel panic - not syncing: Oops

Comment 42 Shyam kumar Iyer 2007-04-05 11:26:39 UTC

If I apply the patch that Michael Chan sent on top of the RHEL 4.5 kernels, 
the issue is not reproducible. Is this a build problem?

Comment 43 Andy Gospodarek 2007-04-05 12:36:36 UTC

(In reply to comment #41)
>
> Modules linked in: ds yenta_socket pcmcia_core joydev button battery ac 
> uhci_hcd ehci_hcd hw_random tg3 e100 mii bonding(U) floppy ata_piix libata sg 
> dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod 
>

Shyam, do you have one of your dkms bonding modules currently installed?  It
seems to me that would be why 'bonding(U)' appears in the list instead of
'bonding'.  Can you remove that and try again?

Comment 44 Shyam kumar Iyer 2007-04-05 13:23:23 UTC

Well, no. The dkms drivers were loaded the first time it booted, but then I 
removed them by booting into single-user mode. 
This was the first doubt I had, and I made sure I had removed all dkms 
rpms.
I guess that since uninstalling the dkms rpm removed the .ko files and put the 
original .ko back in place, the module might still be flagged as U.

I am attaching the modinfo output for the bonding driver and the tg3 driver. 
Please check the versions.
I have booted the system about 4 times with the same versions of the bonding 
and tg3 drivers.

Comment 45 Shyam kumar Iyer 2007-04-05 13:24:29 UTC

Created attachment 151755 [details]
bonding module's modinfo output

Comment 46 Shyam kumar Iyer 2007-04-05 13:25:21 UTC

Created attachment 151756 [details]
tg3 module's modinfo output

Comment 47 Andy Gospodarek 2007-04-05 14:24:28 UTC

My latest test kernel also contains this patch:

http://people.redhat.com/agospoda/rhel4/gtest/tg3-update-3_73.patch

It includes a backport of a later tg3 driver.  The problem with this
patch is right here:

@@ -3116,7 +3175,6 @@ static int tg3_alloc_rx_skb(struct tg3 *
 	if (skb == NULL)
 		return -ENOMEM;
 
-	skb->dev = tp->dev;
 	skb_reserve(skb, tp->rx_offset);
 
 	mapping = pci_map_single(tp->pdev, skb->data,

In newer upstream kernels netdev_alloc_skb() sets skb->dev = tp->dev, but in
RHEL4's tg3_compat.h we've just got netdev_alloc_skb looking like this:

#define netdev_alloc_skb(dev, len)      dev_alloc_skb(len)

I'll make a change and put out some new test kernels ASAP.
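
To illustrate why the macro above leaves `skb->dev` unset, here is a small user-space model. It is only a sketch under stated assumptions: the structs are trimmed stand-ins rather than the kernel definitions, `netdev_alloc_skb_compat`, `netdev_alloc_skb_fixed` and `rx_ifindex` are hypothetical names, and the "fixed" helper is one possible shape of the change, not necessarily what went into the test kernels. A NULL `skb->dev` would be consistent with the oops in comment 41, where RDX is 0 and the faulting address is 0x2e8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Trimmed stand-ins for the kernel structures (assumptions). */
struct net_device { int ifindex; };
struct sk_buff { struct net_device *dev; unsigned int len; };

/* Stand-in for dev_alloc_skb(): the buffer has no device attached. */
static struct sk_buff *dev_alloc_skb(unsigned int len)
{
	struct sk_buff *skb = calloc(1, sizeof(*skb));

	if (skb)
		skb->len = len;
	return skb;
}

/* Same shape as the compat macro quoted above: dev is discarded. */
#define netdev_alloc_skb_compat(dev, len) dev_alloc_skb(len)

/* Hypothetical replacement that keeps the device pointer. */
static struct sk_buff *netdev_alloc_skb_fixed(struct net_device *dev,
					      unsigned int len)
{
	struct sk_buff *skb = dev_alloc_skb(len);

	if (skb)
		skb->dev = dev;
	return skb;
}

/* Models a receive path that needs skb->dev; returns -1 instead of faulting. */
static int rx_ifindex(const struct sk_buff *skb)
{
	assert(skb != NULL);
	return skb->dev ? skb->dev->ifindex : -1;
}
```

With the compat macro, a caller that no longer assigns `skb->dev` itself (as in the hunk above) hands the stack a buffer whose `dev` is NULL; with the replacement helper the pointer is set at allocation time.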

Comment 48 Andy Gospodarek 2007-04-05 15:36:50 UTC

2.6.9-53 will show up soon here:

http://people.redhat.com/jbaron/rhel4/RPMS.kernel/

Let me know if you don't see it later today.

Comment 49 Andy Gospodarek 2007-04-05 15:43:02 UTC

Sorry I wasn't clear on this -- it will contain the suggested tg3 patch without
any of my experimental updates.

Comment 50 Larry Troan 2007-04-05 19:25:11 UTC

Dell, please test this ASAP and report results here.

Comment 51 Jason Baron 2007-04-05 19:30:24 UTC

Committed in stream U5, build 53. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/

Comment 52 Shyam kumar Iyer 2007-04-06 07:40:13 UTC

Fix verified.
Results are positive. Thanks for the speedy turnaround.

Comment 55 Issue Tracker 2007-04-19 21:31:50 UTC

A fix for this issue has been included in RHEL4.5. Please test the Release
Candidate of RHEL4.5, which was released today to Partners, and let us know
if the problem is resolved. The Release Candidate can be downloaded from
here:

ftp://partners.redhat.com/af38ac4316ba20df2dec5f990913396d

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
Resolution set to: 'RHEL 4.5'

This event sent from IssueTracker by gcase 
 issue 108058

Comment 56 Red Hat Bugzilla 2007-05-08 04:14:56 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html