Bug 467244 - On RHEL 5.2 32 bit rmmod bonding results in a kernel panic when configured in balance-tlb mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-10-16 14:50 UTC by Chris Tatman
Modified: 2018-10-20 02:55 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 19:44:05 UTC
Target Upstream Version:
Embargoed:


Attachments
/tmp/bond-fix-tx-hashtable-panic.patch (901 bytes, patch)
2008-10-16 19:46 UTC, Andy Gospodarek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Chris Tatman 2008-10-16 14:50:39 UTC
Description of problem:
On 32-bit RHEL 5.2, issuing "rmmod bonding" after closing the device with "ifconfig bond0 down" results in a kernel panic. The bond was configured in balance-tlb mode.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
Execute the steps below in a script. The bond should be up and transmitting, and a brief TCP connection should be established so that the bond has some entries in its hash tables.

modprobe bonding mode=balance-tlb miimon=100
sleep 1
ifconfig bond0 172.16.64.85 netmask 255.255.192.0
sleep 1
ifenslave bond0 eth0
ifenslave bond0 eth1
sleep 1
echo hello | nc 172.16.64.52 100
sleep 1
ifconfig bond0 down
sleep 1
rmmod bonding

Actual results:
Results in a kernel panic

Expected results:
No kernel panic.

Additional info:
1. The bond is closed before the slaves are detached. When "ifconfig bond0 down" is called, tlb_deinitialize() frees the bond's transmit hash table:

kfree(bond_info->tx_hashtbl);
bond_info->tx_hashtbl = NULL;

When "rmmod bonding" is then called, tlb_clear_slave() may attempt to access this hash table, which results in a kernel panic.

Is this a valid issue? Should this scenario be handled more gracefully, without a kernel panic?

2. The issue is not seen if the slaves are detached before the module is unloaded:

ifconfig bond0 down
echo "-eth0" > /sys/class/net/bond0/bonding/slaves
echo "-eth1" > /sys/class/net/bond0/bonding/slaves
sleep 1
rmmod bonding

3. The same behavior is seen in upstream kernel version 2.6.26.5.

Comment 1 Andy Gospodarek 2008-10-16 19:35:01 UTC
My original attempt to fix this was no good, since it produced the following warning when using alb-mode bonding:

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
bonding: In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (0 msec) is incompatible with the forwarding delay time of the switch
bonding: MII link monitoring set to 100 ms
ADDRCONF(NETDEV_UP): bond0: link is not ready
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bonding: bond0: enslaving eth0 as an active interface with a down link.
bnx2: eth1: using MSI
ADDRCONF(NETDEV_UP): eth1: link is not ready
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
bonding: bond0: link status definitely up for interface eth1.
bnx2: eth0 NIC Copper Link is Down
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
device eth1 entered promiscuous mode
device eth1 left promiscuous mode
BUG: scheduling while atomic: rmmod/0x00000100/9094
 [<c06074e7>] schedule+0x43/0x9cd
 [<f89e6b6b>] fib6_clean_node+0x11/0x6a [ipv6]
 [<c06095fe>] _write_lock_bh+0x8/0x1a
 [<f89e6629>] fib6_walk+0x69/0x6e [ipv6]
 [<f89e6654>] fib6_clean_tree+0x26/0x2a [ipv6]
 [<c0607f23>] wait_for_completion+0x6b/0x8f
 [<c042027b>] default_wake_function+0x0/0xc
 [<c0434408>] synchronize_rcu+0x2a/0x2f
 [<c0434059>] wakeme_after_rcu+0x0/0x8
 [<f8a2824a>] bond_alb_deinitialize+0x1d/0x52 [bonding]
 [<f8a22c73>] bond_release_all+0x1da/0x1f9 [bonding]
 [<f8a22cf1>] bond_free_all+0x5f/0xd2 [bonding]
 [<f8a2a3b6>] bonding_exit+0x1e/0x28 [bonding]
 [<c043e80a>] sys_delete_module+0x192/0x1b8
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
bonding: bond0: released all slaves

It seems the better option is to leave the functions where they are and check in tlb_clear_slave() whether the hash table has already been destroyed.
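
The guard described above can be illustrated with a minimal userspace sketch. This is not the attached patch and not the driver source: the struct layouts, the table size, the integer slave id and the simplified function signatures are all assumptions made for illustration, and free()/calloc() stand in for the kernel allocators. It only shows the idea that the clear routine returns early when the table pointer has already been set to NULL by the deinitialize step.

```c
#include <stdlib.h>

#define TLB_HASH_SIZE 256   /* assumed table size, for illustration only */

/* Simplified stand-ins for the driver's structures (hypothetical layout). */
struct tlb_client_info {
    int tx_slave;               /* id of the slave assigned to this entry, -1 if none */
    unsigned int load_history;
};

struct alb_bond_info {
    struct tlb_client_info *tx_hashtbl;   /* NULL once the bond has been closed */
};

/* Allocate the transmit hash table, as happens when the bond is opened. */
int tlb_initialize(struct alb_bond_info *bond_info)
{
    int i;

    bond_info->tx_hashtbl = calloc(TLB_HASH_SIZE, sizeof(*bond_info->tx_hashtbl));
    if (!bond_info->tx_hashtbl)
        return -1;
    for (i = 0; i < TLB_HASH_SIZE; i++)
        bond_info->tx_hashtbl[i].tx_slave = -1;
    return 0;
}

/* Free the table and clear the pointer, as happens on "ifconfig bond0 down". */
void tlb_deinitialize(struct alb_bond_info *bond_info)
{
    free(bond_info->tx_hashtbl);
    bond_info->tx_hashtbl = NULL;
}

/*
 * Drop every table entry that points at the given slave.
 * Returns the number of entries cleared, or 0 if the table is already gone.
 */
int tlb_clear_slave(struct alb_bond_info *bond_info, int slave_id)
{
    int i, cleared = 0;

    /* The guard: the bond may have been closed before the slave is released. */
    if (!bond_info->tx_hashtbl)
        return 0;

    for (i = 0; i < TLB_HASH_SIZE; i++) {
        if (bond_info->tx_hashtbl[i].tx_slave == slave_id) {
            bond_info->tx_hashtbl[i].tx_slave = -1;
            bond_info->tx_hashtbl[i].load_history = 0;
            cleared++;
        }
    }
    return cleared;
}
```

In this sketch, calling tlb_clear_slave() after tlb_deinitialize() is a harmless no-op, which mirrors the "down, then rmmod" ordering from the reproducer.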

Comment 2 Andy Gospodarek 2008-10-16 19:46:07 UTC
Created attachment 320601 [details]
/tmp/bond-fix-tx-hashtable-panic.patch

This is probably a better fix.

Comment 3 Narendra K 2008-10-17 16:50:53 UTC
Andy,

(In reply to comment #1)
> It seems the better option is to leave the functions where they are and check
> in tlb_clear_slave() if the hash-tbl has already been destroyed.

I agree.

(In reply to comment #2)

> This is probably a better fix.

I tested the patch from comment #2. It fixes the issue.

Comment 4 Andy Gospodarek 2008-10-17 18:55:23 UTC
Narendra, thanks for testing that for me.  I'll propose that upstream.

Is it OK if I mention it was discovered by you and mention your email address (I'd like to give you credit) or would you prefer that I do not do that?

Comment 5 Narendra K 2008-10-24 12:23:10 UTC
(In reply to comment #4)
> Narendra, thanks for testing that for me.  I'll propose that upstream.
> Is it OK if I mention it was discovered by you and mention your email address
> (I'd like to give you credit) or would you prefer that I do not do that?

Andy,

Thanks for the gesture. Golaz (lee_golaz) from our network team discovered the issue. Thomas (thomas_chenault) and I worked on the root cause.

Comment 12 Don Zickus 2008-11-04 16:50:50 UTC
in kernel-2.6.18-122.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 Narendra K 2008-11-07 09:46:19 UTC
I tested the kernel-2.6.18-122.el5 from comment #12 and it fixes the issue. 

Andy,

Will this fix make it into RHEL 5.3?

Comment 15 Andy Gospodarek 2008-11-07 14:05:19 UTC
Yes it will make 5.3.

Comment 16 Narendra K 2008-11-18 12:34:52 UTC
I tested this on the RHEL 5.2 snapshot 2 kernel (2.6.18-122.el5). This kernel fixes the issue.

Comment 17 Raghavendra Biligiri 2008-11-24 05:47:34 UTC
Updating the Status field based on comment #16.

Comment 19 errata-xmlrpc 2009-01-20 19:44:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

