Red Hat Bugzilla – Bug 467244
On RHEL 5.2 32 bit rmmod bonding results in a kernel panic when configured in balance-tlb mode
Last modified: 2014-06-29 19:00:43 EDT
Description of problem:
On RHEL 5.2 32 bit issuing rmmod bonding after closing the device with "ifconfig bond0 down" results in a kernel panic. The bond was configured in balance-tlb mode.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Execute the steps mentioned below in a script. The bond should be up and transmitting, and a brief TCP connection should be established so that it has some entries in its hash tables.
modprobe bonding mode=balance-tlb miimon=100
ifconfig bond0 172.16.64.85 netmask 255.255.192.0
ifenslave bond0 eth0
ifenslave bond0 eth1
echo hello | nc 172.16.64.52 100
ifconfig bond0 down
rmmod bonding
Actual results:
Results in a kernel panic.
Expected results:
No kernel panic.
Additional info:
1. I observe that the bond is closed before the slaves are detached. When "ifconfig bond0 down" is called, tlb_deinitialize() frees the bond's transmit hash table:
kfree(bond_info->tx_hashtbl);
bond_info->tx_hashtbl = NULL;
When rmmod bonding is then called, tlb_clear_slave() might attempt to access this hash table, which results in a kernel panic.
I would like to know if this is a valid issue. Should this scenario be handled more gracefully, without resulting in a kernel panic?
2. The issue is not seen if the slaves are detached before unloading the module:
ifconfig bond0 down
echo "-eth0" > /sys/class/net/bond0/bonding/slaves
echo "-eth1" > /sys/class/net/bond0/bonding/slaves
3. Same behavior is seen in the upstream kernel version 188.8.131.52.
My original attempt to fix this was no good, since it produced the following warning when using alb-mode bonding:
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
bonding: In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (0 msec) is incompatible with the forwarding delay time of the switch
bonding: MII link monitoring set to 100 ms
ADDRCONF(NETDEV_UP): bond0: link is not ready
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bonding: bond0: enslaving eth0 as an active interface with a down link.
bnx2: eth1: using MSI
ADDRCONF(NETDEV_UP): eth1: link is not ready
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
bonding: bond0: link status definitely up for interface eth1.
bnx2: eth0 NIC Copper Link is Down
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
device eth1 entered promiscuous mode
device eth1 left promiscuous mode
BUG: scheduling while atomic: rmmod/0x00000100/9094
[<f89e6b6b>] fib6_clean_node+0x11/0x6a [ipv6]
[<f89e6629>] fib6_walk+0x69/0x6e [ipv6]
[<f89e6654>] fib6_clean_tree+0x26/0x2a [ipv6]
[<f8a2824a>] bond_alb_deinitialize+0x1d/0x52 [bonding]
[<f8a22c73>] bond_release_all+0x1da/0x1f9 [bonding]
[<f8a22cf1>] bond_free_all+0x5f/0xd2 [bonding]
[<f8a2a3b6>] bonding_exit+0x1e/0x28 [bonding]
bonding: bond0: released all slaves
It seems the better option is to leave the functions where they are and check in tlb_clear_slave() if the hash-tbl has already been destroyed.
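For illustration only, the check described above could look roughly like the userspace model below. This is not the attached patch: the struct layout, the table size, and the integer slave ids are simplifications assumed here, and close_then_unload() is a hypothetical helper. Only the names tlb_deinitialize(), tlb_clear_slave(), and tx_hashtbl are taken from this report.

```c
#include <assert.h>
#include <stdlib.h>

#define TLB_HASH_SIZE 256   /* assumed table size for this model */

/* Simplified stand-ins for the driver's structures (hypothetical layout). */
struct tlb_client_info {
    int tx_slave;           /* id of the slave assigned to this entry, -1 if none */
};

struct alb_bond_info {
    struct tlb_client_info *tx_hashtbl;   /* NULL once the bond has been closed */
};

/* Models allocation of the transmit hash table when the bond is opened. */
static int tlb_initialize(struct alb_bond_info *bond_info)
{
    int i;

    bond_info->tx_hashtbl = calloc(TLB_HASH_SIZE, sizeof(*bond_info->tx_hashtbl));
    if (!bond_info->tx_hashtbl)
        return -1;
    for (i = 0; i < TLB_HASH_SIZE; i++)
        bond_info->tx_hashtbl[i].tx_slave = -1;
    return 0;
}

/* Models what happens on "ifconfig bond0 down": free the table, clear the pointer. */
static void tlb_deinitialize(struct alb_bond_info *bond_info)
{
    free(bond_info->tx_hashtbl);
    bond_info->tx_hashtbl = NULL;
}

/*
 * Models releasing a slave. The early return is the proposed check: if the
 * table has already been destroyed there is nothing to clear.
 * Returns the number of entries that were reset.
 */
static int tlb_clear_slave(struct alb_bond_info *bond_info, int slave_id)
{
    int i, cleared = 0;

    if (!bond_info->tx_hashtbl)
        return 0;

    for (i = 0; i < TLB_HASH_SIZE; i++) {
        if (bond_info->tx_hashtbl[i].tx_slave == slave_id) {
            bond_info->tx_hashtbl[i].tx_slave = -1;
            cleared++;
        }
    }
    return cleared;
}

/* Hypothetical helper: replays the failing order (close first, then release). */
static int close_then_unload(void)
{
    struct alb_bond_info bond_info;

    if (tlb_initialize(&bond_info) != 0)
        return -1;
    bond_info.tx_hashtbl[5].tx_slave = 1;   /* pretend one flow uses slave 1 */
    tlb_deinitialize(&bond_info);           /* "ifconfig bond0 down" */
    assert(bond_info.tx_hashtbl == NULL);
    return tlb_clear_slave(&bond_info, 1);  /* release during "rmmod bonding" */
}
```

In this model close_then_unload() returns 0 instead of dereferencing a freed table. The real driver also takes the hash table lock around these accesses, which the sketch leaves out.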
Created attachment 320601
This is probably a better fix.
(In reply to comment #1)
> It seems the better option is to leave the functions where they are and check
> in tlb_clear_slave() if the hash-tbl has already been destroyed.
(In reply to comment #2)
> This is probably a better fix.
I tested the patch from comment #2. It fixes the issue.
Narendra, thanks for testing that for me. I'll propose that upstream.
Is it OK if I mention it was discovered by you and mention your email address (I'd like to give you credit) or would you prefer that I do not do that?
(In reply to comment #4)
> Narendra, thanks for testing that for me. I'll propose that upstream.
> Is it OK if I mention it was discovered by you and mention your email address
> (I'd like to give you credit) or would you prefer that I do not do that?
Thanks for the gesture. Golaz (email@example.com) from our network team discovered the issue. Thomas (firstname.lastname@example.org) and I worked on the root cause.
You can download this test kernel from http://people.redhat.com/dzickus/el5
I tested the kernel-2.6.18-122.el5 from comment #12 and it fixes the issue.
Will this fix make it into RHEL 5.3?
Yes it will make 5.3.
I tested this on the RHEL 5.2 snapshot 2 kernel (2.6.18-122.el5). This kernel fixes the issue.
Updating the Status field based on comment #16.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.