Description of problem:

On RHEL 5.2 32-bit, issuing "rmmod bonding" after closing the device with
"ifconfig bond0 down" results in a kernel panic. The bond was configured in
balance-tlb mode.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
Execute the steps below in a script. The bond should be up and transmitting,
and a brief TCP connection should be established so that the bond has some
entries in its hash tables.

  modprobe bonding mode=balance-tlb miimon=100
  sleep 1
  ifconfig bond0 172.16.64.85 netmask 255.255.192.0
  sleep 1
  ifenslave bond0 eth0
  ifenslave bond0 eth1
  sleep 1
  echo hello | nc 172.16.64.52 100
  sleep 1
  ifconfig bond0 down
  sleep 1
  rmmod bonding

Actual results:
Kernel panic.

Expected results:
No kernel panic.

Additional info:

1. I observe that the bond is closed before the slaves are detached. When
"ifconfig bond0 down" is called, tlb_deinitialize() frees the bond's
transmit hash table:

  kfree(bond_info->tx_hashtbl);
  bond_info->tx_hashtbl = NULL;

When "rmmod bonding" is called, tlb_clear_slave() may attempt to access this
hash table, which results in a kernel panic.

Is this a valid issue? Should this scenario be handled more gracefully,
without a kernel panic? See the sketch below for the suspected sequence.

2. The issue is not seen if the slaves are detached before unloading the
module:

  ifconfig bond0 down
  echo "-eth0" > /sys/class/net/bond0/bonding/slaves
  echo "-eth1" > /sys/class/net/bond0/bonding/slaves
  sleep 1
  rmmod bonding

3. The same behavior is seen in the upstream kernel, version 2.6.26.5.
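To illustrate the sequence described in (1) above, here is a simplified
sketch paraphrased from 2.6.26-era drivers/net/bonding/bond_alb.c; exact
names and locking details may differ from the shipped source:

  /* Simplified sketch of the suspected sequence -- not the literal code. */

  static void tlb_deinitialize(struct bonding *bond)
  {
          struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));

          _lock_tx_hashtbl(bond);
          kfree(bond_info->tx_hashtbl);   /* freed on "ifconfig bond0 down" */
          bond_info->tx_hashtbl = NULL;
          _unlock_tx_hashtbl(bond);
  }

  static void tlb_clear_slave(struct bonding *bond, struct slave *slave,
                              int save_load)
  {
          struct tlb_client_info *tx_hash_table;
          u32 index, next_index;

          _lock_tx_hashtbl(bond);

          /* NULL here if the bond was closed before the slaves detached */
          tx_hash_table = BOND_ALB_INFO(bond).tx_hashtbl;

          index = SLAVE_TLB_INFO(slave).head;
          while (index != TLB_NULL_INDEX) {
                  /* NULL pointer dereference -> kernel panic */
                  next_index = tx_hash_table[index].next;
                  tlb_init_table_entry(&tx_hash_table[index], save_load);
                  index = next_index;
          }

          tlb_init_slave(slave);

          _unlock_tx_hashtbl(bond);
  }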
My original attempt to fix this was no good, since it produced the following
warning when using ALB-mode bonding:

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
bonding: In ALB mode you might experience client disconnections upon
reconnection of a link if the bonding module updelay parameter (0 msec) is
incompatible with the forwarding delay time of the switch
bonding: MII link monitoring set to 100 ms
ADDRCONF(NETDEV_UP): bond0: link is not ready
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bonding: bond0: enslaving eth0 as an active interface with a down link.
bnx2: eth1: using MSI
ADDRCONF(NETDEV_UP): eth1: link is not ready
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
bonding: bond0: link status definitely up for interface eth1.
bnx2: eth0 NIC Copper Link is Down
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
device eth1 entered promiscuous mode
device eth1 left promiscuous mode
BUG: scheduling while atomic: rmmod/0x00000100/9094
 [<c06074e7>] schedule+0x43/0x9cd
 [<f89e6b6b>] fib6_clean_node+0x11/0x6a [ipv6]
 [<c06095fe>] _write_lock_bh+0x8/0x1a
 [<f89e6629>] fib6_walk+0x69/0x6e [ipv6]
 [<f89e6654>] fib6_clean_tree+0x26/0x2a [ipv6]
 [<c0607f23>] wait_for_completion+0x6b/0x8f
 [<c042027b>] default_wake_function+0x0/0xc
 [<c0434408>] synchronize_rcu+0x2a/0x2f
 [<c0434059>] wakeme_after_rcu+0x0/0x8
 [<f8a2824a>] bond_alb_deinitialize+0x1d/0x52 [bonding]
 [<f8a22c73>] bond_release_all+0x1da/0x1f9 [bonding]
 [<f8a22cf1>] bond_free_all+0x5f/0xd2 [bonding]
 [<f8a2a3b6>] bonding_exit+0x1e/0x28 [bonding]
 [<c043e80a>] sys_delete_module+0x192/0x1b8
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
bonding: bond0: released all slaves

It seems the better option is to leave the functions where they are and
check in tlb_clear_slave() whether the hash table has already been
destroyed.
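For context on the BUG above: synchronize_rcu() blocks in
wait_for_completion() and may sleep, so it must not run in atomic context.
The trace shows bond_alb_deinitialize() reaching synchronize_rcu() from
bond_release_all(), which suggests my first attempt ended up calling the
deinit path with a lock held. Illustrative only; the exact call site from
that attempt is not shown in this bug:

  /* Illustrative only -- assumes the deinit was moved under the bond's
   * write lock.  synchronize_rcu() may sleep, which is forbidden while
   * any spinlock (including a BH write lock) is held. */
  write_lock_bh(&bond->lock);        /* atomic context begins */
  bond_alb_deinitialize(bond);       /* -> synchronize_rcu() -> schedule() */
  write_unlock_bh(&bond->lock);      /* too late: BUG already triggered */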
Created attachment 320601 [details]
/tmp/bond-fix-tx-hashtable-panic.patch

This is probably a better fix.
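The attachment itself is not reproduced in this bug, but a fix of the kind
described in comment #1 would look roughly like this in tlb_clear_slave();
a sketch only, assuming the table walk is simply skipped once
tlb_deinitialize() has freed and NULLed the table:

  /* Sketch only -- the attached patch is authoritative. */
  tx_hash_table = BOND_ALB_INFO(bond).tx_hashtbl;

  /* skip the walk if tlb_deinitialize() already freed the table */
  if (tx_hash_table) {
          index = SLAVE_TLB_INFO(slave).head;
          while (index != TLB_NULL_INDEX) {
                  next_index = tx_hash_table[index].next;
                  tlb_init_table_entry(&tx_hash_table[index], save_load);
                  index = next_index;
          }
  }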
Andy,

(In reply to comment #1)
> It seems the better option is to leave the functions where they are and
> check in tlb_clear_slave() whether the hash table has already been
> destroyed.

I agree.

(In reply to comment #2)
> This is probably a better fix.

I tested the patch from comment #2. It fixes the issue.
Narendra, thanks for testing that for me. I'll propose that upstream.

Is it OK if I mention that you discovered it and include your email address
(I'd like to give you credit), or would you prefer that I not?
(In reply to comment #4)
> Narendra, thanks for testing that for me. I'll propose that upstream.
> Is it OK if I mention that you discovered it and include your email address
> (I'd like to give you credit), or would you prefer that I not?

Andy,

Thanks for the gesture. Golaz (lee_golaz) from our network team discovered
the issue. Thomas (thomas_chenault) and I worked on the root cause.
in kernel-2.6.18-122.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
I tested kernel-2.6.18-122.el5 from comment #12 and it fixes the issue.

Andy, will this fix make it into RHEL 5.3?
Yes, it will make 5.3.
I tested this on the RHEL 5.2 snapshot 2 kernel (2.6.18-122.el5). This
kernel fixes the issue.
Updating the Status field based on comment #16.
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of
ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html