Bug 467244 - On RHEL 5.2 32 bit rmmod bonding results in a kernel panic when configured in balance-tlb mode
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i686 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Martin Jenner
Reported: 2008-10-16 10:50 EDT by Chris Tatman
Modified: 2014-06-29 19:00 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-01-20 14:44:05 EST

Attachments
/tmp/bond-fix-tx-hashtable-panic.patch (901 bytes, patch)
2008-10-16 15:46 EDT, Andy Gospodarek

Description Chris Tatman 2008-10-16 10:50:39 EDT
Description of problem:
On 32-bit RHEL 5.2, issuing "rmmod bonding" after closing the device with "ifconfig bond0 down" results in a kernel panic. The bond was configured in balance-tlb mode.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
Execute the steps below in a script. The bond should be up and transmitting, and a brief TCP connection should be established so that the bond has some entries in its hash tables.

modprobe bonding mode=balance-tlb miimon=100
sleep 1
ifconfig bond0 172.16.64.85 netmask 255.255.192.0
sleep 1
ifenslave bond0 eth0
ifenslave bond0 eth1
sleep 1
echo hello | nc 172.16.64.52 100
sleep 1
ifconfig bond0 down
sleep 1
rmmod bonding

Actual results:
Results in a kernel panic

Expected results:
No kernel panic.

Additional info:
1. I observe that the bond is closed before the slaves are detached. When "ifconfig bond0 down" is called, tlb_deinitialize() frees the bond's transmit hash table:

kfree(bond_info->tx_hashtbl);
bond_info->tx_hashtbl = NULL;

When "rmmod bonding" is then called, tlb_clear_slave() may attempt to access this hash table, which results in a kernel panic.

Is this a valid issue? Should this scenario be handled more gracefully, without a kernel panic?

2. The issue is not seen if the slaves are detached before unloading the module:

ifconfig bond0 down
echo "-eth0" > /sys/class/net/bond0/bonding/slaves
echo "-eth1" > /sys/class/net/bond0/bonding/slaves
sleep 1
rmmod bonding

3. The same behavior is seen in upstream kernel version 2.6.26.5.
Comment 1 Andy Gospodarek 2008-10-16 15:35:01 EDT
My original attempt to fix this was no good, since it produced the following warning when using alb-mode bonding:

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
bonding: In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (0 msec) is incompatible with the forwarding delay time of the switch
bonding: MII link monitoring set to 100 ms
ADDRCONF(NETDEV_UP): bond0: link is not ready
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bonding: bond0: enslaving eth0 as an active interface with a down link.
bnx2: eth1: using MSI
ADDRCONF(NETDEV_UP): eth1: link is not ready
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
bonding: bond0: link status definitely up for interface eth1.
bnx2: eth0 NIC Copper Link is Down
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: making interface eth1 the new active one.
device eth1 entered promiscuous mode
device eth1 left promiscuous mode
BUG: scheduling while atomic: rmmod/0x00000100/9094
 [<c06074e7>] schedule+0x43/0x9cd
 [<f89e6b6b>] fib6_clean_node+0x11/0x6a [ipv6]
 [<c06095fe>] _write_lock_bh+0x8/0x1a
 [<f89e6629>] fib6_walk+0x69/0x6e [ipv6]
 [<f89e6654>] fib6_clean_tree+0x26/0x2a [ipv6]
 [<c0607f23>] wait_for_completion+0x6b/0x8f
 [<c042027b>] default_wake_function+0x0/0xc
 [<c0434408>] synchronize_rcu+0x2a/0x2f
 [<c0434059>] wakeme_after_rcu+0x0/0x8
 [<f8a2824a>] bond_alb_deinitialize+0x1d/0x52 [bonding]
 [<f8a22c73>] bond_release_all+0x1da/0x1f9 [bonding]
 [<f8a22cf1>] bond_free_all+0x5f/0xd2 [bonding]
 [<f8a2a3b6>] bonding_exit+0x1e/0x28 [bonding]
 [<c043e80a>] sys_delete_module+0x192/0x1b8
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
bonding: bond0: released all slaves

It seems the better option is to leave the functions where they are and check in tlb_clear_slave() whether the tx hash table has already been destroyed.
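
The check proposed above can be sketched in user-space C. This is a simplified model for illustration, not the attached patch: the struct layouts, TLB_HASH_SIZE, the tlb_assign() helper, and the return value of tlb_clear_slave() are assumptions, and the driver's locking is omitted entirely.

```c
#include <assert.h>
#include <stdlib.h>

#define TLB_HASH_SIZE  256            /* assumed table size, for illustration */
#define TLB_NULL_INDEX 0xffffffffu    /* marks the end of a slave's entry list */

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct tlb_entry {
    unsigned int next;      /* next entry owned by the same slave */
    int assigned;           /* non-zero while a slave owns this entry */
};

struct bond_info {
    struct tlb_entry *tx_hashtbl;   /* NULL once the bond has been closed */
};

struct slave_info {
    unsigned int head;      /* first entry owned by this slave */
};

/* Models bringing the bond up: allocate the transmit hash table. */
int tlb_initialize(struct bond_info *b)
{
    size_t i;

    b->tx_hashtbl = calloc(TLB_HASH_SIZE, sizeof(*b->tx_hashtbl));
    if (!b->tx_hashtbl)
        return -1;
    for (i = 0; i < TLB_HASH_SIZE; i++)
        b->tx_hashtbl[i].next = TLB_NULL_INDEX;
    return 0;
}

/* Models "ifconfig bond0 down": free the table and clear the pointer. */
void tlb_deinitialize(struct bond_info *b)
{
    free(b->tx_hashtbl);
    b->tx_hashtbl = NULL;
}

/* Illustrative helper: give one hash entry to a slave. */
void tlb_assign(struct bond_info *b, struct slave_info *s, unsigned int index)
{
    assert(b->tx_hashtbl != NULL && index < TLB_HASH_SIZE);
    b->tx_hashtbl[index].assigned = 1;
    b->tx_hashtbl[index].next = s->head;
    s->head = index;
}

/*
 * Models slave release during "rmmod bonding". The walk is skipped when
 * the table has already been freed, so a NULL table is never dereferenced.
 * Returns the number of entries cleared (an assumption of this sketch).
 */
int tlb_clear_slave(struct bond_info *b, struct slave_info *s)
{
    int cleared = 0;

    if (b->tx_hashtbl) {
        unsigned int index = s->head;

        while (index != TLB_NULL_INDEX) {
            unsigned int next = b->tx_hashtbl[index].next;

            b->tx_hashtbl[index].assigned = 0;
            b->tx_hashtbl[index].next = TLB_NULL_INDEX;
            cleared++;
            index = next;
        }
    }
    s->head = TLB_NULL_INDEX;
    return cleared;
}
```

Calling tlb_deinitialize() and then tlb_clear_slave() mirrors the order in the reproduction script ("ifconfig bond0 down" followed by "rmmod bonding"); with the guard in place the second call does nothing beyond resetting the slave's list head.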
Comment 2 Andy Gospodarek 2008-10-16 15:46:07 EDT
Created attachment 320601
/tmp/bond-fix-tx-hashtable-panic.patch

This is probably a better fix.
Comment 3 Narendra K 2008-10-17 12:50:53 EDT
Andy,

(In reply to comment #1)
> It seems the better option is to leave the functions where they are and check
> in tlb_clear_slave() if the hash-tbl has already been destroyed.

I agree.

(In reply to comment #2)

> This is probably a better fix.

I tested the patch from comment #2. It fixes the issue.
Comment 4 Andy Gospodarek 2008-10-17 14:55:23 EDT
Narendra, thanks for testing that for me.  I'll propose that upstream.

Is it OK if I mention it was discovered by you and mention your email address (I'd like to give you credit) or would you prefer that I do not do that?
Comment 5 Narendra K 2008-10-24 08:23:10 EDT
(In reply to comment #4)
> Narendra, thanks for testing that for me.  I'll propose that upstream.
> Is it OK if I mention it was discovered by you and mention your email address
> (I'd like to give you credit) or would you prefer that I do not do that?

Andy,

Thanks for the gesture. Golaz (lee_golaz@dell.com) from our network team discovered the issue. Thomas (thomas_chenault@dell.com) and I worked on the root cause.
Comment 12 Don Zickus 2008-11-04 11:50:50 EST
in kernel-2.6.18-122.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 14 Narendra K 2008-11-07 04:46:19 EST
I tested the kernel-2.6.18-122.el5 from comment #12 and it fixes the issue. 

Andy,

Will this fix make it into RHEL 5.3?
Comment 15 Andy Gospodarek 2008-11-07 09:05:19 EST
Yes it will make 5.3.
Comment 16 Narendra K 2008-11-18 07:34:52 EST
I tested this on the RHEL 5.2 snapshot 2 kernel (2.6.18-122.el5). This kernel fixes the issue.
Comment 17 Raghavendra Biligiri 2008-11-24 00:47:34 EST
Updating the Status field based on comment #16.
Comment 19 errata-xmlrpc 2009-01-20 14:44:05 EST
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
