Description of problem:
Configuring network bonding causes kernel issues.

Version-Release number of selected component (if applicable):
Linux sfg-hou-nfs1 2.6.18-8.1.4.el5 #1 SMP Fri May 4 22:15:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always.

Steps to Reproduce:
[root@sfg-hou-nfs1 network-scripts]# cat ifcfg-bond0
DEVICE=bond0
IPADDR=10.10.253.65
NETMASK=255.255.255.0
NETWORK=10.10.253.0
BROADCAST=10.10.253.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS="mode=balance-alb miimon=100"

[root@sfg-hou-nfs1 network-scripts]# cat ifcfg-eth0
DEVICE=eth0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
HWADDR=00:19:B9:C3:C9:E3
BOOTPROTO=none

[root@sfg-hou-nfs1 network-scripts]# cat ifcfg-eth1
# Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet
DEVICE=eth1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
HWADDR=00:19:B9:C3:C9:E5
BOOTPROTO=none

Actual results:
bonding: bond0: Removing slave eth0
bonding: bond0: Warning: the permanent HWaddr of eth0 - 00:19:B9:C3:C9:E3 - is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
bonding: bond0: releasing active interface eth0
bonding: bond0: Removing slave eth1
bonding: bond0: releasing active interface eth1
bonding: bond0: setting mode to balance-alb (6).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bonding: bond0: enslaving eth0 as an active interface with a down link.
bonding: bond0: Adding slave eth1.
bnx2: eth1: using MSI
ADDRCONF(NETDEV_UP): eth1: link is not ready
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
RTNL: assertion failed at net/core/fib_rules.c (388)

Call Trace:
 <IRQ>  [<ffffffff80213c69>] fib_rules_event+0x3d/0xff
 [<ffffffff8006492c>] notifier_call_chain+0x20/0x32
 [<ffffffff80206e22>] dev_set_mac_address+0x52/0x58
 [<ffffffff8857d857>] :bonding:alb_set_slave_mac_addr+0x41/0x6c
 [<ffffffff8857dcd7>] :bonding:alb_swap_mac_addr+0x95/0x163
 [<ffffffff88578f2f>] :bonding:bond_change_active_slave+0x1db/0x2f6
 [<ffffffff88579a48>] :bonding:bond_select_active_slave+0xa5/0xda
 [<ffffffff8857adb4>] :bonding:bond_mii_monitor+0x3a8/0x3ee
 [<ffffffff8857aa0c>] :bonding:bond_mii_monitor+0x0/0x3ee
 [<ffffffff80092c4a>] run_timer_softirq+0x133/0x1b0
 [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
 [<ffffffff8005c330>] call_softirq+0x1c/0x28
 [<ffffffff8006a312>] do_softirq+0x2c/0x85
 [<ffffffff80054f2e>] mwait_idle+0x0/0x4a
 [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80054f64>] mwait_idle+0x36/0x4a
 [<ffffffff80046fb7>] cpu_idle+0x95/0xb8
 [<ffffffff80073bb7>] start_secondary+0x45a/0x469

RTNL: assertion failed at net/ipv4/devinet.c (984)

Call Trace:
 <IRQ>  [<ffffffff80238e22>] inetdev_event+0x48/0x27d
 [<ffffffff8003ccc5>] rt_run_flush+0x7f/0xb8
 [<ffffffff8006492c>] notifier_call_chain+0x20/0x32
 [<ffffffff80206e22>] dev_set_mac_address+0x52/0x58
 [<ffffffff8857d857>] :bonding:alb_set_slave_mac_addr+0x41/0x6c
 [<ffffffff8857dcd7>] :bonding:alb_swap_mac_addr+0x95/0x163
 [<ffffffff88578f2f>] :bonding:bond_change_active_slave+0x1db/0x2f6
 [<ffffffff88579a48>] :bonding:bond_select_active_slave+0xa5/0xda
 [<ffffffff8857adb4>] :bonding:bond_mii_monitor+0x3a8/0x3ee
 [<ffffffff8857aa0c>] :bonding:bond_mii_monitor+0x0/0x3ee
 [<ffffffff80092c4a>] run_timer_softirq+0x133/0x1b0
 [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
 [<ffffffff8005c330>] call_softirq+0x1c/0x28
 [<ffffffff8006a312>] do_softirq+0x2c/0x85
 [<ffffffff80054f2e>] mwait_idle+0x0/0x4a
 [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80054f64>] mwait_idle+0x36/0x4a
 [<ffffffff80046fb7>] cpu_idle+0x95/0xb8
 [<ffffffff80073bb7>] start_secondary+0x45a/0x469

RTNL: assertion failed at net/core/fib_rules.c (388)

Call Trace:
 <IRQ>  [<ffffffff80213c69>] fib_rules_event+0x3d/0xff
 [<ffffffff8006492c>] notifier_call_chain+0x20/0x32
 [<ffffffff80206e22>] dev_set_mac_address+0x52/0x58
 [<ffffffff8857d857>] :bonding:alb_set_slave_mac_addr+0x41/0x6c
 [<ffffffff8857dce9>] :bonding:alb_swap_mac_addr+0xa7/0x163
 [<ffffffff88578f2f>] :bonding:bond_change_active_slave+0x1db/0x2f6
 [<ffffffff88579a48>] :bonding:bond_select_active_slave+0xa5/0xda
 [<ffffffff8857adb4>] :bonding:bond_mii_monitor+0x3a8/0x3ee
 [<ffffffff8857aa0c>] :bonding:bond_mii_monitor+0x0/0x3ee
 [<ffffffff80092c4a>] run_timer_softirq+0x133/0x1b0
 [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
 [<ffffffff8005c330>] call_softirq+0x1c/0x28
 [<ffffffff8006a312>] do_softirq+0x2c/0x85
 [<ffffffff80054f2e>] mwait_idle+0x0/0x4a
 [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80054f64>] mwait_idle+0x36/0x4a
 [<ffffffff80046fb7>] cpu_idle+0x95/0xb8
 [<ffffffff80073bb7>] start_secondary+0x45a/0x469

RTNL: assertion failed at net/ipv4/devinet.c (984)

Call Trace:
 <IRQ>  [<ffffffff80238e22>] inetdev_event+0x48/0x27d
 [<ffffffff8003ccc5>] rt_run_flush+0x7f/0xb8
 [<ffffffff8006492c>] notifier_call_chain+0x20/0x32
 [<ffffffff80206e22>] dev_set_mac_address+0x52/0x58
 [<ffffffff8857d857>] :bonding:alb_set_slave_mac_addr+0x41/0x6c
 [<ffffffff8857dce9>] :bonding:alb_swap_mac_addr+0xa7/0x163
 [<ffffffff88578f2f>] :bonding:bond_change_active_slave+0x1db/0x2f6
 [<ffffffff88579a48>] :bonding:bond_select_active_slave+0xa5/0xda
 [<ffffffff8857adb4>] :bonding:bond_mii_monitor+0x3a8/0x3ee
 [<ffffffff8857aa0c>] :bonding:bond_mii_monitor+0x0/0x3ee
 [<ffffffff80092c4a>] run_timer_softirq+0x133/0x1b0
 [<ffffffff80011c19>] __do_softirq+0x5e/0xd5
 [<ffffffff8005c330>] call_softirq+0x1c/0x28
 [<ffffffff8006a312>] do_softirq+0x2c/0x85
 [<ffffffff80054f2e>] mwait_idle+0x0/0x4a
 [<ffffffff8005bcc2>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff80054f64>] mwait_idle+0x36/0x4a
 [<ffffffff80046fb7>] cpu_idle+0x95/0xb8
 [<ffffffff80073bb7>] start_secondary+0x45a/0x469

bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
eth0: no IPv6 routers present
bond0: no IPv6 routers present

Additional info:
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.4.44-1 (August 10, 2006)
ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 16 (level, low) -> IRQ 169
eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 169, node addr 0019b9c3c9e3
ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 16 (level, low) -> IRQ 169
eth1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f4000000, IRQ 169, node addr 0019b9c3c9e5
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4

[root@sfg-hou-nfs1 network-scripts]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:b9:c3:c9:e3

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:19:b9:c3:c9:e5
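As an aside, the /proc/net/bonding output above is easy to check programmatically when watching for the degraded-slave state shown (eth1 down). The following is a small sketch, not part of the report; the sample text is embedded from this bug, and on a live system you would read /proc/net/bonding/bond0 instead:

```python
# Sketch: extract per-slave MII status from the /proc/net/bonding
# format shown above.  SAMPLE is copied from this report.

SAMPLE = """\
Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)

Bonding Mode: adaptive load balancing
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:b9:c3:c9:e3

Slave Interface: eth1
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:19:b9:c3:c9:e5
"""

def slave_status(text):
    """Return {slave_name: mii_status} for each 'Slave Interface' block.

    The bond-level 'MII Status' line (before any slave block) is
    deliberately skipped."""
    status = {}
    current = None
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current:
            status[current] = line.split(":", 1)[1].strip()
    return status

print(slave_status(SAMPLE))  # -> {'eth0': 'up', 'eth1': 'down'}
```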
This is related to bug 210577; the fixes planned for that bug should resolve this issue as well.
Same on Fedora 7 with e1000 on x86_64.
(In reply to comment #2)
> Same on Fedora 7 with e1000 on x86_64.

Not surprising, since the fix isn't upstream yet.
Also, there are test kernels that contain a patch that should resolve this issue. You can get them here:

http://people.redhat.com/agospoda/#rhel5

Any testing you can do would be appreciated!
gospo, I don't see any test kernels available. If one becomes available today, I will test it. I updated to 2.6.18-8.1.6.el5 on a server that did not yet have bonding configured but has the exact same hardware. We're still seeing the error, so I don't think this problem was resolved in the new kernel update.
Jeremiah,

You can download the test kernels here: http://people.redhat.com/agospoda/#rhel5

They are kernels I've built for testing fixes and are not officially supported.
Hrm, I had gone to that link earlier and nothing was displaying; maybe a browser issue. Anyway, with 2.6.18-22.el5.gtest.18 on my server I don't see any traces when I restart networking with bonding enabled.
Glad to hear that someone else gets the same results that I get. :-) Feel free to put that test kernel through any test cycles you like since I'd like to make sure others agree with me that the bugs are worked out.
We will continue to run your kernel until our testing phase for this system has finished. If we run into any other problems we will let you know.
What are the effects of this issue? It still seems present in the latest EL5 kernels, but despite the assertion failure messages, bonding in balance-alb mode appears to work in some quick tests (tested with two tg3 interfaces).
We hit similar problems with OVZ kernels (based on RHEL 5.1). The call chain is:

bond_mii_monitor
  write_lock(&bond->curr_slave_lock);
  bond_select_active_slave
    bond_change_active_slave
      bond_alb_handle_active_change
        alb_set_slave_mac_addr
          dev_set_mac_address

dev_set_mac_address will (sooner or later) run the device notifier chain, which must be called under rtnl_lock exclusively, without any other locks held. The mainstream kernel has this problem fixed by a complete locking rewrite:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=drivers/net/bonding/bond_main.c;h=49a198206e3de901a74f34fc40694bb056ad922c;hb=HEAD
See the patches from Jay Vosburgh, 2007-10-24.
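To illustrate the invariant being violated: the notifier chain fired by a MAC address change asserts that the RTNL lock is held, but the MII monitor timer calls down into dev_set_mac_address while holding only bond->curr_slave_lock. The following is a tiny toy model of that check, not kernel code; the names merely mirror the kernel symbols for illustration:

```python
# Toy model of the ASSERT_RTNL invariant reported in the traces above.
# NOT kernel code: rtnl_held stands in for the kernel's rtnl_lock.

rtnl_held = False

def rtnl_lock():
    global rtnl_held
    rtnl_held = True

def rtnl_unlock():
    global rtnl_held
    rtnl_held = False

def assert_rtnl(where):
    """Mirrors ASSERT_RTNL: complain if RTNL is not held."""
    if not rtnl_held:
        return f"RTNL: assertion failed at {where}"
    return None

def dev_set_mac_address(dev):
    # Notifier callbacks (fib_rules_event, inetdev_event) assume RTNL.
    warning = assert_rtnl("net/core/fib_rules.c (388)")
    return warning or f"{dev}: MAC address changed"

# What the buggy path effectively did: called from the MII monitor
# timer without RTNL, so the assertion fires.
print(dev_set_mac_address("eth0"))
# -> RTNL: assertion failed at net/core/fib_rules.c (388)

# The correct pattern: take RTNL around the address change, as the
# upstream locking rework arranges.
rtnl_lock()
print(dev_set_mac_address("eth0"))  # -> eth0: MAC address changed
rtnl_unlock()
```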
Kirill,

This is already fixed in the latest RHEL 5 development kernels. You can also check my test kernels (http://people.redhat.com/agospoda/#rhel5) for some bonding updates that I hope to get included in 5.2. Any feedback you can provide is helpful.
*** This bug has been marked as a duplicate of 251902 ***