Description of problem:
Seen with CentOS5u4. When bonding 4 network ports on two network cards (tested with a 2-port e1000e card and a 2-port igb card) at 1GBit in 802.3ad mode, the cards form two aggregation groups which never merge. This appears to be caused by incorrect detection of the ports' link speed when bond0 comes up, which leads to the wrong aggregation key being selected for some of the interfaces. The aggregation key should be updated once the proper link speed is available, but this never happens.

Version-Release number of selected component (if applicable):
Problem seen with kernels:
- 2.6.18-164.11.1.el5
- 2.6.18-164.10.1.el5
- 2.6.18-164.9.1.el5
- 2.6.18-164.6.1.el5
Problem *not* seen with kernel 2.6.18-128.2.1.el5.

How reproducible:
Seen every time at system boot, not reproducible with "service network restart".

Steps to Reproduce:
1. Configure an 802.3ad mode bond interface (create ifcfg-bond0).
2. Connect 4 network ports (2 from each of 2 cards) to a switch.
3. Add the 4 network ports to the bond in the ifcfg-eth* scripts.
4. Configure the network switch to cope with 802.3ad bonded interfaces.
5. Reboot the system.

Actual results:
cat /proc/net/bonding/bond0 shows 2 aggregation groups of 2 ports each, i.e. two ports have "Aggregator ID: 2" and two ports have "Aggregator ID: 3".

Expected results:
All ports in the bond should have the same Aggregator ID.

Additional info:
It seems that at boot time, the link speed returned from the card driver to the bonding module for one of the cards is 1GBit as expected. The other card driver returns a speed of -1, which causes the default key selection of 100MBit. As these keys do not match, two aggregation groups are created. The key should be updated by the bond_3ad_adapter_speed_changed function when the adapter is "ready", but this never happens. The bond_3ad_adapter_speed_changed function is no longer referenced by any other function.
It appears all calls to the bond_3ad_adapter_speed_changed function are removed by linux-2.6-net-bonding-update-to-upstream-version-3-4-0.patch. Bonding in this mode used to work correctly on older kernels.

/etc/sysconfig/network-scripts/ifcfg-bond0:

DEVICE=bond0
ONBOOT=yes
TYPE=Ethernet
IPADDR=192.168.0.1
NETMASK=255.255.255.0
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer2+3"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and -eth1,-eth2,-eth3):

# Set to the relevant device name
DEVICE=eth0
# Set to the HW address of the relevant device
HWADDR=00:xx:xx:xx:xx:xx
ONBOOT=yes
MASTER=bond0
SLAVE=yes

/proc/net/bonding/bond0 with problem:

Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 3
    Number of ports: 2
    Actor Key: 17
    Partner Key: 16385
    Partner Mac Address: 00:22:67:xx:xx:xx

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:xx:xx:xx
Aggregator ID: 2

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:xx:xx:xx
Aggregator ID: 2

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:30:48:xx:xx:xx
Aggregator ID: 3

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:30:48:xx:xx:xx
Aggregator ID: 3
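For anyone checking whether a system shows the same symptom, the following is a minimal Python sketch that groups slaves by the Aggregator ID reported in /proc/net/bonding/bond0. The function name aggregator_groups and the SAMPLE text (trimmed from the output in this report) are illustrative only and not part of any existing tool.

```python
from collections import defaultdict

# Trimmed from the /proc/net/bonding/bond0 output in this report.
SAMPLE = """\
802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 3
    Number of ports: 2

Slave Interface: eth0
MII Status: up
Aggregator ID: 2

Slave Interface: eth1
MII Status: up
Aggregator ID: 2

Slave Interface: eth2
MII Status: up
Aggregator ID: 3

Slave Interface: eth3
MII Status: up
Aggregator ID: 3
"""


def aggregator_groups(text):
    """Map each Aggregator ID to the slave interfaces that report it."""
    groups = defaultdict(list)
    current_slave = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon, e.g. "802.3ad info" or blanks
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current_slave = value
        elif key == "Aggregator ID" and current_slave is not None:
            # An ID seen before the first slave belongs to the bond summary.
            groups[value].append(current_slave)
    return dict(groups)


if __name__ == "__main__":
    groups = aggregator_groups(SAMPLE)
    for agg_id, slaves in sorted(groups.items()):
        print(agg_id, slaves)
    if len(groups) > 1:
        print("slaves are split across aggregators")
```

On a live system the same function could be fed the contents of /proc/net/bonding/bond0 instead of SAMPLE. More than one key in the result corresponds to the split described above.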
Which hardware controls which devices? Do eth0 and eth1 use the e1000e driver or the igb driver? I'd like to look at those drivers as well as the bonding code.
eth0 & eth1 are using the igb driver, eth2 & eth3 are using the e1000e driver.
Thanks, we will take a look at these two drivers and the possible differences in their return codes.
Without the bond_3ad_adapter_speed_changed function I don't see how it can be fixed. For example, in the case of two ports and one driver, what happens if one port is disconnected during boot and is connected later on?
I did some triage on this and it looks like this is our problem. We took an update to version 3.4.0 in April 2009. This change included the following upstream commit:

commit f0c76d61779b153dbfb955db3f144c62d02173c2
Author: Jay Vosburgh <fubar.com>
Date: Wed Jul 2 18:21:58 2008 -0700

    bonding: refactor mii monitor

Time went by and it seems a bug was discovered with that commit, so the code to check speed and duplex and update it was added back here:

commit 17d04500e2528217de5fe967599f98ee84348a9c
Author: Jay Vosburgh <fubar.com>
Date: Wed Mar 18 18:38:25 2009 -0700

    bonding: Fix updating of speed/duplex changes

    This patch corrects an omission from the following commit:

    commit f0c76d61779b153dbfb955db3f144c62d02173c2
    Author: Jay Vosburgh <fubar.com>
    Date: Wed Jul 2 18:21:58 2008 -0700

        bonding: refactor mii monitor

    The un-refactored code checked the link speed and duplex of every slave on every pass; the refactored code did not do so.

    The 802.3ad and balance-alb/tlb modes utilize the speed and duplex information, and require it to be kept up to date. This patch adds a notifier check to perform the appropriate updating when the slave device speed changes.
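To illustrate why a stale speed splits the bond, here is a rough Python sketch of how the 802.3ad actor key depends on speed and duplex. It assumes the key is the duplex bit ORed with a speed bitmask shifted left by one, and that the bitmask values are those used by bond_3ad.c of this era; both are assumptions from memory and should be checked against the actual source. The names SPEED_BITMASK and actor_key are made up for this example.

```python
# Assumed speed bitmask values (Mbit/s -> bit); verify against bond_3ad.c.
SPEED_BITMASK = {10: 0x2, 100: 0x4, 1000: 0x8, 10000: 0x10}


def actor_key(speed_mbps, full_duplex):
    """Build a key from speed and duplex under the assumptions above."""
    speed_bits = SPEED_BITMASK.get(speed_mbps, 0)  # unknown speed -> no bits
    duplex_bit = 1 if full_duplex else 0
    return (speed_bits << 1) | duplex_bit


if __name__ == "__main__":
    # Slave whose driver reported 1 Gbit, full duplex, at enslave time.
    print(actor_key(1000, True))
    # Slave left at the 100 Mbit, full duplex fallback the reporter describes.
    print(actor_key(100, True))
```

Under these assumptions a 1 Gbit full-duplex slave gets key 17, which matches the "Actor Key: 17" in the reporter's /proc output, while a slave left at the 100 Mbit fallback gets a different key and so cannot join the same aggregator until its key is recomputed.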
Created attachment 398903
rhel5-bonding-cleanup.patch

I suspect this patch will resolve the issue. Any testing that can be done on it would be greatly appreciated.
That patch does seem to correctly fix the problem (tested with kernel-2.6.18-164.11.1.el5).
Awesome! Thanks for testing that Simon.
New test kernels available here:

http://people.redhat.com/agospoda/#rhel5

Any feedback you can provide is greatly appreciated.
Your latest test kernel (2.6.18-194.el5.gtest.86) does seem to resolve the issue correctly.
Thanks, again!
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-199.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
Kernel-2.6.18-199.el5 does seem to fix this problem (Verified against the original set-up described in the opening comment of this ticket).
*** Bug 602071 has been marked as a duplicate of this bug. ***
Hi Andy, the customer from it955673 agreed to help test this with our rhel5.5 kernel. Do you have a place the customer could download our test kernel from? Thanks,
wmg, a patch to resolve this issue can be found in my test kernels here:

(listed in comment #9)
http://people.redhat.com/agospoda/#rhel5

and in the latest development kernels here:

(listed in comment #17)
http://people.redhat.com/jwilson/el5/

Please check the comments for links when a bug is in the MODIFIED state as a link is often listed for kernel bugs.
(In reply to comment #28)
> wmg, a patch to resolve this issue can be found in my test kernels here:
>
> (listed in comment #9)
> http://people.redhat.com/agospoda/#rhel5
>
> and in the latest development kernels here:
>
> (listed in comment #17)
> http://people.redhat.com/jwilson/el5/
>
> Please check the comments for links when a bug is in the MODIFIED state as a
> link is often listed for kernel bugs.

Andy, the customer confirmed this works fine on the 206.el5 kernel.

Thanks,
Thanks for the feedback, wmg!
Stratus has encountered this problem also.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html