Bug 567604 - [Regression] bonding: 802.3ad problems with link detection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Network QE
URL:
Whiteboard:
Duplicates: 602071
Depends On:
Blocks:
Reported: 2010-02-23 11:46 UTC by Simon Fayer
Modified: 2018-11-14 17:03 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:08:22 UTC
Target Upstream Version:
Embargoed:


Attachments
rhel5-bonding-cleanup.patch (2.08 KB, patch)
2010-03-09 19:32 UTC, Andy Gospodarek


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Simon Fayer 2010-02-23 11:46:53 UTC
Description of problem:
Seen with CentOS5u4. When bonding 4 network ports on two network cards (tested with a 2-port e1000e card and a 2-port igb card) at 1GBit in 802.3ad mode, the cards form two aggregation groups which never merge. This appears to be caused by incorrect detection of the ports' link speed when bond0 comes up, which leads to the wrong aggregation key being selected for some of the interfaces. The aggregation key should be updated once the proper link speed is available; however, this never happens.

Version-Release number of selected component (if applicable):
Problem seen with kernels: 
- 2.6.18-164.11.1.el5
- 2.6.18-164.10.1.el5
- 2.6.18-164.9.1.el5
- 2.6.18-164.6.1.el5
Problem *not* seen with kernel 2.6.18-128.2.1.el5.

How reproducible:
Seen every time at system boot, not reproducible with "service network restart".

Steps to Reproduce:
1. Configure an 802.3ad mode bond interface (create ifcfg-bond0).
2. Connect 4 network ports (2 from each of 2 cards) to a switch.
3. Add the 4 network ports to the bond in the ifcfg-eth* scripts.
4. Configure the network switch to handle 802.3ad bonded interfaces.
5. Reboot the system.
  
Actual results:
cat /proc/net/bonding/bond0 shows 2 aggregation groups of 2 ports each, i.e.
Two ports have "Aggregator ID: 2", two ports have "Aggregator ID: 3".

Expected results:
All ports in the bond should have the same Aggregator ID.
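
The check described above (do all slaves report the same Aggregator ID?) can be automated. The following is a minimal sketch, assuming the /proc/net/bonding/bond0 layout shown in this report; the function names and the trimmed SAMPLE_STATUS text are illustrative and not part of any existing tool.

```python
import re
from collections import defaultdict

# Trimmed sample in the same layout as the /proc/net/bonding/bond0 output
# quoted in this report (illustrative only).
SAMPLE_STATUS = """\
Active Aggregator Info:
        Aggregator ID: 3
        Number of ports: 2

Slave Interface: eth0
MII Status: up
Aggregator ID: 2

Slave Interface: eth1
MII Status: up
Aggregator ID: 2

Slave Interface: eth2
MII Status: up
Aggregator ID: 3

Slave Interface: eth3
MII Status: up
Aggregator ID: 3
"""


def group_slaves_by_aggregator(status_text):
    """Map each Aggregator ID to the slave interfaces that report it."""
    groups = defaultdict(list)
    current_slave = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        slave_match = re.match(r"Slave Interface:\s*(\S+)", line)
        if slave_match:
            current_slave = slave_match.group(1)
            continue
        agg_match = re.match(r"Aggregator ID:\s*(\d+)", line)
        # The "Active Aggregator Info" entry appears before any slave
        # section, so it is skipped while current_slave is still None.
        if agg_match and current_slave is not None:
            groups[int(agg_match.group(1))].append(current_slave)
            current_slave = None
    return dict(groups)


def bond_is_split(status_text):
    """Return True if the slaves are spread over more than one aggregator."""
    return len(group_slaves_by_aggregator(status_text)) > 1


print(group_slaves_by_aggregator(SAMPLE_STATUS))
print("split:", bond_is_split(SAMPLE_STATUS))
```

On a live system the text would come from reading /proc/net/bonding/bond0 instead of SAMPLE_STATUS; a result of True from bond_is_split corresponds to the failure described here.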

Additional info:
It seems that at boot time, the link speed returned from the card driver to the bonding module for one of the cards is 1GBit as expected. The other card driver returns a speed of -1, which causes the default key selection of 100MBit. As these keys do not match, two aggregation groups are created.
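
To illustrate why a wrong speed yields a different key, here is a rough Python model of the key calculation. It is only a sketch under assumptions: the bit layout (duplex in bit 0, a speed bitmask shifted left by one) and the bitmask values are my reading of the bonding driver of this era and should be checked against bond_3ad.c; port_key and SPEED_BITS are hypothetical names. The "Actor Key: 17" shown in the output below is consistent with 1GBit full duplex under this layout.

```python
# Assumed speed bitmask values (modelled on the driver's link speed
# bitmasks; treat these numbers as an assumption, not a reference).
SPEED_BITS = {10: 0x2, 100: 0x4, 1000: 0x8, 10000: 0x10}

# Fallback speed used when the driver reports an unknown speed such as -1,
# as described in this report.
DEFAULT_SPEED_MBPS = 100


def port_key(reported_speed_mbps, full_duplex=True):
    """Compute a model aggregation key from the reported speed and duplex."""
    if reported_speed_mbps in SPEED_BITS:
        speed = reported_speed_mbps
    else:
        speed = DEFAULT_SPEED_MBPS  # unknown speed: fall back to 100MBit
    duplex_bit = 1 if full_duplex else 0
    return (SPEED_BITS[speed] << 1) | duplex_bit


good_key = port_key(1000)   # driver reported 1GBit
stale_key = port_key(-1)    # driver reported -1 at boot
print(good_key, stale_key, good_key == stale_key)
```

Under these assumptions the two cards end up with keys 17 and 9, so they cannot join the same aggregator until the second key is recomputed.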

The key should be updated by the bond_3ad_adapter_speed_changed function when the adapter is "ready"; however, this never happens. The bond_3ad_adapter_speed_changed function is no longer referenced by any other function. It appears that all calls to it were removed by linux-2.6-net-bonding-update-to-upstream-version-3-4-0.patch. Bonding in this mode worked correctly on older kernels.


/etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
ONBOOT=yes
TYPE=Ethernet
IPADDR=192.168.0.1
NETMASK=255.255.255.0
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer2+3"

/etc/sysconfig/network-scripts/ifcfg-eth0 (and -eth1,-eth2,-eth3):
# Set to the relevant device name
DEVICE=eth0
# Set to the HW address of the relevant device
HWADDR=00:xx:xx:xx:xx:xx
ONBOOT=yes
MASTER=bond0
SLAVE=yes

/proc/net/bonding/bond0 with problem:
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Active Aggregator Info:
        Aggregator ID: 3
        Number of ports: 2
        Actor Key: 17
        Partner Key: 16385
        Partner Mac Address: 00:22:67:xx:xx:xx

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:xx:xx:xx
Aggregator ID: 2

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:15:17:xx:xx:xx
Aggregator ID: 2

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:30:48:xx:xx:xx
Aggregator ID: 3

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:30:48:xx:xx:xx
Aggregator ID: 3

Comment 1 Andy Gospodarek 2010-03-04 15:42:42 UTC
Which hardware controls which devices?  Do eth0 and eth1 use the e1000e driver or the igb driver?

I'd like to look at those drivers as well as the bonding code.

Comment 2 Simon Fayer 2010-03-04 16:08:47 UTC
eth0 & eth1 are using the igb driver,
eth2 & eth3 are using the e1000e driver.

Comment 3 Andy Gospodarek 2010-03-04 21:51:54 UTC
Thanks, we will take a look at these two drivers and the possible differences in their return codes.

Comment 4 Kostas Georgiou 2010-03-05 11:38:45 UTC
Without the bond_3ad_adapter_speed_changed function I don't see how it can be fixed. For example, in the case of two ports and one driver, what happens if one port is disconnected during boot and is connected later on?

Comment 5 Andy Gospodarek 2010-03-09 19:27:07 UTC
I did some triage on this and it looks like this is our problem.  We took an update to version 3.4.0 in April 2009.

This change included the following upstream commit:

commit f0c76d61779b153dbfb955db3f144c62d02173c2
Author: Jay Vosburgh <fubar.com>
Date:   Wed Jul 2 18:21:58 2008 -0700

    bonding: refactor mii monitor

A bug was later discovered in that commit, so the code that checks speed and duplex and updates them was added back here:

commit 17d04500e2528217de5fe967599f98ee84348a9c
Author: Jay Vosburgh <fubar.com>
Date:   Wed Mar 18 18:38:25 2009 -0700

    bonding: Fix updating of speed/duplex changes
    
        This patch corrects an omission from the following commit:
    
    commit f0c76d61779b153dbfb955db3f144c62d02173c2
    Author: Jay Vosburgh <fubar.com>
    Date:   Wed Jul 2 18:21:58 2008 -0700
    
        bonding: refactor mii monitor
    
        The un-refactored code checked the link speed and duplex of
    every slave on every pass; the refactored code did not do so.
    
        The 802.3ad and balance-alb/tlb modes utilize the speed and
    duplex information, and require it to be kept up to date.  This patch
    adds a notifier check to perform the appropriate updating when the slave
    device speed changes.
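
The effect of the missing update can be shown with a toy model. This is only a sketch of the idea in the commit message above, not the kernel code: the Slave class, aggregate and on_speed_change are hypothetical names, and the cached speed stands in for the 802.3ad key.

```python
from collections import defaultdict


class Slave:
    """Toy slave device: the driver's current speed and the bond's cached copy."""

    def __init__(self, name, speed_mbps):
        self.name = name
        self.driver_speed = speed_mbps  # what the driver reports right now
        self.cached_speed = speed_mbps  # what the bond recorded at enslave time


def aggregate(slaves):
    """Group slaves by the speed the bond has cached (stand-in for the key)."""
    groups = defaultdict(list)
    for slave in slaves:
        # An unknown speed (-1) falls back to 100, as described in this report.
        effective = slave.cached_speed if slave.cached_speed > 0 else 100
        groups[effective].append(slave.name)
    return dict(groups)


def on_speed_change(slave):
    """Hypothetical notifier callback: refresh the bond's cached speed."""
    slave.cached_speed = slave.driver_speed


# At boot, two ports report -1 and two report 1000.
slaves = [Slave("eth0", -1), Slave("eth1", -1),
          Slave("eth2", 1000), Slave("eth3", 1000)]
at_boot = aggregate(slaves)

# The links finish negotiating, but nothing tells the bond about it.
for slave in slaves[:2]:
    slave.driver_speed = 1000
without_notifier = aggregate(slaves)

# With a notifier, the cached speed is refreshed and the groups merge.
for slave in slaves[:2]:
    on_speed_change(slave)
with_notifier = aggregate(slaves)

print(at_boot)
print(without_notifier)
print(with_notifier)
```

In this model the split persists until on_speed_change runs, which mirrors the behaviour reported in this bug.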

Comment 6 Andy Gospodarek 2010-03-09 19:32:50 UTC
Created attachment 398903
rhel5-bonding-cleanup.patch

I suspect this patch will resolve the issue.  Any testing that can be done on it would be greatly appreciated.

Comment 7 Simon Fayer 2010-03-10 10:59:13 UTC
That patch does seem to correctly fix the problem (tested with kernel-2.6.18-164.11.1.el5).

Comment 8 Andy Gospodarek 2010-03-10 16:12:44 UTC
Awesome! Thanks for testing that, Simon.

Comment 9 Andy Gospodarek 2010-03-25 00:37:57 UTC
New test kernels available here:

http://people.redhat.com/agospoda/#rhel5

Any feedback you can provide is greatly appreciated.

Comment 10 Simon Fayer 2010-03-29 10:20:32 UTC
Your latest test kernel (2.6.18-194.el5.gtest.86) does seem to resolve the issue correctly.

Comment 11 Andy Gospodarek 2010-03-29 13:56:55 UTC
Thanks, again!

Comment 12 RHEL Program Management 2010-03-29 14:15:51 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Jarod Wilson 2010-05-19 19:00:13 UTC
in kernel-2.6.18-199.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 18 Simon Fayer 2010-05-24 11:27:34 UTC
Kernel-2.6.18-199.el5 does seem to fix this problem (Verified against the original set-up described in the opening comment of this ticket).
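
For anyone checking a fleet of machines, a small helper can compare a kernel release string (for example the output of "uname -r") against the first build reported in this ticket to carry the fix. This is a rough sketch: FIXED_BUILD is taken from kernel-2.6.18-199.el5 above, the function names are made up, and the check only covers official 2.6.18 el5 builds. It says nothing about kernels that predate the regression (such as 2.6.18-128.2.1.el5) or about test kernels that carry the patch separately.

```python
import re

# First official build reported in this ticket to include the fix.
FIXED_BUILD = 199


def el5_build_number(release):
    """Extract the build number from a release such as '2.6.18-164.11.1.el5'."""
    match = re.match(r"2\.6\.18-(\d+)", release)
    if match is None:
        raise ValueError("not a 2.6.18 el5 kernel release: %r" % release)
    return int(match.group(1))


def includes_fix(release):
    """Return True if the build number is at or above FIXED_BUILD."""
    return el5_build_number(release) >= FIXED_BUILD


for release in ("2.6.18-164.11.1.el5", "2.6.18-199.el5"):
    print(release, includes_fix(release))
```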

Comment 20 RHEL Program Management 2010-06-02 05:12:44 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 RHEL Program Management 2010-06-03 05:10:58 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 26 Andy Gospodarek 2010-06-23 13:47:18 UTC
*** Bug 602071 has been marked as a duplicate of this bug. ***

Comment 27 wmg 2010-07-09 02:00:43 UTC
Hi Andy,

A customer from it955673 agreed to help test this with our RHEL 5.5 kernel.
Do you have a place where the customer could download our test kernel?

Thanks,

Comment 28 Andy Gospodarek 2010-07-12 19:13:46 UTC
wmg, a patch to resolve this issue can be found in my test kernels here:

(listed in comment #9)
http://people.redhat.com/agospoda/#rhel5

and in the latest development kernels here:

(listed in comment #17)
http://people.redhat.com/jwilson/el5/

Please check the comments for links when a bug is in the MODIFIED state as a link is often listed for kernel bugs.

Comment 29 wmg 2010-07-16 02:05:11 UTC
(In reply to comment #28)

Andy, the customer confirmed this works fine on the 206.el5 kernel.

Thanks,

Comment 30 Andy Gospodarek 2010-07-16 02:27:32 UTC
Thanks for the feedback, wmg!

Comment 33 Robert N. Evans 2010-10-28 19:45:56 UTC
Stratus has encountered this problem also.

Comment 37 errata-xmlrpc 2011-01-13 21:08:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

