Bug 598573 - [Stratus 5.6 bug] bonding fails with 10Gb Ethernet links and 1Gb switch ports
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All  OS: Linux
Priority: high  Severity: high
Target Milestone: beta
Target Release: 5.6
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
: OtherQA
Depends On:
Blocks: 557597
Reported: 2010-06-01 12:18 EDT by Dan Duval
Modified: 2014-06-29 19:02 EDT (History)
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-14 12:42:24 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
see problem description (20.00 KB, application/x-tar)
2010-06-01 12:18 EDT, Dan Duval
ifconfig and tcpdump output (150.00 KB, application/x-tar)
2010-06-02 11:26 EDT, Dan Duval
tcpdump capture file from failed ifup (50.48 KB, application/octet-stream)
2010-06-03 16:38 EDT, Dan Duval

Description Dan Duval 2010-06-01 12:18:04 EDT
Created attachment 418720 [details]
see problem description

Description of problem:

This problem arises on a system where:

	. the NICs are 82599-based 10Gb/s Ethernet;

	. they are connected to 1 Gb/s SFP modules in a switch; and,

	. an attempt is made to create an active/backup (mode 1)
	  bond pair with two of the NICs.

When configured as individual non-bonded interfaces, the 10Gb cards
work as advertised.  They can obtain IP addresses from DHCP, and
communication via the NICs proceeds as one would expect.

When two such interfaces are placed into an active/backup bond
pair, though, the resulting bond device is unusable.

Specifically, about 80-90% of the time, the attempt to bring up
the bond device fails with an inability to obtain an IP address.
The rest of the time, an IP address is obtained, but attempts to
pass traffic across the resulting link fail with "host unreachable".

Interestingly, if one of the network cables is disconnected, the
bond device comes up and functions correctly.

More details:

All tests were done in a DHCP environment.

The NICs are Intel X520-series, with PCI device ID 0x10fb.

The SFPs in the switch are Finisar FTLF8524P2BNV modules.  These
are nominally 4.25Gb/s Fibre Channel devices, but are spec'd to be rate-selectable for 1000BASE-SX Ethernet.  We have used these
modules for 1Gb Ethernet testing without a problem.

The switch is an Extreme Summit X650-24x, which is 10Gb/s-capable.


Version-Release number of selected component (if applicable):

Reproduced with vanilla 2.6.18-194.el5 kernel, and with Andy
Gospodarek's 2.6.18-194.el5.gtest.87 test kernel.


How reproducible:

Happens every time under the specified conditions.


Steps to Reproduce:

1. Install two 82599EB-based 10Gb NICs into a system
2. Connect to 1Gb switch ports
3. Configure the two NICs into an active/backup bond pair.
4. Attempt to "ifup" that bond device.
  
Actual results:

Communication through the bond device fails reliably.

Expected results:

Communication through the bond device succeeds reliably.

Additional info:

Attached to this bug report is (should be) a tar file containing
the ifcfg files for the two NICs and for the bond device, a snippet
of the /var/log/messages file taken during a failed bringup, and the
contents of /proc/net/bonding/bond2 after such a bringup.
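
For readers without the attachment handy, a minimal active/backup setup of
the kind described above looks roughly like the sketch below.  These are
illustrative file contents only, not the exact contents of the attached tar
file (the BONDING_OPTS values in particular are just an example):

/etc/sysconfig/network-scripts/ifcfg-bond2:
    DEVICE=bond2
    BONDING_OPTS="mode=1 miimon=100"
    ONBOOT=yes
    BOOTPROTO=dhcp

/etc/sysconfig/network-scripts/ifcfg-eth100400 (the other slave, eth110400,
is identical apart from DEVICE):
    DEVICE=eth100400
    ONBOOT=yes
    SLAVE=yes
    MASTER=bond2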
Comment 1 Dan Duval 2010-06-01 12:20:05 EDT
Assigning this to Andy at his request.
Comment 2 Andy Gospodarek 2010-06-01 15:52:18 EDT
Thanks for sending that information to me.  I don't see anything right away that looks like a problem.  There were some issues with 1Gbps links and flow control on the 82598, but that doesn't seem like it would be an issue here, since you are using 82599 devices and flow control appears to be disabled (which is good).  I have a few questions:

1.  Do the cards work in 10Gbps mode when in a bond?  You indicated the cards worked independently at 10Gbps, but what about when using the active-backup bond?

2.  Do the cards work independently when the speed is set to 1Gbps?  I just want to make sure I understand whether this issue is specific to 1Gbps or to bonding, and this (along with the answer to #1) will close that loop.

3.  Spanning tree on the switch isn't preventing DHCP from working correctly, right?  (I hate to ask this, but I want to be sure.)

4.  When the current setup isn't working, does it begin to work if you run: 

# echo eth110400 > /sys/class/net/bond2/bonding/active_slave

(I chose this value as the bond2 file attached showed eth100400 as the currently active slave interface.  You could switch the device listed to the one that is not the active link in /proc/net/bonding/bond2 as needed.)
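
For reference, a minimal sequence to check and flip the active slave by hand
(device names taken from the attached bond2 file; adjust to whichever slave is
currently active) would be:

# cat /sys/class/net/bond2/bonding/active_slave
eth100400
# echo eth110400 > /sys/class/net/bond2/bonding/active_slave
# cat /sys/class/net/bond2/bonding/active_slave
eth110400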
 
It seems interesting that using only a single interface in the bond makes the bond work, and I wonder if somehow the switch and the bonding code are out of sync.

Thanks for the help testing this.
Comment 3 Dan Duval 2010-06-01 17:46:00 EDT
Answers:

1. An active/backup bond works correctly when the cards are connected to
10Gb ports.

2. If the question is whether the 10Gb cards work properly when connected
to 1Gb ports and configured as individual interfaces, the answer is yes,
they do.  bonding is an essential factor in producing this problem.

3. If spanning tree were interfering with DHCP, I would expect that in
the 10-20% of cases where the bond device does obtain an IP address from
DHCP, communication would thereafter proceed normally.  Not so?

Also, I'm using the same switch for both 10Gb and 1Gb connections, just
varying the SFP/SFP+ modules that are inserted into the switch ports.  I'd
think that, if spanning tree were an issue, it would be an issue in both
cases.

4. Interesting!  When I change the active slave while the bond is trying
to get a DHCP address, the result is that the DHCP succeeds, and the
result is a working bond pair through which I can ping the local router.
I tried this three times, and got the same result each time.

Then, to confirm that the eth100400 link is not simply a bad actor, I
brought up the bond with the other link disconnected.  This was
successful.
Comment 4 Andy Gospodarek 2010-06-01 22:43:54 EDT
(In reply to comment #3)
> Answers:
> 
> 1. An active/backup bond works correctly when the cards are connected to
> 10Gb ports.

OK, good.

> 2. If the question is whether the 10Gb cards work properly when connected
> to 1Gb ports and configured as individual interfaces, the answer is yes,
> they do.  bonding is an essential factor in producing this problem.

Also good data.

> 3. If spanning tree were interfering with DHCP, I would expect that in
> the 10-20% of cases where the bond device does obtain an IP address from
> DHCP, that thereafter communication would proceed normally.  Not so?
> 
> Also, I'm using the same switch for both 10Gb and 1Gb connections, just
> varying the SFP/+ modules that are inserted into the switch ports.  I'd
> think that, if spanning tree were an issue, that it would be an issue
> in both cases.

This one was a long-shot, but I wanted to mention it just in case.
 
> 4. Interesting!  When I change the active slave while the bond is trying
> to get a DHCP address, the result is that the DHCP succeeds, and the
> result is a working bond pair through which I can ping the local router.
> I tried this three times, and got the same result each time.

This is *extremely* interesting.

> Then, to confirm that the eth100400 link is not simply a bad actor, I
> brought up the bond with the other link disconnected.  This was
> successful.    

That is good to know too.  Thanks for trying that.  One more question right now.

The output for

# ethtool -i eth100400

and 

# ethtool -i eth110400

does not change across boots, right?  I'm mostly looking to be sure that the proper PCI device is assigned to the proper net device each time.  I'm not sure it would make a difference, but I thought I would ask. :)

Could you also attach:

1. Output from `ifconfig` on the system when in the bond and using 1G links.

and

2. If possible, a raw tcpdump or wireshark capture of a working and a failing DHCP request.

Thanks!
Comment 5 Dan Duval 2010-06-02 11:26:41 EDT
Created attachment 419073 [details]
ifconfig and tcpdump output
Comment 6 Dan Duval 2010-06-02 11:35:59 EDT
Proper operation of our servers requires that we know the exact physical
location of any given I/O device.  So the answer is no, it's not possible
for device names to "wander".  "eth100400" always refers to the same
device, as long as it occupies a given slot.

In fact, that's why you're seeing these funky device names instead of
"eth0" and so forth.  Encoded in that "100400" is a reference to a
specific physical slot in our chassis.  It's invariant, even across
configuration changes (e.g., addition or removal of other PCI devices,
even cards with on-board bridges).

We use custom udev scripts to ensure that this is the case.
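
For illustration only (the real rules are more involved, and the exact match
keys depend on the udev version), a rule of that sort pins a name to the PCI
address of a slot along these lines; the rules-file name and PCI address here
are hypothetical:

# /etc/udev/rules.d/70-slot-names.rules  (hypothetical example)
KERNEL=="eth*", BUS=="pci", ID=="0000:0e:00.0", NAME="eth100400"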

The attachment I just added is a tar file containing the requested
output.  The "ifconfig" is from one of the instances where the bond
device succeeded in getting a DHCP address, but the network link was
unusable.  The "tcpdump" is from a different attempt where no DHCP
address was obtained.

You can see the DHCP requests going out, but no replies coming back.
Comment 7 Andy Gospodarek 2010-06-03 15:19:17 EDT
Thanks for posting those, Dan.  The tcpdump output would be more helpful if captured like this:

# tcpdump -s 0 -w dhcp-log.cap

as I can load it up in wireshark and do a more detailed inspection of the frames when it fails.  Do you think you could capture a DHCP failure like that for me?
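
It might also be worth capturing on each slave individually while reproducing
the failing ifup, so we can see which physical port the DHCP requests actually
leave on and whether any replies arrive on the backup port.  Assuming the
interface names from earlier in this bug, something like:

# tcpdump -i eth100400 -s 0 -w eth100400-dhcp.cap &
# tcpdump -i eth110400 -s 0 -w eth110400-dhcp.cap &
# ifup bond2
# kill %1 %2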

Just so you know, I'm working with our lab to try and reproduce this in-house as well.
Comment 8 Dan Duval 2010-06-03 16:38:19 EDT
Created attachment 420015 [details]
tcpdump capture file from failed ifup

Better tcpdump data, as requested.
Comment 9 Andy Gospodarek 2010-06-03 23:34:40 EDT
Thanks for the tcpdump, Dan.

I'm not sure we have any 1Gbps mini-gbics available, but we are still looking.
Comment 10 Dan Duval 2010-06-08 12:21:43 EDT
Any luck with those SFPs?  Are you in Westford?  We could lend you
a pair if lack of 1Gb optics is still an issue.
Comment 11 Andy Gospodarek 2010-06-08 14:16:48 EDT
Dan, I'm not in Westford, but the systems are. :)

I've added Arlinton and Matt to the cc-list for this bug; hopefully they can reply if the 1Gbps SFPs are working and, if not, contact you about borrowing some, as long as they will work in our switch.

I also looked at the tcpdump, but will add another comment for thoughts on it.
Comment 14 Andy Gospodarek 2010-06-09 14:10:30 EDT
(In reply to comment #13)
> OK, it looks like I just reproduced this.  I'll start digging around in the
> driver and see if I can understand what is happening.    

Actually I need to take that back.  I am not able to reproduce this with my system.

[root@dell-pet410-01 network-scripts]# ethtool eth2
Settings for eth2:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseT/Full 
                               10000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  1000baseT/Full 
	                        10000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000007 (7)
	Link detected: yes
[root@dell-pet410-01 network-scripts]# ethtool eth3
Settings for eth3:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseT/Full 
                               10000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  1000baseT/Full 
	                        10000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000007 (7)
	Link detected: yes
[root@dell-pet410-01 network-scripts]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:37:b7:2c

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:37:b7:2d
[root@dell-pet410-01 network-scripts]# more ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="miimon=1000 mode=1"
ONBOOT=yes
BOOTPROTO=dhcp
[root@dell-pet410-01 network-scripts]# more ifcfg-eth2
# Intel Corporation 82599EB 10-Gigabit Network Connection
DEVICE=eth2
HWADDR=00:1B:21:37:B7:2C
ONBOOT=yes
SLAVE=yes
MASTER=bond0
[root@dell-pet410-01 network-scripts]# more ifcfg-eth3
# Intel Corporation 82599EB 10-Gigabit Network Connection
DEVICE=eth3
HWADDR=00:1B:21:37:B7:2D
ONBOOT=yes
SLAVE=yes
MASTER=bond0

Relevant dmesg info:

Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)
bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters mu
st be specified, otherwise bonding will not detect link failures! see bonding.txt for 
details.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Setting MII monitoring interval to 1000.
bonding: bond0: setting mode to active-backup (1).
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth2.
bonding: bond0: enslaving eth2 as a backup interface with a down link.
bonding: bond0: Adding slave eth3.
bonding: bond0: enslaving eth3 as a backup interface with a down link.
ixgbe: eth2 NIC Link is Up 1 Gbps, Flow Control: RX/TX
ixgbe: eth3 NIC Link is Up 1 Gbps, Flow Control: RX/TX
bonding: bond0: link status definitely up for interface eth2.
bonding: bond0: making interface eth2 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: link status definitely up for interface eth3.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bond0: no IPv6 routers present
Comment 15 Andy Gospodarek 2010-06-09 14:26:28 EDT
A few more bits of information:

[root@dell-pet410-01 ~]# ethtool -i eth2
driver: ixgbe
version: 2.0.44-k2
firmware-version: 0.9-2
bus-info: 0000:04:00.0
[root@dell-pet410-01 ~]# ethtool -i eth3
driver: ixgbe
version: 2.0.44-k2
firmware-version: 0.9-2
bus-info: 0000:04:00.1
[root@dell-pet410-01 ~]# lspci -n -s 0000:04:00.0 
04:00.0 0200: 8086:10fb (rev 01)
[root@dell-pet410-01 ~]# lspci -n -s 0000:04:00.1 
04:00.1 0200: 8086:10fb (rev 01)
[root@dell-pet410-01 ~]# uname -r
2.6.18-194.el5

I also noticed that it did take some time for the link to come up and gain an address via DHCP.  This turned out to be an STP issue, so I suspect a setup that had STP disabled would not see a delay like this.
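
(If that delay is a nuisance during testing, the usual workaround on a Cisco switch is to enable `spanning-tree portfast` on the server-facing ports so they skip the listening/learning states; just a note in passing, not something needed to reproduce this.)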

Can we start to examine the switch configuration to understand if there is something different there?

I've seen our Cisco switch configuration and there is nothing special about it other than flow control being enabled.

Output of the following commands on your system after failure might be interesting as well.

[root@dell-pet410-01 ~]# ethtool -S eth2 | grep -v :\ 0$ 
NIC statistics:
     rx_packets: 4333
     tx_packets: 83
     rx_bytes: 424010
     tx_bytes: 11240
     rx_pkts_nic: 4333
     tx_pkts_nic: 83
     rx_bytes_nic: 441342
     tx_bytes_nic: 11934
     lsc_int: 1
     multicast: 156
     broadcast: 4150
     hw_rsc_aggregated: 4333
     hw_rsc_flushed: 4333
     tx_queue_0_packets: 83
     tx_queue_0_bytes: 11240
     rx_queue_0_packets: 4333
     rx_queue_0_bytes: 424010
[root@dell-pet410-01 ~]# ethtool -S eth3 | grep -v :\ 0$ 
NIC statistics:
     rx_packets: 4184
     rx_bytes: 401283
     rx_pkts_nic: 4184
     rx_bytes_nic: 418019
     lsc_int: 1
     broadcast: 4183
     hw_rsc_aggregated: 4184
     hw_rsc_flushed: 4184
     rx_queue_0_packets: 4184
     rx_queue_0_bytes: 401283
Comment 16 Andy Gospodarek 2010-06-14 11:18:22 EDT
Any thoughts on my last 2 comments, Dan?
Comment 17 Dan Duval 2010-06-14 12:34:38 EDT
OK, I've figured out what's going wrong.  The locus of the
problem is between my ears.

At Andy's suggestion, I logged into the switch and began to
poke around.  What I found was that the two switch ports to
which the 10Gb cards were connected had been configured into
an 802.3ad link aggregation group.  So it's no surprise that the
active/backup config didn't work.

I moved the SFPs to another pair of (non-trunked) ports, and
the bond now works correctly.
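
For anyone else chasing this symptom, the switch-side check is quick.  On an
ExtremeXOS switch like the Summit X650 used here, `show sharing` lists the
configured link-aggregation (sharing) groups and their member ports; on Cisco
IOS the rough equivalent is `show etherchannel summary`.  (Commands from
memory; exact syntax depends on the firmware version.)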

Please accept my apologies for not having caught this sooner.

It's interesting that the "failing" configuration appears to
work properly with the RHEL 6.0 development kernels.

Anyway, I think this bug can be closed.  Shall I do so?
Comment 18 Andy Gospodarek 2010-06-14 13:23:03 EDT
No problem, Dan.  Glad it is working for you.

I do not suspect that anything specific to the kernel version would cause this to fail or succeed.  I could imagine a scenario where the bond on the switch would be up and the hashing algorithm it used for output port selection would select the interface in the active-backup bond that was supposed to drop frames, since it was the backup.  It may have previously worked based on the destination MAC, IP, or the order in which the interfaces came up.  I suspect that is the source of the previous success and the more recent failure.
