Bug 227005

Summary: speed limit on bonding interface
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.4
Hardware: ia32e
OS: Linux
Severity: medium
Priority: medium
Reporter: David Kostal <david.kostal>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Brian Brock <bbrock>
CC: peterm
Status: CLOSED NOTABUG
Doc Type: Bug Fix
Last Closed: 2007-02-15 16:42:06 UTC

Description David Kostal 2007-02-02 16:11:28 UTC
Description of problem:
I have a Dell PE2850 (2x CPU, 8GB RAM) with 4 gigabit cards (e1000) attached to a
single Cisco 4506. I am trying to create bonding interfaces with 2 slaves each. I test
with balance-alb, but the behaviour is similar with balance-tlb and balance-rr. I do
not have any special switch settings.
I am able to receive 2x 1Gbps streams on bond0 (balance-alb) from two other hosts.
I am able to send a 1Gbps stream to one other host via bond0.
I am able to send 2x 1Gbps streams to host1 via bond0 and host3 via bond1.
I am only able to send 2x 0.5Gbps streams to host1 and host2 via bond0 (both
slaves are used, each carrying traffic to a single host, but only at 0.5Gbps).

Is there any reason why the outgoing traffic via bond0 doesn't go over 1Gbps?
The host utilization is well under its limits.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-42.0.3.EL


How reproducible:
always

Steps to Reproduce:
1. create bond0 with 2 e1000 gigabit slaves, balance-alb, mii monitoring
2. generate network traffic to one directly connected host. Watch the network
traffic on bond0 and on both of its slaves (I observe 1Gbps on bond0 and eth0,
0Gbps on eth1)
3. generate network traffic to another host via bond0. Watch the traffic on
bond0 and its slaves (I observe 1Gbps on bond0 and 0.5Gbps on each of eth0 and
eth1); one way to watch the per-slave counters is sketched right after these steps
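
A sketch of how the per-slave traffic above can be watched (the interface names and the use of the /proc/net/dev byte counters are assumptions on my part; any interface-statistics tool such as sar or ifstat would do as well):

watch -n1 'grep -E "bond0|eth0|eth1" /proc/net/dev'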
  
Actual results:
Total traffic doesn't go over 1Gbps

Expected results:
Total traffic should be 2Gbps, with 1Gbps to each client, because each host uses a
different slave interface.

Additional info:

Comment 1 David Kostal 2007-02-08 10:27:10 UTC
Do you have any idea why is this happening? Are you able to reproduce this
behaviour? Or could this be sort of configuration problem only?

Comment 2 Andy Gospodarek 2007-02-08 16:28:15 UTC
> I am able to receive 2x 1Gbps streams on bond0 (balance-alb) from two other hosts.

This is working as expected for balance-alb, but rr mode would be limited to
1Gbps since the same MAC/IP combo would be used on all systems.

> I am able to send 1Gbps stream to one other host via bond0.

This is working as expected.

> I am able to send 2x 1Gbps streams to host1 via bond0 and host3 via bond1.

This is working as expected.

> I am only able to send 2x 0.5Gbps streams to host1 and host2 via bond0 (both
> slaves are used, each having traffic to single host, but only 0.5Gbps).

Are you using tcp or udp?  Does this change when switching to rr-mode and udp?

> Is there any reason why the outgoing traffic via bond0 doesn't go over 1Gbps?
> The host utilization is well under its limits.

There are not any hard limitations in the driver that cap speeds at 1G.  There
are definitely some limitations on bonding and how much you can transmit and
receive from a single host -- generally the limitation is on reception, since
the switch can't learn the destination MAC on multiple interfaces and stripe the
traffic across them.  This limitation is lifted for 802.3ad, xor, and balance-alb,
since the switch can hash different connections over different interfaces, but
each tcp/udp stream (and in alb's case, each host) will still be limited to the
speed of a single slave interface.
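
As a rough illustration of the xor case just mentioned (the formula below is my understanding of the usual layer-2 transmit hash, not something stated in this report): the outgoing slave is picked from the two MAC addresses alone, so all frames to a given peer leave on the same slave.

# last bytes of the local and peer MAC addresses (values assumed for illustration)
src_mac_last=0x4a
dst_mac_last=0x01
slave_count=2
echo $(( (src_mac_last ^ dst_mac_last) % slave_count ))   # index of the slave carrying this peer's traffic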



Comment 3 David Kostal 2007-02-09 08:14:00 UTC
I'm using TCP (with alb, tlb and rr).

Actually I have 2 identical setups on two PE2850s. My modprobe.conf on host0:
alias bond0 bonding
alias bond1 bonding
options bond0 miimon=50 mode=balance-alb max_bonds=2
options bond1 miimon=50 mode=balance-alb

and on host1:
install bond0 modprobe e1000; modprobe bonding --ignore-install -o bond0 \
        mode=balance-alb miimon=50 primary=eth2
install bond1 modprobe e1000; modprobe bonding --ignore-install -o bond1 \
        mode=balance-alb miimon=50

They are a little bit different because I did some testing there (assigning
different physical interfaces to bonds, parameters for e1000, etc.).

On 2.6.9-42.0.3.EL both behaved the same. Yesterday I upgraded to
2.6.9-42.0.8.ELsmp and host1 still behaves the same (config above), while host0
now behaves correctly: with 2 outgoing connections I get over 1.8Gbps.

host1 is now using the 2 onboard e1000 ports for bond0 (which I test), while host0
uses one onboard port and one port of a dual-port card in a PCI-X slot (eth0+eth2).
But I do not think that this is a problem, because I am (or was) able to send at full
speed with any two cards out of my four.

I do not know when I'll be able to switch the config on host1 to see whether a
different modprobe.conf will help (the machines are used by other people too). Is the
configuration of host1 wrong? Actually this is the only way to have two
different bonding algorithms on bond0 and bond1, AFAIK.



Comment 4 Andy Gospodarek 2007-02-09 20:28:15 UTC
Can you try to use netcat (nc) with udp traffic?  One problem with tcp is
that you often don't know if the limitation is on rx or tx, since tcp will make
the traffic back off when the maximum throughput can't be reached.
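
A hedged sketch of such a test (the port number, the placeholder address, and the exact nc option syntax are assumptions on my part; option syntax differs between netcat variants):

# on the plain receiving host: listen for UDP on an arbitrary port (5001 assumed)
nc -u -l -p 5001 > /dev/null
# on the bonded host: push a zero-filled UDP stream towards the receiver
dd if=/dev/zero bs=1M count=2000 | nc -u <receiver-ip> 5001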

Glancing at your config, it looks fine, though you should probably remove
'modprobe e1000;' from the 'install' line on host1.
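
For reference, a sketch of the host1 'install' lines from comment 3 with only the 'modprobe e1000;' part dropped (just the literal edit suggested above, untested):

install bond0 modprobe bonding --ignore-install -o bond0 \
        mode=balance-alb miimon=50 primary=eth2
install bond1 modprobe bonding --ignore-install -o bond1 \
        mode=balance-alb miimon=50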

Could you also send the output of /proc/net/bonding/bond0 and
/proc/net/bonding/bond1 on these systems as well?

Comment 5 David Kostal 2007-02-15 14:58:18 UTC
I am now testing with both udp and tcp, same results.

I upgraded the bios on both PE2850s to be the same (A06), no change.
I replugged the cables on host1 (not working) to have the same assignments to bonds
as on host0 (working), no change (they were different because of my previous
testing of this issue).
I changed /etc/modprobe.conf to be the same as on host0, loading only one
"bonding" with max_bonds=2, no change.

I am now confused because I have two configurations that are very much the same,
and one is working as I expect while the other one is not.


Here is the output of /proc on (working) host0
[root@paris ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:04:23:d8:30:4a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:72:54:99:81
[root@paris ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:04:23:d8:30:4b

Slave Interface: eth3
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:13:72:54:99:82
[root@paris ~]# 


And on host1 (not working):
[root@sofia ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:04:23:d8:2c:3a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:43:d4:94:a2
[root@sofia ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:04:23:d8:2c:3b

Slave Interface: eth3
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:11:43:d4:94:a3
[root@sofia ~]# 

I'm testing on bond0, of course.

The modprobe.conf on both nodes contains:
alias eth0 e1000
alias eth1 e1000
alias eth2 e1000
alias eth3 e1000
options e1000 FlowControl=1
alias bond0 bonding
alias bond1 bonding
options bond0 miimon=50 mode=balance-alb max_bonds=2
options bond1 miimon=50 mode=balance-alb

kernel on both nodes is 2.6.9-42.0.8.ELsmp

eth0 and eth1 are on PCI-X dual-port network card, eth2 and eth3 are onboard.

Neither of these two hosts is overloaded when I do the tests.

sysctl.conf is the same on both nodes.



Comment 6 David Kostal 2007-02-15 16:27:47 UTC
It seems the problem is not RH related, but some limitation of the Cisco Catalyst
4506. If I plug the cables into Catalyst ports that are not close to each other
(different blocks of 8 ports, as labeled on the Cisco board), I can get 2x 1Gbps
on both machines.

Please close this as not-a-bug (at least for RedHat:)

Comment 7 Andy Gospodarek 2007-02-15 16:42:06 UTC
Thanks for the update, David.

I'll close this one out, but I'll remember that switches can cause problems
sometimes too! :)