Description of problem:
When using bonding in 802.3ad mode under very heavy stress, removing the last slave of the active aggregator might result in a long failover to another aggregator (up to 90 sec.) due to LACPDU packets being dropped from the slaves' tx queues. The solution is to send such packets with the highest priority. The solution was verified to work with 10/100/1000 Mbps adapters, but might not be good enough when using 10/100 adapters, in which case it will be necessary to wait the entire timeout defined by the IEEE standard.

Version-Release number of selected component (if applicable):
kernel-2.4.20-1.1931.2.231.2.11.ent

How reproducible:
Configure a bonding team in 802.3ad mode with 2 or more gigabit adapters and 2 or more 10/100 adapters. Start heavy bi-directional stress traffic between the server and the clients, and remove the gigabit slaves from the bond one by one. Once the last gigabit slave is removed, traffic may stall until the new aggregator is selected (the delay may vary between switches).

Steps to Reproduce:
1. insmod bonding mode=4
2. ifconfig bond0 <ip-addr>
3. ifenslave bond0 eth0 eth1 eth2 eth5 eth6 eth7
4. Run stress traffic (e.g. iperf, netperf, etc.)
5. Run ifenslave -c on all gigabit slaves
6. Monitor traffic on the remaining slaves

Actual results:
Traffic stalls for up to 90 sec. until a new aggregator is selected.

Expected results:
Traffic restarts immediately using another aggregator.

Additional info:
A bug fix patch was sent by me on June 26th to the bond-devel, linux-net and linux-netdev lists. It has already been accepted by Jeff Garzik into his net-drivers-2.4 BK tree. It is a further fix for the original problem reported by Jay Vosburgh regarding the same issue without stress traffic.
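For reference, below is a minimal sketch of the fix approach described above: marking LACPDUs as control traffic so they are transmitted at the highest priority instead of being dropped from a tx queue saturated by stress traffic. The function name, the frame construction, and the 0x8809 Slow Protocols EtherType handling are illustrative assumptions modeled loosely on the 2.4 bonding driver's LACPDU transmit path; only the priority marking reflects the fix being described here.

/*
 * Hypothetical sketch (not the actual patch): transmit an LACPDU
 * at the highest priority so it survives a congested tx queue.
 */
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/pkt_sched.h>	/* TC_PRIO_CONTROL */

static int lacpdu_send_sketch(struct net_device *slave_dev,
			      const void *lacpdu, int length)
{
	struct sk_buff *skb = dev_alloc_skb(length);

	if (!skb)
		return -ENOMEM;

	skb->dev = slave_dev;
	skb->protocol = htons(0x8809);	/* IEEE 802.3 Slow Protocols */

	/*
	 * The fix: mark the frame as control traffic. The default
	 * pfifo_fast qdisc maps TC_PRIO_CONTROL to its highest-priority
	 * band, which the sparse LACPDU stream never fills, so the
	 * LACPDU is dequeued ahead of bulk traffic and is not dropped
	 * when the bulk bands overflow under stress.
	 */
	skb->priority = TC_PRIO_CONTROL;

	memcpy(skb_put(skb, length), lacpdu, length);

	return dev_queue_xmit(skb);
}

Without the priority marking, LACPDUs queue behind the stress traffic and can be dropped, so the link partner's LACP state times out and failover takes up to the full IEEE timeout; with it, the periodic LACPDUs keep flowing and the new aggregator is selected promptly.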
ISSUE TRACKER 25887 OPENED AS SEV 1
Jeff, do we have patches for this in Taroon already, or are they still in your queue?
The fix appears to be implemented in the RHEL 3 B1 candidate kernel (version 2.4.21-1.1931.2.349.2.2.ent).