Red Hat Bugzilla – Bug 114355
Bonding has been taking a LONG TIME to FAIL OVER!
Last modified: 2007-11-30 17:07:00 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5)
Description of problem:
I have a serious problem with bonding!
A customer has deployed bonding for network connection failover.
But the bonding does not recover the connection cleanly after the following sequence:
Step1) eth0: connected, eth1: connected
Step2) eth0: disconnected, eth1: connected
Step3) eth0: recovering the connection, eth1: connected
==> *"bond0" showed about 50% packet loss for 20-30 seconds.*
Why did "bond0" lose about 50% of its packets for 20-30 seconds after
re-connecting the NIC cable?
--- additional story -------------------------------
I set up network bonding on RHEL3 as below
and tested the fault tolerance of the network connection.
Step1) eth4: connected  eth5: connected
Step2) eth4: disconnected  eth5: connected
Step3) eth4: connected  eth5: connected
Step4) eth4: connected  eth5: disconnected
Step5) eth4: connected  eth5: connected
The curious result was some packet loss at Step3 and Step5, lasting
about 20-30 seconds after the network connection was restored. Any
packet loss lasting 20-30 seconds will be a terrible problem for
customers who have been trying to deploy RHEL3.
So I tried mode=0,1,2,3,4,5 and 6, but I could not find a mode that
loses no packets. I also tried changing the downdelay and updelay
parameters.
*In the end, I find it hard to trust the bonding.*
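For reference, the bonding mode and delay parameters mentioned above are set as module options, e.g. in /etc/modules.conf on a 2.4-based RHEL3 system. A minimal sketch; the interface name and values are illustrative, not taken from this report:

```
# /etc/modules.conf -- illustrative values only
alias bond0 bonding
# mode=1 is active-backup; miimon enables MII link monitoring (in ms);
# downdelay/updelay are also in ms and should be multiples of miimon.
options bond0 mode=1 miimon=100 downdelay=200 updelay=200
```

After editing, the bonding module must be reloaded (or the machine rebooted) for new option values to take effect.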
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Disconnect the network cable from the NIC.
2. Re-connect the network cable to the NIC.
3. While the connection recovers, about 50% of packets are lost.
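The loss figure in step 3 can be measured with a continuous ping run across the failure window. A rough sketch; the target address and the sent/received counts below are assumptions for illustration, not values from this report:

```shell
#!/bin/sh
# Hypothetical test target -- replace with a host on the bonded network.
TARGET=${TARGET:-192.168.0.1}

# Helper: compute percentage packet loss from sent/received counts.
loss_pct() {
    sent=$1; received=$2
    awk -v s="$sent" -v r="$received" 'BEGIN { printf "%.0f", (s - r) * 100 / s }'
}

# 1. Start a continuous ping in the background, then pull the cable:
#      ping -i 0.2 "$TARGET" > /tmp/bondtest.log &
# 2. Re-plug the cable, wait ~30s, then stop the ping and check its summary:
#      kill %1; grep 'packet loss' /tmp/bondtest.log
# A clean failover should show ~0% loss; this report saw roughly:
loss_pct 200 100    # prints 50
```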
That's really strange.
For RHEL3 we (Rik for kernel and myself for userland) made very sure
that the whole ipbonding parts were updated to the latest version.
I have had several very successful reports of bonding from customers
and have even used it myself a couple of times. And the failover usually
goes very fast with default settings (meaning no special parameters
set).
I'll contact one of our consulting guys who set up a bonding and
failover solution at a customer and let you know what exactly he did.
My only guess is that the drivers used are buggy. So the setup seems
to work fine, but during operation you have problems which also leads
me to believe that it's more of a kernel problem than a userland
problem. Userland isn't involved at all after the setup is done, so
i'm reassigning this bug to kernel.
Read ya, Phil
PS: For the kernel folks, please add the exact hardware you run your
tests on, that might already give them a clue.
Hello, please add the relevant hardware/controller information
to this bug report. Thank you.
The Network Controller is Intel PWLA8492MT(Dual Port).
Cisco3750 : Gigabit switch, using RJ-45 connector.
Currently, I have found the best mode: active-backup with updelay=20000!
Even when packet loss happened, the loss was only one or two packets.
The other modes still show long failover times.
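On a running system, the currently active slave can be read from /proc/net/bonding/bond0, which is useful for watching when a failover actually completes. A small sketch that parses that format; the sample text is illustrative, and the field names follow later bonding drivers (the early 2.4 driver's output may differ slightly):

```shell
# Extract the active slave from bonding status text, e.g.:
#   active_slave < /proc/net/bonding/bond0
active_slave() {
    grep -i 'Currently Active Slave' | awk -F': ' '{print $2}'
}

# Abridged sample of the /proc file's layout (illustrative):
sample='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth4
MII Status: up'

printf '%s\n' "$sample" | active_slave    # prints eth4
```

Polling this in a loop while pulling and re-plugging cables shows how long the driver takes to switch slaves, which is what the updelay setting above is trading off against.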
I definitely get poor/inconsistent behaviour w/ bonding and this card
using the RHEL3 U1 kernel. Later kernels seem to work a lot better.
Can you recreate this problem using a later kernel? (e.g. RHEL3 U3)
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.3.EL).
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.