Red Hat Bugzilla – Bug 114355
Bonding has been taking a LONG TIME to FAIL OVER!
Last modified: 2007-11-30 17:07:00 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5)
Description of problem:
I have a serious problem with bonding!
A customer has deployed bonding for network connection failover.
But the bonding does not recover the connection cleanly after the following sequence:
Step1) eth0: connected, eth1: connected
Step2) eth0: disconnected, eth1: connected
Step3) eth0: recovering the connection, eth1: connected
==> *"bond0" showed about 50% packet loss for 20-30 seconds.*
Why did "bond0" lose about 50% of its packets for 20-30 seconds after
re-connecting the NIC cable?
--- additional story -------------------------------
I set up network bonding on RHEL3 as below
and tested the fault tolerance of the network connection.
Step1) eth4: connected  eth5: connected
Step2) eth4: disconnected  eth5: connected
Step3) eth4: connected  eth5: connected
Step4) eth4: connected  eth5: disconnected
Step5) eth4: connected  eth5: connected
The curious result was some packet loss at Step3 and Step5, lasting
about 20-30 seconds after the network connection was restored. Any
packet loss lasting 20-30 seconds will be a terrible problem for
customers who have been trying to deploy RHEL3.
So I tried mode=0,1,2,3,4,5 and 6, but I could not find a mode that
loses no packets. I also tried changing the downdelay and updelay
parameters.
*In the end, I find it hard to trust the bonding.*
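For reference, the bonding mode and delay parameters mentioned above are set as module options, e.g. in /etc/modules.conf on a 2.4-based RHEL3 system. A minimal sketch; the interface name and values are illustrative, not taken from this report:

```
# /etc/modules.conf -- illustrative values only
alias bond0 bonding
# mode=1 is active-backup; miimon enables MII link monitoring (in ms);
# downdelay/updelay are also in ms and should be multiples of miimon.
options bond0 mode=1 miimon=100 downdelay=200 updelay=200
```

After editing, the bonding module must be reloaded (or the machine rebooted) for new option values to take effect.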
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Disconnect the network cable from the NIC.
2. Re-connect the network cable to the NIC.
3. While the connection recovers, about 50% of packets are lost.
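The loss figure in step 3 can be measured with a continuous ping run across the failure window. A rough sketch; the target address and the sent/received counts below are assumptions for illustration, not values from this report:

```shell
#!/bin/sh
# Hypothetical test target -- replace with a host on the bonded network.
TARGET=${TARGET:-192.168.0.1}

# Helper: compute percentage packet loss from sent/received counts.
loss_pct() {
    sent=$1; received=$2
    awk -v s="$sent" -v r="$received" 'BEGIN { printf "%.0f", (s - r) * 100 / s }'
}

# 1. Start a continuous ping in the background, then pull the cable:
#      ping -i 0.2 "$TARGET" > /tmp/bondtest.log &
# 2. Re-plug the cable, wait ~30s, then stop the ping and check its summary:
#      kill %1; grep 'packet loss' /tmp/bondtest.log
# A clean failover should show ~0% loss; this report saw roughly:
loss_pct 200 100    # prints 50
```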
That's really strange.
For RHEL3 we (Rik for kernel and myself for userland) made very sure
that the whole ipbonding parts were updated to the latest version.
I have had several very successful reports of bonding from customers
and have even used it myself a couple of times. And the failover usually
goes very fast with default settings (meaning no special parameters
set).
I'll contact one of our consulting guys who set up a bonding and
failover solution at a customer and let you know what exactly he did.
My only guess is that the drivers used are buggy. So the setup seems
to work fine, but during operation you have problems which also leads
me to believe that it's more of a kernel problem than a userland
problem. Userland isn't involved at all after the setup is done, so
i'm reassigning this bug to kernel.
Read ya, Phil
PS: For the kernel folks, please add the exact hardware you run your
tests on, that might already give them a clue.
Hello, please add the relevant hardware/controller information
to this bug report. Thank you.
The Network Controller is Intel PWLA8492MT(Dual Port).
Cisco3750 : Gigabit switch, using RJ-45 connector.
Currently, I have found the best mode: active-backup with updelay=20000!
Even when packet loss happened, the loss was only one or two packets.
The other modes still show long failover times.
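On a running system, the currently active slave can be read from /proc/net/bonding/bond0, which is useful for watching when a failover actually completes. A small sketch that parses that format; the sample text is illustrative, and the field names follow later bonding drivers (the early 2.4 driver's output may differ slightly):

```shell
# Extract the active slave from bonding status text, e.g.:
#   active_slave < /proc/net/bonding/bond0
active_slave() {
    grep -i 'Currently Active Slave' | awk -F': ' '{print $2}'
}

# Abridged sample of the /proc file's layout (illustrative):
sample='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth4
MII Status: up'

printf '%s\n' "$sample" | active_slave    # prints eth4
```

Polling this in a loop while pulling and re-plugging cables shows how long the driver takes to switch slaves, which is what the updelay setting above is trading off against.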
I definitely get poor/inconsistent behaviour w/ bonding and this card
using the RHEL3 U1 kernel. Later kernels seem to work a lot better.
Can you recreate this problem using a later kernel? (e.g. RHEL3 U3)
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.3.EL).
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.