Bug 236769 - bonding driver fails to forward traffic after link down event
Summary: bonding driver fails to forward traffic after link down event
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 236770
Depends On:
Blocks:
 
Reported: 2007-04-17 16:04 UTC by Jon Stanley
Modified: 2014-06-29 22:58 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-04-07 15:24:47 UTC
Target Upstream Version:
Embargoed:



Description Jon Stanley 2007-04-17 16:04:05 UTC
We are using NIC bonding under the following configuration: 
 
eth0 = first port of first Intel Quad port Gigabit Ethernet controller. 
eth4 = first port of second Intel Quad port Gigabit Ethernet controller. 
bond0 = bonded master interface for eth0+eth4. 
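 
For reference, the bonding setup is roughly the following (a sketch from our 
standard template; the miimon value and exact file contents are from memory 
rather than copied off the affected hosts): 
 
# /etc/modprobe.conf 
alias bond0 bonding 
options bond0 mode=active-backup miimon=100 
 
# /etc/sysconfig/network-scripts/ifcfg-bond0 
DEVICE=bond0 
BOOTPROTO=none 
ONBOOT=yes 
IPADDR=... 
 
# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth4 identical apart from DEVICE) 
DEVICE=eth0 
MASTER=bond0 
SLAVE=yes 
ONBOOT=yes 
BOOTPROTO=none 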
 
When the eth0/eth4 cables were disconnected and reconnected to the same ports 
on a running system, we experienced loss of network connectivity on the bond0 
interface. 
 
Upon cable reconnect, ethtool showed valid link status as well as correct 
speed and duplex settings (1000/FDX) for the slave interfaces. 
 
bond0 would not recognize a valid slave interface after reconnect. 
 
"grep 'Active Slave' /proc/net/bonding/bond0" produced the following result: 
Currently Active Slave: None. 
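 
For anyone trying to reproduce, the post-reconnect state can be checked with 
something like this (illustrative commands, not a transcript from the affected 
hosts): 
 
cat /proc/net/bonding/bond0 
ethtool eth0 | grep -E 'Speed|Duplex|Link detected' 
ethtool eth4 | grep -E 'Speed|Duplex|Link detected' 
 
In our case both ethtool checks looked healthy (1000Mb/s, full duplex, link 
detected) while the bond still reported no active slave. 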
 
Running "ifconfig <if> down" and "ifconfig <if> up" on each slave interface 
(eth0 and eth4) had no effect. 
 
"ifconfig bond0" showed the correct IP address assigned to the bond0 interface, 
but IP traffic failed to reach the system over bond0. 

A reboot of the servers restored service.

Comment 1 Jason Baron 2007-06-15 19:06:15 UTC
*** Bug 236770 has been marked as a duplicate of this bug. ***

Comment 2 Andy Gospodarek 2007-06-15 20:30:26 UTC
Have you tried this with any later kernels?  There have been quite a few bonding
fixes since U2:

-bonding: link status not always reported correctly (Andy Gospodarek) [212392]
-bonding: fix primary interface initialization problem with active-backup bond
(Andy Gospodarek) [208362]
-bonding: use signed type to catch return code from ->get_settings (John
Linville) [196068]
-bonding: back-out sysfs updates (John Linville) [194410]
-Introduce netpoll over bonded interfaces (Thomas Graf) [174184 126164 190162
146164]
-bonding: allow vlan traffic over bond (John Linville) [174671]
-fix race in net bonding driver (Kimball Murray) [188296]

Would you be willing to try one of my test kernels?

http://people.redhat.com/agospoda/#rhel4

I'm not surprised that 'ifdown/up' on the individual slave interfaces didn't
make a difference after reconnecting the cables.  I would be curious about what
happened when running 'ifdown/up' on the bond0 interface after failure.
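
Something along these lines should tear down and rebuild the whole bond,
slaves included (a sketch assuming the standard initscripts setup; adjust if
you enslave the interfaces by hand):

ifdown bond0
ifup bond0

# and if that is not enough, reload the bonding module entirely:
service network stop
rmmod bonding
service network start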

I'd also be curious if this is a problem when using 2 ports on one of the cards
rather than one port on each card.  

I know neither of these is a reasonable long-term solution, but I'd like to
understand the problem better and am curious if either change makes a difference.

Comment 3 Jon Stanley 2007-06-15 20:49:15 UTC
Well, those are extremely good questions :-).  Unfortunately, I cannot 
reproduce the issue at will - there is some factor causing it that must be 
separate from the link-down events (though those are definitely the trigger). 
The problem has turned out not to be reproducible in my lab environment, where 
I would be more than willing to try the test kernels.  In production, however, 
I can't really try a test kernel.

This was discovered during a physical cable audit in one of the datacenters, 
which has about 200 hosts.  We saw this behavior on 12 of them.

It's accompanied by a flurry of syslog messages (once per second) noting 
'backup interface eth4 is now up'.
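
For anyone auditing a similar fleet, counting those messages is a quick way to 
flag affected hosts (a rough sketch; assumes the default syslog destination): 

grep -c 'backup interface eth4 is now up' /var/log/messages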

Comment 4 Andy Gospodarek 2008-04-07 15:24:47 UTC
Jon, this bug has seen no activity in quite a while, so I can only presume it is
no longer a problem.  If there is still an issue that needs to be resolved,
please re-open this bug and I will be happy to help resolve it.  Thank you.


Comment 5 Jon Stanley 2008-04-07 16:56:22 UTC
Yep, the bug certainly went stale.  Like I said, I'm not able to reproduce it at
will.  If I find a way, I'll come back here :)  Did my Fedora cleanup have
anything to do with this?  Strange timing :).

Comment 6 Andy Gospodarek 2008-04-07 21:00:37 UTC
Jon, your Fedora cleanup might have helped motivate me, but trolling through
all my stale BZs has been on my list for a while -- I just finally got out
from under my patch backlog and was able to do it. :)

