We are using NIC bonding with the following configuration:

eth0 = first port of the first Intel quad-port Gigabit Ethernet controller
eth4 = first port of the second Intel quad-port Gigabit Ethernet controller
bond0 = bonded master interface for eth0 + eth4

When the eth0/eth4 cables were disconnected and reconnected to the same ports on a running system, we experienced loss of network connectivity on the bond0 interface. After the cables were reconnected, ethtool showed valid link status as well as correct speed and duplex settings (1000/FDX) for the slave interfaces, but bond0 would not recognize a valid slave interface.

"grep 'Active Slave' /proc/net/bonding/bond0" produced the following result:

Currently Active Slave: None

Running "ifconfig <if> down" and "ifconfig <if> up" for each slave interface (eth0 and eth4) did not change the result. "ifconfig bond0" showed the correct IP address assigned to the bond0 interface, but IP traffic failed to reach the system over bond0. A reboot of the servers restored service.
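The "Active Slave" check above can also be scripted, which may help when auditing many hosts. The following is a minimal sketch, not part of the original report: it parses the text of /proc/net/bonding/bond0 and relies only on the "Currently Active Slave:" line quoted above. The function name active_slave is hypothetical.

```python
from typing import Optional


def active_slave(status_text: str) -> Optional[str]:
    """Return the active slave named in bonding status text, or None.

    Only the 'Currently Active Slave:' line is inspected; a value of
    'None' (as seen in the report) or a missing line yields None.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Currently Active Slave":
            name = value.strip()
            return None if name in ("", "None") else name
    return None
```

On a host with the bonding driver loaded, one would pass in the contents of /proc/net/bonding/bond0, for example active_slave(open("/proc/net/bonding/bond0").read()). A None result corresponds to the failure state described in this report.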
*** Bug 236770 has been marked as a duplicate of this bug. ***
Have you tried this with any later kernels? There have been quite a few bonding fixes since U2:

- bonding: link status not always reported correctly (Andy Gospodarek) [212392]
- bonding: fix primary interface initialization problem with active-backup bond (Andy Gospodarek) [208362]
- bonding: use signed type to catch return code from ->get_settings (John Linville) [196068]
- bonding: back-out sysfs updates (John Linville) [194410]
- Introduce netpoll over bonded interfaces (Thomas Graf) [174184 126164 190162 146164]
- bonding: allow vlan traffic over bond (John Linville) [174671]
- fix race in net bonding driver (Kimball Murray) [188296]

Would you be willing to try one of my test kernels?

http://people.redhat.com/agospoda/#rhel4

I'm not surprised that 'ifdown/up' on the individual slave interfaces didn't make a difference after reconnecting the cables. I would be curious about what happened when running 'ifdown/up' on the bond0 interface after failure. I'd also be curious if this is a problem when using 2 ports on one of the cards rather than one port on each card. I know neither of these is a reasonable long-term solution, but I'd like to understand the problem better and am curious if either change makes a difference.
Well, those are extremely good questions :-). Unfortunately, I cannot reproduce the issue at will. Some factor other than the link-down events must be contributing to it, though the link-down events are definitely what trigger it. The problem has turned out not to be reproducible in my lab environment, where I would be more than willing to try the test kernels. In production, however, I can't really try a test kernel.

The problem was found during a physical cable audit in one of the datacenters, which has about 200 hosts. We saw this behavior on 12 of the hosts. It's accompanied by a flurry of messages in syslog (once per second) noting 'backup interface eth4 is now up'.
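To find affected hosts after such an audit, the repeated syslog message could be counted per interface. This is a rough sketch only: the full syslog line format is not given in this bug, so the pattern matches just the phrase quoted above, and the name count_backup_up is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches only the phrase quoted in the comment above; the rest of the
# log line (timestamp, host, driver prefix) is unknown and is ignored.
BACKUP_UP = re.compile(r"backup interface (\S+) is now up")


def count_backup_up(lines: Iterable[str]) -> Counter:
    """Count 'backup interface <if> is now up' messages per interface."""
    counts = Counter()
    for line in lines:
        match = BACKUP_UP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the host's syslog file (the path depends on the system's syslog configuration) would give a count per slave interface. A count that keeps growing at roughly one per second would match the symptom described here.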
Jon, this bug has seen no activity in quite a while, so I can only presume it is no longer a problem. If there is still an issue that needs to be resolved, please re-open this bug and I will be happy to help resolve it. Thank you.
Yep, the bug certainly went stale. Like I said, I'm not able to reproduce it at will. If I find a way, I'll come back here :) Did my Fedora cleanup have anything to do with this? Strange timing :).
Jon, your Fedora cleanup might have helped motivate me, but trolling through all my stale BZs has been on my list for a while -- I just finally got out from under my patch backlog and was able to do it. :)