Bug 226960

Summary: bonding with arp monitoring unreliable
Product: Red Hat Enterprise Linux 4
Reporter: nicholas <nicholas>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.4
CC: jbaron, kajtzu, linville, peterm, tgraf
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-02-12 14:23:45 UTC

Description nicholas 2007-02-02 09:37:38 UTC
Description of problem:
eth0 and eth1 are each connected to an independent unmanaged switch, and both
switches reach the gateway through the backbone. When the switch uplinks are
broken, only one interface detects it:
bonding: bond0: backup interface eth0 is now down
bonding: bond0: backup interface eth0 is now up

If the bonding module is reloaded, there is a 50/50 chance whether it is this
interface or the other one that detects the upstream link breakage.

Consequence: high availability is reduced, since if the uplink in the
unmonitored path breaks, the operating system does not detect it and the system
goes offline.

Manually failing over active interface works as expected:
real:~# ifenslave -c bond0 eth0
real:~# fping ping.uio.no
ping.uio.no is alive
real:~# ifenslave -c bond0 eth1
real:~# fping ping.uio.no
ping.uio.no is alive
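
The manual check above can be repeated for every slave with a small helper. This is only a sketch: the function name check_failover is made up for illustration, it must run as root on the bonded host, and it assumes that fping exits with status 0 when the target answers. It uses the same ifenslave -c and fping commands as the transcript.

```shell
#!/bin/sh
# Fail over to each slave in turn and check that a target host still answers.
# Usage: check_failover BOND TARGET SLAVE...
# (check_failover is a hypothetical helper name, not part of any package.)
check_failover() {
    bond=$1
    target=$2
    shift 2
    rc=0
    for slave in "$@"; do
        # Make this slave the active one, as done by hand in the report.
        if ! ifenslave -c "$bond" "$slave"; then
            echo "$slave: failover failed"
            rc=1
            continue
        fi
        # Assumption: fping returns 0 when the target is reachable.
        if fping "$target" >/dev/null 2>&1; then
            echo "$slave: $target reachable"
        else
            echo "$slave: $target unreachable"
            rc=1
        fi
    done
    return $rc
}
```

For the setup in this report it would be called as `check_failover bond0 ping.uio.no eth0 eth1`. Note that with primary=eth1 configured, the bonding driver may switch back to the primary slave on its own afterwards.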

I'm thinking it's either a) the lack of arp_validate in the current Red Hat
kernel, since there is a lot of ARP traffic on these unmanaged switches even
when they have no uplink,
http://article.gmane.org/gmane.linux.kernel.commits.head/87685/match=bonding+arp
or b) some specific x86_64 bug like
http://article.gmane.org/gmane.linux.kernel.commits.mm/3877/match=arp+monitoring+broken+x86+64

Version-Release number of selected component (if applicable):
real:~# uname -a
Linux real 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 23 12:49:51 EST 2007 x86_64 x86_64
x86_64 GNU/Linux


How reproducible:
always

Steps to Reproduce:
1. break uplink from unmanaged switch a
2. wait for bonding code reaction

  
Actual results:
No response on 1 of 2 interfaces

Expected results:
bonding code response on both interfaces

Additional info:

configuration
real:~# cat /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup arp_interval=1000 arp_ip_target= 80.232.38.65 \ 
primary=eth1
alias eth0 e1000
alias eth1 e1000

real:~# cat /etc/sysconfig/network-scripts/ifcfg-bond0 
DEVICE=bond0
BOOTPROTO=static
BROADCAST=80.232.38.127
IPADDR=80.232.38.80
NETMASK=255.255.255.192
NETWORK=80.232.38.64
GATEWAY=80.232.38.65
ONBOOT=yes
BONDING_MASTER="yes"
BONDING_SLAVE0="eth0"
BONDING_SLAVE1="eth1" 

real:~# cat /etc/sysconfig/network-scripts/ifcfg-eth[0,1]
DEVICE=eth0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
DEVICE=eth1
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
real:~# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 4
Permanent HW addr: 00:14:4f:01:90:80

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:14:4f:01:90:81
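
To compare the two slaves while testing, the per-slave fields of this status file can be condensed to one line each. A minimal sketch, assuming the file layout shown above; the function name bond_slave_status is invented here for illustration.

```shell
#!/bin/sh
# Print "<slave> <MII status> <link failure count>" for every slave listed
# in a bonding status file (default: /proc/net/bonding/bond0).
# (bond_slave_status is a hypothetical helper name.)
bond_slave_status() {
    awk -F': ' '
        # Remember which slave section we are in.
        /^Slave Interface:/    { slave = $2 }
        # Ignore the bond-wide "MII Status" line that precedes the slaves.
        /^MII Status:/         { if (slave != "") status = $2 }
        # The failure count closes the per-slave data we care about.
        /^Link Failure Count:/ { if (slave != "") print slave, status, $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

For the output above this prints `eth0 up 4` and `eth1 up 1`. Running it before and after breaking an uplink shows which slave's failure count moved.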

Please let me know if I can provide any more details or if there is some more
testing I should do.

Comment 1 Chuck Ebbert 2007-02-05 23:00:57 UTC
Just happened to see this:

http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f8a8ccd56d82bd4f4b5c7c2e7eb758c7764d98e1

Looks like the bug's been around for a long time.


Comment 2 Andy Gospodarek 2007-02-09 20:47:53 UTC
I backported the arp validation code a while back and in the process discovered
the x86_64 problem (that really only was a problem when using the arp-validate
code).  The base functionality and fix for this is already included in test
kernels here:

http://people.redhat.com/agospoda/#rhel4

Please give them a try and report back the results.


Comment 3 nicholas 2007-02-12 09:12:30 UTC
With this kernel the system in question notices uplink breakage on both links.

Bug fixed :-D

real:~# cat /etc/modprobe.conf | grep bond0  
alias bond0 bonding
options bond0 mode=active-backup arp_interval=1000 arp_ip_target=80.232.38.65
primary=eth1 arp_validate=all
real:~# uname -a
Linux real.nhst.kunder.linpro.no 2.6.9-45.EL.gtest.9smp #1 SMP Thu Feb 1
13:33:00 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

bonding: bond0: making interface eth0 the new active one.
bonding: bond0: changing from interface eth0 to primary interface eth1
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: link status down for active interface eth1, disabling it
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: backup interface eth1 is now up
bonding: bond0: changing from interface eth0 to primary interface eth1
bonding: bond0: making interface eth1 the new active one.
bonding: bond0: backup interface eth0 is now down
bonding: bond0: backup interface eth0 is now up
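
To confirm that both slaves report link-down events, the kernel log can be tallied per interface. A rough sketch that assumes the message wording shown in this log ("is now down" and "link status down"); the function name bond_down_events is made up for illustration.

```shell
#!/bin/sh
# Count link-down messages from the bonding driver per slave interface.
# Reads log files given as arguments, or standard input if none are given.
# (bond_down_events is a hypothetical helper name.)
bond_down_events() {
    awk '
        /bonding:/ && (/is now down/ || /link status down/) {
            # The slave name is the word following "interface".
            for (i = 1; i < NF; i++)
                if ($i == "interface") {
                    name = $(i + 1)
                    sub(/,$/, "", name)   # strip the comma in "eth1,"
                    count[name]++
                }
        }
        END { for (name in count) print name, count[name] }
    ' "$@" | sort
}
```

It can be fed from the kernel ring buffer, for example `dmesg | bond_down_events`. A slave that never appears in the output did not log a link-down event.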


Comment 4 Andy Gospodarek 2007-02-12 14:09:25 UTC
Great!  Thanks for the feedback.

Comment 5 Andy Gospodarek 2007-02-12 14:23:45 UTC
*** This bug has been marked as a duplicate of 223100 ***
