Description of problem:
ARP packets are not received by backup slaves, which breaks bonding when arp_validate=3 is used.

The upstream fix is:

commit f5b2b966f032f22d3a289045a5afd4afa09f09c6
Author: Jay Vosburgh <fubar.com>
Date:   Fri Sep 22 21:54:53 2006 -0700

    [PATCH] bonding: Validate probe replies in ARP monitor

    Add logic to check ARP request / reply packets used for ARP monitor
    link integrity checking.

    The current method simply examines the slave device to see if it has
    sent and received traffic; this can be fooled by extraneous traffic.
    For example, if multiple hosts running bonding are behind a common
    switch, the probe traffic from the multiple instances of bonding will
    update the tx/rx times on each other's slave devices.

    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jeff>

and there is a chunk doing the following:

@@ -1025,6 +1026,10 @@ static inline int skb_bond_should_drop(struct sk_buff *skb)
 	if (master && (dev->priv_flags & IFF_SLAVE_INACTIVE)) {
+		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
+		    skb->protocol == __constant_htons(ETH_P_ARP))
+			return 0;
+

The chunk above is missing from RHEL-5 kernels.

Version-Release number of selected component (if applicable):
2.6.18.128.el5

How reproducible:
Always

Steps to Reproduce:
1. Set up bonding using arp_validate=3.
2. Check the backup slave status in /proc/net/bonding/bond0 (down).
3. Force a failover.

Actual results:
Backup slaves are always down, and if a failover happens it can bounce slaves.

Expected results:
Backup slaves are up if the link is okay, and failover works.

Additional info:
Patch tested locally and by the customer with good feedback.
(RHEL-4 ticket is bz#480237)
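To illustrate what the missing chunk changes, here is a minimal userspace sketch of the drop decision for an inactive slave. It is not the kernel code: the MODEL_* names, the flag bit values, struct model_dev and the "patched" switch are all hypothetical and exist only for this illustration, and the other special cases in skb_bond_should_drop() are left out.

```c
#include <arpa/inet.h>  /* htons */
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's priv_flags bits (values are made up). */
#define MODEL_SLAVE_INACTIVE 0x1u
#define MODEL_SLAVE_NEEDARP  0x2u
#define MODEL_ETH_P_ARP      0x0806  /* EtherType for ARP */

/* Hypothetical, reduced view of a net_device: only what the check needs. */
struct model_dev {
    int has_master;        /* non-zero if the device is enslaved to a bond */
    unsigned priv_flags;   /* MODEL_SLAVE_* bits */
};

/*
 * Return 1 if a received frame would be dropped, 0 if it would be delivered.
 * protocol_be is the frame's EtherType in network byte order.
 * patched selects whether the ARP exemption from the upstream chunk applies.
 */
int model_should_drop(const struct model_dev *dev, uint16_t protocol_be,
                      int patched)
{
    if (dev->has_master && (dev->priv_flags & MODEL_SLAVE_INACTIVE)) {
        /* The chunk missing on RHEL-5: let ARP through on a backup
         * slave that needs it for ARP validation. */
        if (patched &&
            (dev->priv_flags & MODEL_SLAVE_NEEDARP) &&
            protocol_be == htons(MODEL_ETH_P_ARP))
            return 0;

        return 1;  /* everything else on an inactive slave is dropped */
    }
    return 0;      /* active slaves and non-bonded devices deliver normally */
}
```

With patched set to 0, an ARP reply arriving on a backup slave is dropped, so the ARP monitor never sees a valid probe reply on that slave, which matches the "always down" symptom described above. With patched set to 1, the same frame is delivered while other traffic on the backup slave is still dropped.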
Created attachment 331063 proposed patch
Updating PM score.
test kernel rpms here: http://people.redhat.com/jpirko/test/bz484304/
I've set up a small testing environment. I have the following:

Host1: 2x Realtek 8139 NIC running Ubuntu 8.04.2 (10.22.33.1/24) (it has this patch in)
Host2: 2x Realtek 8239 NIC running RHEL5.3 (10.22.33.2/24)

These two machines are connected via two crossover UTP cables, eth1 to eth1 and eth2 to eth2. I have bonding set up as follows:

Host1:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.2

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:a1:b0:00:42:16

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:05:1d:ee:9f:31

---

Host2:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.1

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:2f:22

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:17:69

---

Now if I unplug the cable from eth2 (on either host), eth1 goes to state up and eth2 goes to down on both hosts. When running ping at the same time I lose only a few packets.

I'm trying this with the 128.1.1 kernel and with the patched kernel I posted a link to in comment #3, and I get exactly the same result with both. I'm going to try this with the upstream kernel.

If I understand it correctly, both eth1 and eth2 should be in state up? Am I doing something wrong?
ad comment #4 - yes, upstream kernel behaviour is the same
ad comment #5

I changed the network configuration and put a switch in between, so now all four NICs of both hosts are connected to the switch. The behaviour of the upstream kernel, the Ubuntu kernel and my patched kernel changed: both slaves are now in state "up". With the original 128.1.1 kernel, the inactive slave is in state "down".

So I have successfully verified that this patch works.
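The per-slave states compared in these tests can also be checked programmatically. The following is a rough sketch, assuming the text has the same layout as the /proc/net/bonding/bond0 output quoted earlier in this report; count_down_slaves() is a hypothetical helper written for this illustration, not part of any bonding tool, and it works on a string so that it can be fed either the file contents or a saved copy.

```c
#include <string.h>

/*
 * Count slaves whose "MII Status:" value is anything other than "up".
 * The bond-level "MII Status:" line, which appears before the first
 * "Slave Interface:" line, is ignored.
 */
int count_down_slaves(const char *text)
{
    static const char slave_key[] = "Slave Interface:";
    static const char mii_key[] = "MII Status:";
    const size_t slave_len = sizeof(slave_key) - 1;
    const size_t mii_len = sizeof(mii_key) - 1;
    int in_slave = 0;
    int down = 0;
    const char *p = text;

    while (*p) {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);

        if (len >= slave_len && strncmp(p, slave_key, slave_len) == 0) {
            in_slave = 1;  /* following lines describe a slave */
        } else if (in_slave && len >= mii_len &&
                   strncmp(p, mii_key, mii_len) == 0) {
            const char *v = p + mii_len;
            const char *end = p + len;

            /* Trim blanks around the value. */
            while (v < end && (*v == ' ' || *v == '\t'))
                v++;
            while (end > v && (end[-1] == ' ' || end[-1] == '\t' ||
                               end[-1] == '\r'))
                end--;

            if (!((size_t)(end - v) == 2 && strncmp(v, "up", 2) == 0))
                down++;
        }

        p += len;
        if (*p == '\n')
            p++;
    }
    return down;
}
```

Applied to either host's output quoted in the earlier comment (eth1 down, eth2 up), this would return 1. In the switched setup with a patched kernel, where both slaves report "up", it would return 0.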
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-133.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
*** Bug 453113 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html