Bug 484304

Summary: [RHEL-5.3] ARP packets aren't received by backup slaves breaking arp_validate=3
Product: Red Hat Enterprise Linux 5
Reporter: Flavio Leitner <fleitner>
Component: kernel
Assignee: Jiri Pirko <jpirko>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Priority: urgent
Version: 5.3
CC: agospoda, ahecox, dhoward, jplans, pbatkowski, qcai, rkhan, tao, tgraf, yann.le-vot
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:08:38 UTC
Bug Blocks: 488064
Attachments: proposed patch (flags: none)

Description Flavio Leitner 2009-02-05 23:02:15 UTC
Description of problem:
ARP packets aren't received by backup slaves, which breaks bonding when using
arp_validate=3.

The upstream fix is:
commit f5b2b966f032f22d3a289045a5afd4afa09f09c6
Author: Jay Vosburgh <fubar.com>
Date:   Fri Sep 22 21:54:53 2006 -0700

    [PATCH] bonding: Validate probe replies in ARP monitor

        Add logic to check ARP request / reply packets used for ARP
    monitor link integrity checking.

        The current method simply examines the slave device to see if it
    has sent and received traffic; this can be fooled by extraneous traffic.
    For example, if multiple hosts running bonding are behind a common
    switch, the probe traffic from the multiple instances of bonding will
    update the tx/rx times on each other's slave devices.

    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jeff>

and there is a chunk doing the following:
@@ -1025,6 +1026,10 @@ static inline int skb_bond_should_drop(struct sk_buff *skb)

        if (master &&
            (dev->priv_flags & IFF_SLAVE_INACTIVE)) {
+               if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
+                   skb->protocol == __constant_htons(ETH_P_ARP))
+                       return 0;
+


The hunk above is missing from RHEL-5 kernels.
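
To illustrate what the missing hunk changes, here is a minimal user-space sketch of the drop decision. It is a model only, not the kernel code: the `model_*` names, the `with_fix` switch and the flag bit values are hypothetical and chosen for illustration, and it assumes the bonding driver sets the NEEDARP flag on backup slaves when arp_validate requires it. Any other exceptions the real skb_bond_should_drop() may make are left out.

```c
#include <arpa/inet.h>   /* htons */
#include <stdint.h>

/* Illustrative flag bits; these are NOT the kernel's actual IFF_* values. */
#define MODEL_SLAVE_INACTIVE 0x1u
#define MODEL_SLAVE_NEEDARP  0x2u

#define MODEL_ETH_P_IP  0x0800  /* EtherType for IPv4 */
#define MODEL_ETH_P_ARP 0x0806  /* EtherType for ARP */

/* Hypothetical stand-in for the parts of struct net_device used here. */
struct model_dev {
    int has_master;        /* non-zero if the device is enslaved to a bond */
    unsigned priv_flags;   /* combination of the MODEL_SLAVE_* bits */
};

/*
 * Returns 1 if a frame received on dev should be dropped, 0 if it should
 * be delivered. protocol is in network byte order, like skb->protocol.
 * with_fix selects between the behaviour without the hunk (0) and with
 * the hunk applied (1).
 */
int model_should_drop(const struct model_dev *dev, uint16_t protocol,
                      int with_fix)
{
    if (dev->has_master && (dev->priv_flags & MODEL_SLAVE_INACTIVE)) {
        /* The added hunk: let ARP through on slaves that need it. */
        if (with_fix &&
            (dev->priv_flags & MODEL_SLAVE_NEEDARP) &&
            protocol == htons(MODEL_ETH_P_ARP))
            return 0;

        /* Everything else arriving on an inactive slave is dropped. */
        return 1;
    }
    return 0;
}
```

In this model, a backup slave with both flags set drops an ARP frame when `with_fix` is 0 and delivers it when `with_fix` is 1, while non-ARP frames are dropped either way. That matches the symptom reported here: without the hunk the ARP monitor never sees replies on backup slaves.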

Version-Release number of selected component (if applicable):
2.6.18-128.el5

How reproducible:
Always

Steps to Reproduce:
1. Set up bonding using arp_validate=3.
2. Check the backup slave status in /proc/net/bonding/bond0 (it shows down).
3. Force a failover.
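
The status check in step 2 could be automated along these lines. This is a rough sketch, assuming the text has the "Slave Interface:" / "MII Status:" layout that /proc/net/bonding/bond0 shows elsewhere in this report; the function name bond_slave_status is hypothetical and not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up the "MII Status" reported for one slave in text formatted like
 * /proc/net/bonding/bond0. Returns 0 and fills status on success, or -1
 * if the slave or its status line is not found.
 */
int bond_slave_status(const char *text, const char *slave,
                      char *status, size_t status_len)
{
    char line[256];
    char word[64];
    int in_slave = 0;          /* are we inside the wanted slave's section? */
    const char *p = text;

    while (*p) {
        /* Copy one line (without its newline) into line[]. */
        size_t n = strcspn(p, "\n");
        size_t copy = n < sizeof(line) - 1 ? n : sizeof(line) - 1;
        memcpy(line, p, copy);
        line[copy] = '\0';
        p += n;
        if (*p == '\n')
            p++;

        /* A "Slave Interface:" line starts a new per-slave section. */
        if (sscanf(line, "Slave Interface: %63s", word) == 1) {
            in_slave = (strcmp(word, slave) == 0);
            continue;
        }
        /* The bond-level "MII Status:" line is skipped: in_slave is 0. */
        if (in_slave && sscanf(line, "MII Status: %63s", word) == 1) {
            snprintf(status, status_len, "%s", word);
            return 0;
        }
    }
    return -1;
}
```

A caller would read /proc/net/bonding/bond0 into a buffer and pass it in together with the backup slave's name; on an affected kernel the reported status for the backup slave would be "down".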
  
Actual results:
Backup slaves are always down, and a failover can cause slaves to bounce.

Expected results:
Backup slaves are up if the link is okay, and failover works.

Additional info:
Patch tested locally and by customer with good feedback.
(RHEL-4 ticket is bz#480237)

Comment 1 Flavio Leitner 2009-02-05 23:05:18 UTC
Created attachment 331063 [details]
proposed patch

Comment 2 RHEL Program Management 2009-02-16 15:34:24 UTC
Updating PM score.

Comment 3 Jiri Pirko 2009-02-23 14:08:02 UTC
test kernel rpms here: http://people.redhat.com/jpirko/test/bz484304/

Comment 5 Jiri Pirko 2009-02-24 14:58:49 UTC
I've set up a small testing environment with the following:
Host1: 2x Realtek 8139 NIC running Ubuntu 8.04.2 (10.22.33.1/24) (its kernel includes this patch)
Host2: 2x Realtek 8239 NIC running RHEL5.3 (10.22.33.2/24)
These 2 machines are connected via two crossover UTP cables, eth1 to eth1, eth2 to eth2.

I have bonding set up as follows:
Host1:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.2

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:a1:b0:00:42:16

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:05:1d:ee:9f:31
---

Host2:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.1

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:2f:22

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:17:69
---

Now if I unplug the cable from eth2 (on either host), eth1 goes to state up and eth2 goes to state down on both hosts. When running ping at the same time, I lose only a few packets. I'm trying this with the 128.1.1 kernel and with the patched kernel I linked in comment #3, and I get exactly the same result. I'm going to try this with the upstream kernel. If I understand correctly, both eth1 and eth2 should be in state up? Am I doing something wrong?

Comment 6 Jiri Pirko 2009-02-24 17:50:00 UTC
Re comment #4: yes, the upstream kernel behaves the same.

Comment 7 Jiri Pirko 2009-02-25 11:30:19 UTC
Re comment #5:

So I changed the network configuration. I put a switch in between, so now all four NICs of both hosts are connected to the switch. The behaviour of the upstream kernel, the Ubuntu kernel and my patched one changed: both slaves are now in state "up". With the original 128.1.1 kernel, the inactive slave is in state "down". So I have successfully verified that this patch works.

Comment 9 RHEL Program Management 2009-02-27 14:28:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Don Zickus 2009-03-04 20:01:49 UTC
in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 15 Neil Horman 2009-08-07 19:30:30 UTC
*** Bug 453113 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2009-09-02 08:08:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html