Bug 484304 - [RHEL-5.3] ARP packets aren't received by backup slaves breaking arp_validate=3
Summary: [RHEL-5.3] ARP packets aren't received by backup slaves breaking arp_validate=3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jiri Pirko
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 453113
Depends On:
Blocks: 488064
Reported: 2009-02-05 23:02 UTC by Flavio Leitner
Modified: 2018-10-20 02:49 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:08:38 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed patch (1.94 KB, patch)
2009-02-05 23:05 UTC, Flavio Leitner


Links
System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Flavio Leitner 2009-02-05 23:02:15 UTC
Description of problem:
ARP packets aren't received by backup slaves breaking bonding when using 
arp_validate=3.

The upstream commit is:
commit f5b2b966f032f22d3a289045a5afd4afa09f09c6
Author: Jay Vosburgh <fubar.com>
Date:   Fri Sep 22 21:54:53 2006 -0700

    [PATCH] bonding: Validate probe replies in ARP monitor

        Add logic to check ARP request / reply packets used for ARP
    monitor link integrity checking.

        The current method simply examines the slave device to see if it
    has sent and received traffic; this can be fooled by extraneous traffic.
    For example, if multiple hosts running bonding are behind a common
    switch, the probe traffic from the multiple instances of bonding will
    update the tx/rx times on each other's slave devices.

    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jeff>

and there is a chunk doing the following:
@@ -1025,6 +1026,10 @@ static inline int skb_bond_should_drop(struct sk_buff *skb)

        if (master &&
            (dev->priv_flags & IFF_SLAVE_INACTIVE)) {
+               if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
+                   skb->protocol == __constant_htons(ETH_P_ARP))
+                       return 0;
+


The hunk above is missing from RHEL-5 kernels.
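
To make the effect of that hunk concrete, here is a minimal userspace sketch of the decision it changes. It is not the kernel function: the `model_*` types, the flag values and the `with_fix` switch are made up for illustration, and only the ARP condition itself is taken from the hunk. The rest of the inactive-slave branch, which the hunk does not show, is reduced here to "drop".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins. The flag values are arbitrary for this sketch and
 * are not claimed to match the kernel headers; 0x0806 is the ARP EtherType. */
#define MODEL_IFF_SLAVE_INACTIVE 0x1u
#define MODEL_IFF_SLAVE_NEEDARP  0x2u
#define MODEL_ETH_P_ARP          0x0806u

struct model_dev {
    bool has_master;          /* device is enslaved to a bond */
    unsigned int priv_flags;  /* MODEL_IFF_* bits */
};

struct model_skb {
    const struct model_dev *dev;
    uint16_t protocol;        /* EtherType, host byte order in this model */
};

/* Returns true when the frame would be dropped before reaching the stack.
 * with_fix selects whether the ARP exception from the hunk is present. */
static bool model_should_drop(const struct model_skb *skb, bool with_fix)
{
    assert(skb != NULL && skb->dev != NULL);
    const struct model_dev *dev = skb->dev;

    if (dev->has_master && (dev->priv_flags & MODEL_IFF_SLAVE_INACTIVE)) {
        /* The exception added upstream: an inactive slave that needs ARP
         * for validation is allowed to receive ARP frames. */
        if (with_fix &&
            (dev->priv_flags & MODEL_IFF_SLAVE_NEEDARP) &&
            skb->protocol == MODEL_ETH_P_ARP)
            return false;

        /* Simplification: everything else on an inactive slave is dropped. */
        return true;
    }
    return false;
}
```

In this model, an ARP frame arriving on a backup slave with both flags set is dropped when `with_fix` is false and kept when it is true, which matches the symptom reported here: without the hunk the backup slave never sees the ARP traffic it is supposed to validate.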

Version-Release number of selected component (if applicable):
2.6.18-128.el5

How reproducible:
Always

Steps to Reproduce:
1. Set up bonding with arp_validate=3.
2. Check the backup slave status in /proc/net/bonding/bond0 (it shows down).
3. Force a failover.
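
The check in step 2 can be automated. The following is a rough sketch, assuming the file uses the "Slave Interface:" and "MII Status:" line layout that the driver prints; `count_down_slaves` is a hypothetical helper name, and reading the file into a buffer is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count slaves reported as "MII Status: down" in text shaped like
 * /proc/net/bonding/bond0. The bond-level "MII Status" line, which appears
 * before the first "Slave Interface:" line, is deliberately ignored. */
static int count_down_slaves(const char *text)
{
    static const char slave_key[] = "Slave Interface:";
    static const char mii_down[] = "MII Status: down";
    int in_slave = 0;   /* set once the first slave section starts */
    int down = 0;

    assert(text != NULL);
    for (const char *line = text; *line != '\0'; ) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if (len >= sizeof slave_key - 1 &&
            strncmp(line, slave_key, sizeof slave_key - 1) == 0)
            in_slave = 1;
        else if (in_slave && len >= sizeof mii_down - 1 &&
                 strncmp(line, mii_down, sizeof mii_down - 1) == 0)
            down++;

        /* Advance to the next line, or to the terminating NUL. */
        line = end ? end + 1 : line + len;
    }
    return down;
}
```

On an affected kernel with a healthy link, a non-zero result for a bond using arp_validate=3 would be consistent with the problem described in this report.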
  
Actual results:
Backup slaves are always reported as down, and a failover can make the slaves bounce.

Expected results:
Backup slaves are up when the link is okay, and failover works.

Additional info:
Patch tested locally and by customer with good feedback.
(RHEL-4 ticket is bz#480237)

Comment 1 Flavio Leitner 2009-02-05 23:05:18 UTC
Created attachment 331063 [details]
proposed patch

Comment 2 RHEL Program Management 2009-02-16 15:34:24 UTC
Updating PM score.

Comment 3 Jiri Pirko 2009-02-23 14:08:02 UTC
test kernel rpms here: http://people.redhat.com/jpirko/test/bz484304/

Comment 5 Jiri Pirko 2009-02-24 14:58:49 UTC
I've set up a small test environment with the following hosts:
Host1: 2x Realtek 8139 NIC running Ubuntu 8.04.2 (10.22.33.1/24) (it has this patch in)
Host2: 2x Realtek 8239 NIC running RHEL5.3 (10.22.33.2/24)
These two machines are connected via two crossover UTP cables, eth1 to eth1 and eth2 to eth2.

I have bonding set up as follows:
Host1:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.2

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:a1:b0:00:42:16

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:05:1d:ee:9f:31
---

Host2:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 10.22.33.1

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:2f:22

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1f:1f:01:17:69
---

Now if I unplug the cable from eth2 (on either host), eth1 goes to state up and eth2 goes to state down on both hosts. When running ping at the same time I only lose a few packets. I tried this with the 128.1.1 kernel and with the patched kernel I linked in comment #3, and I get exactly the same result. I'm going to try this with the upstream kernel. If I understand it correctly, both eth1 and eth2 should be in state up? Am I doing something wrong?

Comment 6 Jiri Pirko 2009-02-24 17:50:00 UTC
ad comment #4 - yes, upstream kernel behaviour is the same

Comment 7 Jiri Pirko 2009-02-25 11:30:19 UTC
ad comment #5 

So I changed the network configuration. I put a switch in between, so all four NICs of both hosts are now connected to the switch. The behaviour of the upstream kernel, the Ubuntu kernel and my patched kernel changed: both slaves are now in state "up". With the original 128.1.1 kernel, the inactive slave is in state "down". So I have successfully tested that this patch works.

Comment 9 RHEL Program Management 2009-02-27 14:28:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Don Zickus 2009-03-04 20:01:49 UTC
in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 15 Neil Horman 2009-08-07 19:30:30 UTC
*** Bug 453113 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2009-09-02 08:08:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
