Bug 610234 - [5u6] Bonding in ALB mode sends ARP in loop
Summary: [5u6] Bonding in ALB mode sends ARP in loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Flavio Leitner
QA Contact: Liang Zheng
URL:
Whiteboard:
Depends On:
Blocks: 610236 610237 623143
Reported: 2010-07-01 19:51 UTC by Flavio Leitner
Modified: 2018-10-27 13:25 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 610236 610237 (view as bug list)
Environment:
Last Closed: 2011-01-13 21:40:29 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Flavio Leitner 2010-07-01 19:51:06 UTC
Description of problem:
When two systems using bonding devices in adaptive load
balancing (ALB) mode communicate with each other, an endless
ping-pong of ARP replies starts between the two systems.

What happens? In ALB mode, the bonding driver keeps track
of each connected client in a hash table so that it can do
receive load balancing (RLB). When an ARP reply is received,
the driver looks up the client entry in this hash table,
updates its MAC address, and flags it to be announced later.
Two seconds later, the ALB monitor runs and sends two ARP
replies for each updated client entry to update that specific
client. The same process then happens on the receiving
system, causing the endless ping-pong of ARP replies.

The sequence, including the relevant functions, is shown below:

   System 1                          System 2
    bond0                             bond0

   ping <system2>
    ARP request  --------->
                           <--------- ARP reply

+->rlb_arp_recv  <---------------------+   <--- loop begins
|  rlb_update_entry_from_arp           |
|  client_info->ntt = 1;               |
|  bond_info->rx_ntt = 1;              |
|                                      |
|         <communication succeeds>     |
|                                      |
|  bond_alb_monitor                    |
|  rlb_update_rx_clients               |
|  rlb_update_client                   |
|  arp_create(ARPOP_REPLY)             |
|   send ARP reply -------------->     V
|   send ARP reply -------------->
|                               rlb_arp_recv
|                               rlb_update_entry_from_arp
|                               client_info->ntt = 1;
|                               bond_info->rx_ntt = 1;
|                           < snipped, same as in system 1>
+-------           <-------------- send ARP reply
                   <-------------- send ARP reply
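
The loop in the diagram can be reproduced in a small userspace model. The sketch below is only an illustration in Python, not the driver code: the `Bond` and `Client` classes and the `simulate` helper are hypothetical names, the MAC addresses are placeholders, and the two monitors run back to back instead of on independent 2-second timers. Only the `ntt` and `rx_ntt` flags mirror fields named in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    mac: str
    ntt: bool = False  # "need to transmit", mirrors client_info->ntt


@dataclass
class Bond:
    name: str
    mac: str
    clients: dict = field(default_factory=dict)  # peer name -> Client
    rx_ntt: bool = False  # mirrors bond_info->rx_ntt

    def arp_recv(self, sender):
        """Unpatched behaviour: every ARP reply rewrites and flags the entry."""
        entry = self.clients.get(sender.name)
        if entry is None:
            return
        entry.mac = sender.mac  # rewritten even when the value is identical
        entry.ntt = True
        self.rx_ntt = True

    def monitor(self, peers):
        """One monitor pass: announce each flagged entry with two ARP replies."""
        sent = 0
        if not self.rx_ntt:
            return sent
        for peer_name, entry in self.clients.items():
            if entry.ntt:
                entry.ntt = False
                for _ in range(2):
                    peers[peer_name].arp_recv(self)
                    sent += 1
        self.rx_ntt = False
        return sent


def simulate(rounds):
    """Return the number of replies sent by (system1, system2) per round."""
    s1 = Bond("system1", "00:00:5e:00:53:01")
    s2 = Bond("system2", "00:00:5e:00:53:02")
    s1.clients["system2"] = Client(s2.mac)
    s2.clients["system1"] = Client(s1.mac)
    peers = {"system1": s1, "system2": s2}

    s1.arp_recv(s2)  # the first ARP reply from system 2 starts the loop
    return [(s1.monitor(peers), s2.monitor(peers)) for _ in range(rounds)]


print(simulate(5))
```

In this model every round has both systems sending two replies, and nothing in the unpatched receive path ever lets the flags settle, which matches the endless exchange described above.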

Besides the unneeded network traffic, this loop breaks
a cluster because a backup system can't take over the IP
address. There is always one system sending an ARP reply
and poisoning the network.

This patch fixes the problem by adding a check for the MAC
address before updating it. Thus, if the MAC address didn't
change, there is no need to update it or to announce it later.
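
A minimal sketch of that check, written as a userspace Python model and not as the actual kernel change: `ClientInfo` is a simplified stand-in for the driver's client entry, `update_entry_from_arp` only borrows its name from the driver function, and the IP and MAC values are documentation placeholders. The assumption illustrated is simply "compare the stored MAC with the one in the ARP reply, and skip both the update and the announcement flag when they match".

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    ip_dst: str
    mac_dst: str
    ntt: bool = False  # set when the entry must be announced later


def update_entry_from_arp(entry, arp_ip_src, arp_mac_src):
    """Update the entry only if the sender's MAC really changed.

    Returns True when the entry was updated and flagged, False otherwise.
    """
    if entry.ip_dst != arp_ip_src:
        return False  # the reply is not about this client
    if entry.mac_dst == arp_mac_src:
        return False  # MAC unchanged: nothing to update, nothing to announce
    entry.mac_dst = arp_mac_src
    entry.ntt = True
    return True


entry = ClientInfo(ip_dst="192.0.2.2", mac_dst="00:00:5e:00:53:02")

# Same MAC as already stored: the entry is left alone and not flagged.
unchanged = update_entry_from_arp(entry, "192.0.2.2", "00:00:5e:00:53:02")

# Different MAC (for example after a failover): update and flag the entry.
changed = update_entry_from_arp(entry, "192.0.2.2", "00:00:5e:00:53:03")

print(unchanged, changed, entry)
```

Because a repeated reply carrying the same MAC no longer sets the flag, the monitor on the receiving side has nothing to announce and the exchange stops after the first round.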

Version-Release number of selected component (if applicable):
5.6

How reproducible:
Always

Steps to Reproduce:
Just ping one system from another, with both running bonding in ALB mode, and you can see ARP replies being sent between them every 2 seconds.
  
Actual results:
It is impossible for a backup system to take over the IP address when there are other systems sending wrong ARP packets poisoning the network.

Expected results:
Don't send unneeded ARP packets

Additional info:
The patch is available upstream:
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=42d782ac1bef7cbcdf05b857731345c6e8149f90

The patch has been backported to RHEL5 and we are waiting for test results.

Comment 1 Andy Gospodarek 2010-07-16 17:34:12 UTC
Flavio, you did the leg work on this and you earned the internal credit too. 
Please post this when you get positive feedback.

Comment 4 Issue Tracker 2010-08-06 21:02:26 UTC
Event posted on 08-06-2010 05:02pm EDT by jnevill

Flavio,

We have confirmation from my customer that the patched 5.5 kernel
(2.6.18-194.el5.it1020633) resolved the issue for them.

They are also testing the RHEL 4.8 (2.6.9-89.it1020633) kernel and should
provide feedback next week.

- Justin



This event sent from IssueTracker by jnevill 
 issue 1020633

Comment 6 RHEL Program Management 2010-08-10 14:21:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Jarod Wilson 2010-08-13 02:24:11 UTC
in kernel-2.6.18-212.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 19 errata-xmlrpc 2011-01-13 21:40:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

