Description of problem:
The customer has observed that their routers and layer 3 switches do not update
their ARP tables when a failover event occurs (i.e. the active Piranha
router fails and the backup takes over).
In an attempt to troubleshoot this themselves, they collected tcpdumps and noticed
that the gratuitous ARPs for the virtual IPs the backup Piranha node is "taking
over" did not seem to be formed correctly. They believe this to be the
reason the routers and other network equipment do not update their ARP
tables when the backup Piranha LVS router takes over the load balancing.
At the core of this issue seems to be that Piranha does not generate
gratuitous ARPs that are formed in an RFC-compliant manner. I examined RFC
2002 section 4.6, which seems to be the defining document for gratuitous ARP
behavior. It may well have been defined prior to 1996, but I could not
find an earlier ARP-related RFC that defines gratuitous ARP behavior. I
compared what is defined in this RFC to the ARP traffic in the tcpdump provided
by the customer.
RFC 2002 section 4.6 says:
- A Gratuitous ARP  is an ARP packet sent by a node in order to
spontaneously cause other nodes to update an entry in their ARP
cache. A gratuitous ARP MAY use either an ARP Request or an ARP
Reply packet. In either case, the ARP Sender Protocol Address
and ARP Target Protocol Address are both set to the IP address
of the cache entry to be updated, and the ARP Sender Hardware
Address is set to the link-layer address to which this cache
entry should be updated. When using an ARP Reply packet, the
Target Hardware Address is also set to the link-layer address to
which this cache entry should be updated (this field is not used
in an ARP Request packet).
In either case, for a gratuitous ARP, the ARP packet MUST be
transmitted as a local broadcast packet on the local link. As
specified in , any node receiving any ARP packet (Request or
Reply) MUST update its local ARP cache with the Sender Protocol
and Hardware Addresses in the ARP packet, if the receiving node
has an entry for that IP address already in its ARP cache. This
requirement in the ARP protocol applies even for ARP Request
packets, and for ARP Reply packets that do not match any ARP
Request transmitted by the receiving node .
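To make the field layout in the quoted text concrete, here is a minimal Python sketch that builds the 28-byte ARP payload for both gratuitous ARP styles. It is only an illustration of the RFC wording, not the send_arp code or the attached patch; the function name and the MAC address are hypothetical, and the all-zero target hardware address in the request style is an assumption (the RFC only says the field is not used).

```python
import socket
import struct

ARP_REQUEST = 1
ARP_REPLY = 2


def build_gratuitous_arp(ip, mac, opcode=ARP_REQUEST):
    """Build the 28-byte ARP payload (no Ethernet header) of a gratuitous ARP.

    Hypothetical helper following the RFC 2002 section 4.6 text quoted above.
    """
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    if opcode == ARP_REPLY:
        # Reply style: target hardware address equals the sender's.
        target_mac = sender_mac
    else:
        # Request style: the field is unused; all zeros is an assumption.
        target_mac = b"\x00" * 6
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,           # hardware type: Ethernet
        0x0800,      # protocol type: IPv4
        6,           # hardware address length
        4,           # protocol address length
        opcode,
        sender_mac,  # sender hardware address
        addr,        # sender protocol address
        target_mac,  # target hardware address
        addr,        # target protocol address (same IP as sender)
    )


if __name__ == "__main__":
    # 192.168.1.250 is the virtual IP from the capture; the MAC is made up.
    print(build_gratuitous_arp("192.168.1.250", "00:11:22:33:44:55", ARP_REPLY).hex())
```

On the wire, this payload would still need an Ethernet header addressed to the link-layer broadcast address, since the RFC requires the packet to be sent as a local broadcast.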
The packets that send_arp (the tool called by pulse) sends to gratuitously ARP
the network on service startup or failover do not comply with the RFC
requirements quoted above.
As a result, the upstream switches and routers on the network do not update
their ARP caches, leading to an interruption of services on the virtual IP(s).
A patch which I believe corrects the problem is attached. However, the original
customer who reported the problem never tested the test packages, and
their IT was closed. tcpdumps taken with the patch applied *appear* to show
correctly formed gratuitous ARPs (although wireshark only seems to label
ARP-request-style gratuitous ARP packets as gratuitous ARPs). Someone may wish
to double-check that I have the right idea.
Version-Release number of selected component (if applicable):
Piranha (all versions 0.7.x - 0.8.4)
Steps to Reproduce:
1. start tcpdump
2. start pulse service (or fail over from primary to backup, or vice versa)
3. inspect gratuitous arps sent by send_arp util called by pulse
Actual results:
Incorrectly formed gratuitous ARP packets that do not convince strictly
RFC-compliant routers and hosts to update their ARP cache
Expected results:
Piranha should be able to correctly update the ARP cache on other hosts
Additional info:
The attached .pcap file is a packet capture I generated from a standalone test
Piranha system. The odd-numbered frames from frame 13 through frame 21 are the 5
"gratuitous ARPs" pulse/send_arp generated as the pulse daemon started and
attempted to gratuitously ARP the network.
Based on what RFC 2002 says (cited above), the following are problems with the
ARP packets being sent:
1. Both the ARP sender and ARP target protocol addresses are to be set to the IP
address of the ARP cache entry to be updated. The packets sent have the correct
sender IP address (192.168.1.250), but the target IP is incorrectly set to the
broadcast address of the local IP subnet (192.168.1.255).
2. When an ARP Reply type packet is used, the sender AND target hardware
addresses MUST be set to the MAC address of the ARP cache entry to be updated.
The packets sent by pulse/send_arp contain the correct ARP sender MAC address,
but set the ARP target MAC address to the broadcast address (FF:FF:FF:FF:FF:FF).
Therefore any hosts or layer 3+ switches on the network which are "picky" about
RFC compliance could not be expected to update their local ARP cache tables from
these gratuitous ARPs.
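The two checks above can be sketched as a small Python validator that takes a raw 28-byte ARP payload and reports which of the RFC 2002 section 4.6 rules it violates. This is an illustrative sketch only: the function name is hypothetical, the sample packet is reconstructed from the description above rather than copied from the capture, and its sender MAC and reply opcode are assumptions.

```python
import socket
import struct

ARP_REPLY = 2


def check_gratuitous_arp(payload):
    """Return a list of RFC 2002 section 4.6 violations found in an ARP payload.

    Hypothetical helper; expects the 28-byte ARP payload without the
    Ethernet header.
    """
    (_htype, _ptype, _hlen, _plen, oper,
     sha, spa, tha, tpa) = struct.unpack("!HHBBH6s4s6s4s", payload[:28])
    problems = []
    if spa != tpa:
        # Problem 1: sender and target protocol addresses must match.
        problems.append(
            "target IP %s differs from sender IP %s"
            % (socket.inet_ntoa(tpa), socket.inet_ntoa(spa))
        )
    if oper == ARP_REPLY and tha != sha:
        # Problem 2: in a reply, target MAC must equal sender MAC.
        problems.append(
            "reply target MAC %s differs from sender MAC %s"
            % (tha.hex(":"), sha.hex(":"))
        )
    return problems


if __name__ == "__main__":
    # Sample modeled on the packets described above. The sender MAC is made
    # up, and the reply opcode is an assumption.
    bad = struct.pack(
        "!HHBBH6s4s6s4s", 1, 0x0800, 6, 4, ARP_REPLY,
        bytes.fromhex("001122334455"), socket.inet_aton("192.168.1.250"),
        b"\xff" * 6, socket.inet_aton("192.168.1.255"),
    )
    for problem in check_gratuitous_arp(bad):
        print(problem)
```

A packet that passes both checks returns an empty list. Note that `bytes.hex(":")` needs Python 3.8 or later.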
The arping utility from the iputils package sends correctly formed gratuitous ARPs,
although they are ARP-request-style packets. (RFC 2002 defines both ARP Request
and ARP Reply style gratuitous ARP packets.)
Created attachment 157925
Patch to make gratuitous ARPs generated by send_arp RFC 2002-compliant
Created attachment 157928
Small tcpdump from pulse service startup containing example bad gratuitous ARP packets
Patch is in CVS
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.