Red Hat Bugzilla – Bug 1380405
send_arp usage() needs update to reflect send_arp.libnet compatibility options
Last modified: 2017-08-01 10:55:11 EDT
Description of problem:

IPaddr2 resource agent is using wrong send_arp arguments and this could be the reason why in some environments network switches are not having their arp tables updated.

Looking at /usr/lib/ocf/resource.d/heartbeat/IPaddr2, I could see that:

a) SENDARP=$HA_BIN/send_arp

b) HA_BIN=/usr/libexec/heartbeat

c) arguments to SENDARP:
    ARGS="-i $OCF_RESKEY_arp_interval -r $OCF_RESKEY_arp_count -p $SENDARPPIDFILE $NIC $OCF_RESKEY_ip auto not_used not_used"
   or
    ARGS="-i $OCF_RESKEY_arp_interval -r $OCF_RESKEY_arp_count -p $SENDARPPIDFILE $NIC $OCF_RESKEY_ip $MY_MAC not_used not_used"

d) arguments accepted by /usr/libexec/heartbeat/send_arp:
    Usage: arping [-fqbDUAV] [-c count] [-w timeout] [-I device] [-s source] destination
      -f : quit on first reply
      -q : be quiet
      -b : keep broadcasting, don't go unicast
      -D : duplicate address detection mode
      -U : Unsolicited ARP mode, update your neighbours
      -A : ARP answer mode, update your neighbours
      -V : print version and exit
      -c count : how many packets to send
      -w timeout : how long to wait for a reply
      -I device : which ethernet device to use (eth0)
      -s source : source ip address
      destination : ask for what ip address

e) No other send_arp file in the system:
    [root@dc01-controller-0 ~]# find / -iname send_arp
    /usr/libexec/heartbeat/send_arp

Running send_arp manually, using the arguments above:

    [root@dc01-controller-0 heartbeat]# ./send_arp -i 200 -r 5 -p /tmp/arp vlan20 10.3.28.21 auto not_used not_used
    ARPING 10.3.28.21 from 10.3.28.21 vlan20
    Sent 5 probes (5 broadcast(s))
    Received 0 response(s)
    [root@dc01-controller-0 heartbeat]# echo $?
    0

It's not failing, but it's probably not doing the right thing, cause it's sending ARP requests to itself (10.3.28.21 is local interface address).
Now, using the expected syntax:

    [root@dc01-controller-0 heartbeat]# ./send_arp -c 5 -U -I vlan20 -s 10.3.28.21 10.3.28.1
    ARPING 10.3.28.1 from 10.3.28.21 vlan20
    Sent 5 probes (5 broadcast(s))
    Received 0 response(s)
    [root@dc01-controller-0 heartbeat]# echo $?
    0

Note that it's sending ARP requests to 10.3.28.1 (which could be the switch IP), from the local interface address (10.3.28.21). I'm not getting any replies cause I don't have a switch at the target address.

By the way, the IPaddr2 resource agent is using the arguments expected by another tool, which is also part of the ClusterLabs resource agents, but which it seems we don't ship [1].

[1] https://github.com/ClusterLabs/resource-agents/blob/master/tools/send_arp.libnet.c

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.16.x86_64

How reproducible:
In at least one environment with Cisco Nexus 9000 switches, VIP failover failed every time. It's worth noting that in this same environment, running the following command would force the switch to update its ARP table:

    # arping -U -c 1 -I ${IFACE} -s ${IFACE_IP} ${SWITCH_IP}

Steps to Reproduce:
1. Force VIP failover
2. Check if gratuitous ARPs are being sent the correct way (tcpdump?)

Actual results:
send_arp is called with wrong arguments and this probably causes wrong gratuitous ARP requests.

Expected results:
send_arp should be called with correct arguments.
Hi,

=============
[root@dc01-controller-0 heartbeat]# ./send_arp -i 200 -r 5 -p /tmp/arp vlan20 10.3.28.21 auto not_used not_used
ARPING 10.3.28.21 from 10.3.28.21 vlan20
Sent 5 probes (5 broadcast(s))
Received 0 response(s)
[root@dc01-controller-0 heartbeat]# echo $?
0

It's not failing, but it's probably not doing the right thing, cause it's sending ARP requests to itself (10.3.28.21 is local interface address).
==============

It's doing exactly what it should do: it's sending a properly crafted gratuitous ARP packet.

https://wiki.wireshark.org/Gratuitous_ARP
~~~
A gratuitous ARP request is an AddressResolutionProtocol request packet where the source and destination IP are both set to the IP of the machine issuing the packet and the destination MAC is the broadcast address ff:ff:ff:ff:ff:ff.
~~~

The official definition in RFC 5944 says the same:

https://tools.ietf.org/html/rfc5944#page-74
~~~
A Gratuitous ARP [45] is an ARP packet sent by a node in order to spontaneously cause other nodes to update an entry in their ARP cache. A gratuitous ARP MAY use either an ARP Request or an ARP Reply packet. In either case, the ARP Sender Protocol Address and ARP Target Protocol Address are both set to the IP address of the cache entry to be updated, and the ARP Sender Hardware Address is set to the link-layer address to which this cache entry should be updated. When using an ARP Reply packet, the Target Hardware Address is also set to the link-layer address to which this cache entry should be updated (this field is not used in an ARP Request packet).
~~~

~~~
[root@overcloud-controller-0 ~]# yum install wireshark -y
[root@overcloud-controller-0 ~]# tshark -ivlan905 "arp and ether host f2:67:52:70:09:d1" -O arp & /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /tmp/arp vlan905 10.3.28.21 auto not_used not_used
[1] 31475
ARPING 10.3.28.21 from 10.3.28.21 vlan905
Running as user "root" and group "root". This could be dangerous.
Capturing on 'vlan905'
Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: f2:67:52:70:09:d1 (f2:67:52:70:09:d1), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    [Is gratuitous: True]
    Sender MAC address: f2:67:52:70:09:d1 (f2:67:52:70:09:d1)
    Sender IP address: 10.3.28.21 (10.3.28.21)
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: 10.3.28.21 (10.3.28.21)
~~~

The reason this is not working is likely that the Cisco device either never receives the broadcast (because it gets blocked somewhere, e.g. by disabled ARP flooding) or does not react correctly to the ARP request. In the case of Cisco ACI, we/Cisco should perhaps tell customers to enable ARP flooding?

The reason why your approach is working is that you do not send a gratuitous ARP packet, but a normal ARP request with all 1s in the target MAC address, asking the device at 10.3.28.1 to return its MAC address.

~~~
[root@overcloud-controller-0 ~]# tshark -ivlan905 "arp and ether host f2:67:52:70:09:d1" -O arp & /usr/libexec/heartbeat/send_arp -c 5 -U -I vlan905 -s 10.3.28.21 10.3.28.1
[1] 2329
ARPING 10.3.28.1 from 10.3.28.21 vlan905
Running as user "root" and group "root". This could be dangerous.
Capturing on 'vlan905'
Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: f2:67:52:70:09:d1 (f2:67:52:70:09:d1), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: f2:67:52:70:09:d1 (f2:67:52:70:09:d1)
    Sender IP address: 10.3.28.21 (10.3.28.21)
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: 10.3.28.1 (10.3.28.1)
~~~

Compare this to a "normal" ARP request:

~~~
[root@overcloud-controller-0 ~]# tshark -ivlan905 "arp and ether host f2:67:52:70:09:d1" -O arp & ping 10.0.0.100
[1] 8462
PING 10.0.0.100 (10.0.0.100) 56(84) bytes of data.
Running as user "root" and group "root". This could be dangerous.
Capturing on 'vlan905'
Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: f2:67:52:70:09:d1 (f2:67:52:70:09:d1), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: f2:67:52:70:09:d1 (f2:67:52:70:09:d1)
    Sender IP address: 10.0.0.5 (10.0.0.5)
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 10.0.0.100 (10.0.0.100)
~~~

According to RFC 826, it doesn't matter if the target MAC address is all 1s or all 0s:

https://tools.ietf.org/html/rfc826
~~~
ares_hrd$Ethernet, ar$pro to the protocol type that is being resolved, ar$hln to 6 (the number of bytes in a 48.bit Ethernet address), ar$pln to the length of an address in that protocol, ar$op to ares_op$REQUEST, ar$sha with the 48.bit ethernet address of itself, ar$spa with the protocol address of itself, and ar$tpa with the protocol address of the machine that is trying to be accessed. It does not set ar$tha to anything in particular, because it is this value that it is trying to determine. It could set ar$tha to the broadcast address for the hardware (all ones in the case of the 10Mbit Ethernet) if that makes it convenient for some aspect of the implementation.
~~~

Long story short, the resource agent's behavior looks o.k. to me; something else in the network is misbehaving if this is not working.
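To make the difference between the captures concrete, here is a minimal C sketch of the RFC 5944 criterion quoted above: a packet counts as gratuitous when its sender and target protocol addresses are equal. The struct and the function names (`arp_ipv4`, `arp_make_request`, `arp_is_gratuitous`) are made up for this illustration and are not taken from send_arp or Wireshark; the fields are kept in host byte order, so this is not a wire-format encoder.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of an IPv4-over-Ethernet ARP packet, named after RFC 826.
 * Hypothetical helper type: values are in host byte order. */
struct arp_ipv4 {
    uint16_t hrd;     /* hardware type, 1 = Ethernet */
    uint16_t pro;     /* protocol type, 0x0800 = IP */
    uint8_t  hln;     /* hardware address length, 6 */
    uint8_t  pln;     /* protocol address length, 4 */
    uint16_t op;      /* opcode, 1 = request */
    uint8_t  sha[6];  /* sender hardware (MAC) address */
    uint8_t  spa[4];  /* sender protocol (IP) address */
    uint8_t  tha[6];  /* target hardware (MAC) address */
    uint8_t  tpa[4];  /* target protocol (IP) address */
};

/* Fill in an ARP request. tha_fill is the byte repeated across the
 * target MAC: 0xff in the send_arp captures, 0x00 in the ping capture. */
void arp_make_request(struct arp_ipv4 *p, const uint8_t sha[6],
                      const uint8_t spa[4], const uint8_t tpa[4],
                      uint8_t tha_fill)
{
    assert(p != NULL);
    p->hrd = 1;
    p->pro = 0x0800;
    p->hln = 6;
    p->pln = 4;
    p->op  = 1;
    memcpy(p->sha, sha, sizeof p->sha);
    memcpy(p->spa, spa, sizeof p->spa);
    memset(p->tha, tha_fill, sizeof p->tha);
    memcpy(p->tpa, tpa, sizeof p->tpa);
}

/* RFC 5944 criterion: sender and target protocol address are the same.
 * The target MAC plays no part in the decision. */
int arp_is_gratuitous(const struct arp_ipv4 *p)
{
    assert(p != NULL);
    return memcmp(p->spa, p->tpa, sizeof p->spa) == 0;
}
```

With sender 10.3.28.21 and target 10.3.28.21 (the first capture) `arp_is_gratuitous` returns 1; with target 10.3.28.1 (the second capture) it returns 0, regardless of whether the target MAC is filled with 0xff or 0x00.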
I won't argue if the resource agent is sending a correct GARP or not, as I trust your research. I also know that when ACI is used and ARP flooding is enabled it works, although I'm unable to say if this is good practice.

My main concern is this may be working by accident. send_arp is being called with wrong arguments. It expects:

    Usage: arping [-fqbDUAV] [-c count] [-w timeout] [-I device] [-s source] destination
      -f : quit on first reply
      -q : be quiet
      -b : keep broadcasting, don't go unicast
      -D : duplicate address detection mode
      -U : Unsolicited ARP mode, update your neighbours
      -A : ARP answer mode, update your neighbours
      -V : print version and exit
      -c count : how many packets to send
      -w timeout : how long to wait for a reply
      -I device : which ethernet device to use (eth0)
      -s source : source ip address
      destination : ask for what ip address

We're sending:

    -i 200 -r 5 -p /tmp/arp vlan905 10.3.28.21 auto not_used not_used

No match between expected and used options.
Ah, ok, got it. I think it's just a copy and paste of the help text of arping ...

https://github.com/ClusterLabs/resource-agents/blob/ca1e614c6cf9f85fb7341a6086b003735589a3a6/tools/send_arp.linux.c
~~~
void usage(void)
{
	fprintf(stderr,
		"Usage: arping [-fqbDUAV] [-c count] [-w timeout] [-I device] [-s source] destination\n"
		" -f : quit on first reply\n"
		" -q : be quiet\n"
		" -b : keep broadcasting, don't go unicast\n"
		" -D : duplicate address detection mode\n"
		" -U : Unsolicited ARP mode, update your neighbours\n"
		" -A : ARP answer mode, update your neighbours\n"
		" -V : print version and exit\n"
		" -c count : how many packets to send\n"
		" -w timeout : how long to wait for a reply\n"
		" -I device : which ethernet device to use"
#ifdef DEFAULT_DEVICE_STR
		" (" DEFAULT_DEVICE_STR ")"
#endif
		"\n"
		" -s source : source ip address\n"
		" destination : ask for what ip address\n"
		);
	exit(2);
}
~~~

The actual options are here:

~~~
while ((ch = getopt(argc, argv, "h?bfDUAqc:w:s:I:Vr:i:p:")) != EOF) {
	switch(ch) {
	case 'b':
		broadcast_only=1;
		break;
	case 'D':
		dad++;
		quit_on_reply=1;
		break;
	case 'U':
		unsolicited++;
		break;
	case 'A':
		advert++;
		unsolicited++;
		break;
	case 'q':
		quiet++;
		break;
	case 'r': /* send_arp.libnet compatibility option */
		hb_mode = 1;
		/* fall-through */
	case 'c':
		count = atoi(optarg);
		break;
	case 'w':
		timeout = atoi(optarg);
		break;
	case 'I':
		device.name = optarg;
		break;
	case 'f':
		quit_on_reply=1;
		break;
	case 's':
		source = optarg;
		break;
	case 'V':
		printf("send_arp utility, based on arping from iputils-%s\n", SNAPSHOT);
		exit(0);
	case 'p':
	case 'i':
		hb_mode = 1;
		/* send_arp.libnet compatibility options, ignore */
		break;
	case 'h':
	case '?':
	default:
		usage();
	}
}
~~~

Note how '-p' and '-i' do nothing except enable hb_mode, and '-r' does this:

~~~
	case 'r': /* send_arp.libnet compatibility option */
		hb_mode = 1;
		/* fall-through */
~~~

Check then how hb_mode changes the interpretation of arguments:

~~~
if(hb_mode) {
	/* send_arp.libnet compatibility mode */
	if (argc - optind != 5) {
		usage();
		return 1;
	}
	/*
	 * argv[optind+1] DEVICE dc0,eth0:0,hme0:0,
	 * argv[optind+2] IP 192.168.195.186
	 * argv[optind+3] MAC ADDR 00a0cc34a878
	 * argv[optind+4] BROADCAST 192.168.195.186
	 * argv[optind+5] NETMASK ffffffffffff
	 */
	unsolicited = 1;
	device.name = argv[optind];
	target = argv[optind+1];
}
~~~

Note also that the arguments the comment labels optind+3, +4 and +5 (MAC address, broadcast, netmask) are not used in the code, although five positional arguments must still be supplied.

So, this bug should be about the `void usage` function, which needs an update. I have not checked the rest of the code, because the following capture shows that it nevertheless does what it needs to do:

~~~
[root@overcloud-controller-0 ~]# yum install wireshark -y
[root@overcloud-controller-0 ~]# tshark -ivlan905 "arp and ether host f2:67:52:70:09:d1" -O arp & /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /tmp/arp vlan905 10.3.28.21 auto not_used not_used
[1] 31475
ARPING 10.3.28.21 from 10.3.28.21 vlan905
Running as user "root" and group "root". This could be dangerous.
Capturing on 'vlan905'
Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: f2:67:52:70:09:d1 (f2:67:52:70:09:d1), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    [Is gratuitous: True]
    Sender MAC address: f2:67:52:70:09:d1 (f2:67:52:70:09:d1)
    Sender IP address: 10.3.28.21 (10.3.28.21)
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: 10.3.28.21 (10.3.28.21)
~~~

But I agree with you that the `-h` option and the `void usage` function need a fix.
Thanks for the comprehensive analysis, Andreas. I'll admit I opened the source code and stopped reading at the "usage" function. I never thought anyone would "forget" to update it. I'll update the summary to reflect the real issue.
There seem to be other issues with it as well.

"The send_arp utility for linux ignores the src_hw_addr, broadcast_ip_addr, and netmask arguments. This results in the utility sending out the wrong mac address when called by IProute2 in clone (clusterip) mode."

https://github.com/ClusterLabs/resource-agents/issues/860
Tested and working patch: https://github.com/ClusterLabs/resource-agents/pull/961
Verified based on comment 13. (resource-agents-3.9.5-105)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1844