Bug 1451854 - Node lost connectivity to pod on another node (due to invalid ARP cache)
Summary: Node lost connectivity to pod on another node (due to invalid ARP cache)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.1
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1457399 1460328 1496647 (view as bug list)
Depends On:
Blocks: 1462952 1462955
 
Reported: 2017-05-17 16:22 UTC by Eric Paris
Modified: 2017-10-31 19:40 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When an IP address was re-used, the new pod was given a random MAC address that differed from the previous one. Consequence: Any node whose ARP cache still held the old entry for that IP could not communicate with the pod. Fix: Generate the MAC address deterministically from the IP address. Result: A re-used IP address always gets the same MAC address, so the ARP cache cannot go out of sync and traffic flows.
Clone Of:
Clones: 1462952 1462955 (view as bug list)
Environment:
Last Closed: 2017-08-10 05:24:06 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 14685 0 None None None 2017-06-15 20:40:28 UTC
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Comment 1 Ben Bennett 2017-05-17 20:28:03 UTC
ARP is definitely broken:

$ grep 10.1.12.236 broken/arp dest/arp good/arp
broken/arp:10.1.12.236 dev tun0 lladdr 6e:de:1c:a5:d7:76 REACHABLE
dest/arp:10.1.12.236 dev tun0 lladdr 56:f4:5a:9e:48:d7 STALE
good/arp:10.1.12.236 dev tun0 lladdr 56:f4:5a:9e:48:d7 REACHABLE

Not sure yet what is keeping that entry REACHABLE on the broken node... it should have timed out after a few minutes:

 $ grep ipv4.neigh.eth0 sysctl
net.ipv4.neigh.eth0.anycast_delay = 100
net.ipv4.neigh.eth0.app_solicit = 0
net.ipv4.neigh.eth0.base_reachable_time_ms = 30000
net.ipv4.neigh.eth0.delay_first_probe_time = 5
net.ipv4.neigh.eth0.gc_stale_time = 60
net.ipv4.neigh.eth0.locktime = 100
net.ipv4.neigh.eth0.mcast_solicit = 3
net.ipv4.neigh.eth0.proxy_delay = 80
net.ipv4.neigh.eth0.proxy_qlen = 64
net.ipv4.neigh.eth0.retrans_time_ms = 1000
net.ipv4.neigh.eth0.ucast_solicit = 3
net.ipv4.neigh.eth0.unres_qlen = 31
net.ipv4.neigh.eth0.unres_qlen_bytes = 65536
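
Given base_reachable_time_ms = 30000, the kernel should only keep an entry REACHABLE for a random interval of roughly base/2 to 3*base/2 (about 15-45 seconds) after the last confirmation, so something on the broken node must keep re-confirming it. Re-checking the entry directly (rough sketch, addresses from the grep above):

  # show the broken neighbour entry with its stats on the broken node
  ip -s neigh show 10.1.12.236 dev tun0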

Comment 2 Ben Bennett 2017-05-17 20:31:15 UTC
Perhaps we should send a gratuitous ARP when the pod comes up.

However, there's still something fishy here because I don't understand what was keeping the ARP cache entry current on the broken node.
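
For the record, a gratuitous ARP can also be sent by hand from inside the pod's network namespace (rough example; arping is from iputils, and the pod PID / interface / IP are the ones from this report):

  # announce the pod's current MAC for its IP so peers refresh their caches
  nsenter -t 85201 -n arping -U -c 3 -I eth0 10.1.12.236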

Comment 3 Steven Walter 2017-05-20 17:19:59 UTC
The issue returned for the customer over the weekend.

Comment 4 Steven Walter 2017-05-22 21:57:24 UTC
The issue over the weekend seems to have disappeared when Sten pinged every node from every other node (or else it disappeared shortly before he did that).

Comment 5 Eric Paris 2017-05-24 12:39:03 UTC
That's obviously not a long-term solution. Ben, DanW, what do we need/want collected next time this goes nuts? I heard something about a tcpdump from the router pod? What exactly do we want/need?

Comment 6 Ben Bennett 2017-05-24 13:03:14 UTC
I think I want a tcpdump from the source node filtered by the target pod IP.

So, from the source node:
  tcpdump -i vxlan_sys_4789 host <your-pod-ip-here> -w /tmp/dump
[Please run that for a minute or two]

Also, from the source node:
  ip -s neigh

(Do that last one a few times 10 seconds or so apart)

I'd also like the target node logs to see how quickly it is cycling the pods on that node.
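
Something like this (untested sketch; substitute the pod IP) would collect both in one go from the source node:

  # capture ~2 minutes of vxlan traffic to/from the pod and snapshot the
  # neighbour table every 10 seconds
  POD_IP=<your-pod-ip-here>
  timeout 120 tcpdump -i vxlan_sys_4789 host "$POD_IP" -w /tmp/dump &
  for i in $(seq 12); do ip -s neigh >> /tmp/neigh.log; sleep 10; done
  wait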

Comment 7 Eric Paris 2017-05-24 18:02:01 UTC
# ip -s neigh | grep 13\\.184
10.1.13.184 dev tun0 lladdr a2:cb:51:22:a4:f5 ref 1 used 2550/3/2548 probes 1 REACHABLE
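
As far as I know, the three numbers after "used" are seconds since the entry was last used / confirmed / updated, so 2550/3/2548 means the kernel confirmed this neighbour only ~3 seconds ago; something keeps refreshing it. Watching it for a bit (rough sketch) should show whether that middle number keeps resetting:

  # print the entry every 10 seconds to see if it keeps being re-confirmed
  while sleep 10; do date; ip -s neigh show 10.1.13.184 dev tun0; done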

Comment 8 Eric Paris 2017-05-24 18:16:54 UTC
If this happens again running:

ip neigh flush dev tun0

on the infra nodes should get it back to life. We could consider running this in a loop, every minute or so, to make sure it can never be broken for more than a minute...
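
A minimal version of that loop (stop-gap only, run on the affected infra nodes):

  # flush the tun0 neighbour cache once a minute so a stale entry can
  # never survive longer than that
  while true; do ip neigh flush dev tun0; sleep 60; done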

Comment 9 Ben Bennett 2017-05-24 20:32:23 UTC
Needless to say, this is just a stop-gap until we work out the real problem and a proper solution.

Comment 10 Steven Walter 2017-05-25 19:15:11 UTC
Did we put the loop into place yet? I think it would be a good idea just to avoid further outages, if we think we know what the issue is and don't need to replicate it again.

Comment 20 Dan Yocum 2017-05-31 21:17:24 UTC
Maybe the problem has something to do with this (v2.6 though...):

https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327819.html

Comment 28 Ben Bennett 2017-06-07 15:25:22 UTC
What loglevel is the cluster running at?  If 3 or higher, can I get:

  journalctl -l -u openshift-node.service | egrep 'Calling network plugin|creating hosts mount'

From the router and a node with a broken pod?

Comment 29 Eric Paris 2017-06-07 17:11:40 UTC
They run at loglevel 2.

Comment 32 Seth Jennings 2017-06-16 03:44:41 UTC
*** Bug 1457399 has been marked as a duplicate of this bug. ***

Comment 33 openshift-github-bot 2017-06-19 23:02:19 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/256ddc5164e8ef175b04f12d97dd44b64d5d1771
Change the MAC addresses to be generated based on IP

The ARP cache seems to misbehave in some clusters and if an IP address
is reused then the MAC address can get out of sync (the root cause has
not been identified).  When that happens the target pod looks at the
MAC address and drops the packets on the floor despite the IP address
being correct.

This patch changes the code to generate the MAC address based on the
IP address so that when an IP is reused, it will have the same MAC
address and the packets will flow.

This behavior is consistent with what Docker does when allocating MACs
and what Kubenet does too.  It may change in the future, but this
fixes our CNI plugin now.

Fixes bug 1451854 (https://bugzilla.redhat.com/show_bug.cgi?id=1451854)
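
For illustration only (not the literal patch): with the scheme seen in the verified build (comment 37), the generated MAC is a fixed 0a:58 prefix followed by the four bytes of the pod's IPv4 address, so the mapping can be reproduced by hand:

  # rough sketch of the IP -> MAC mapping observed after the fix
  ip=10.128.0.9
  printf '0a:58:%02x:%02x:%02x:%02x\n' $(echo "$ip" | tr '.' ' ')
  # prints 0a:58:0a:80:00:09, the link/ether reported in comment 37

Because the mapping is a pure function of the IP, a re-used IP always comes back with the same MAC, so a stale ARP entry stays correct.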

Comment 34 Ben Bennett 2017-06-20 18:47:48 UTC
*** Bug 1460328 has been marked as a duplicate of this bug. ***

Comment 35 Ben Bennett 2017-06-21 15:52:05 UTC
PR https://github.com/openshift/origin/issues/14685

Comment 37 Hongan Li 2017-06-23 08:13:56 UTC
Verified in atomic-openshift-3.6.121-1.git.0.d763f6f.el7.x86_64; the pod MAC address generation has been changed and no other issues were found.

3: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8951 qdisc noqueue state UP 
    link/ether 0a:58:0a:80:00:09 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.9/23 scope global eth0

Comment 38 Ben Bennett 2017-06-28 17:39:49 UTC
Edited bug description (with sensitive info removed)
----------------------------------------------------

ping from a node (source) to a pod (dest) on another node failed.
ping from a pod on the source node to the dest failed
tcpdump showed the echo request getting into the dest pod
tcpdump did not show an echo reply at all

There was some time between the pings and the collection of the ip neigh info, but ip neigh inside the dest did not show the source node IP address(es) at all.

A ping from the dest to the source worked.
Now everything seems to be working.

tcpdump -n -i vxlan_sys_4789 icmp and host 10.1.12.236 and not host 10.1.0.1
 
The three commands I'm running (from the failing node):
ping 10.1.12.236
nsenter -t 76272 -n ping 10.1.12.236
ping -I 172.31.49.159 10.1.12.236
 
tcpdump collected on source node:
14:42:47.792901 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 6, length 64
14:42:48.793909 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 7, length 64
14:42:49.795900 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 8, length 64
14:42:50.796885 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 9, length 64
14:42:57.097058 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 54665, seq 1, length 64
14:42:57.097629 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 54665, seq 1, length 64
14:42:58.098506 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 54665, seq 2, length 64
14:42:58.098861 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 54665, seq 2, length 64
14:51:19.493525 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 1, length 64
14:51:20.493506 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 2, length 64
14:51:21.492936 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 3, length 64
14:51:22.493891 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 4, length 64
 
 
tcpdump from the dest node:
14:44:27.521357 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 1, length 64
14:44:28.522166 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 2, length 64
14:44:29.523209 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 3, length 64
14:44:30.524163 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 4, length 64
14:44:34.577879 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 56933, seq 1, length 64
14:44:34.578166 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 56933, seq 1, length 64
14:44:35.578124 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 56933, seq 2, length 64
14:44:35.578154 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 56933, seq 2, length 64
14:48:24.792790 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 1, length 64
14:48:25.792237 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 2, length 64
14:48:26.793152 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 3, length 64
14:48:27.793552 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 4, length 64
 
 
 
tcpdump collected inside the target container net namespace:
# nsenter -t 85201 -n tcpdump -n -i any icmp and not host 10.1.0.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
14:54:00.592514 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 1, length 64
14:54:01.592382 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 2, length 64
14:54:02.593354 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 3, length 64
14:54:03.593350 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 4, length 64
 
 
I don't know who 10.1.0.1 is, but they are pinging this container as well, and it is responding to them.
# nsenter -t 85201 -n tcpdump -n -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
14:55:28.097837 IP 10.1.0.1 > 10.1.12.236: ICMP echo request, id 43905, seq 30511, length 64
14:55:28.097868 IP 10.1.12.236 > 10.1.0.1: ICMP echo reply, id 43905, seq 30511, length 64
14:55:29.098831 IP 10.1.0.1 > 10.1.12.236: ICMP echo request, id 43905, seq 30512, length 64
14:55:29.098876 IP 10.1.12.236 > 10.1.0.1: ICMP echo reply, id 43905, seq 30512, length 64
 
# nsenter -t 85201 -n -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if19835: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP
    link/ether 56:f4:5a:9e:48:d7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.12.236/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::54f4:5aff:fe9e:48d7/64 scope link
       valid_lft forever preferred_lft forever
 
 
# nsenter -t 85201 -n -- ip route
default via 10.1.12.1 dev eth0
10.1.0.0/16 dev eth0
10.1.12.0/24 dev eth0  proto kernel  scope link  src 10.1.12.236
 
 
# nsenter -t 85201 -n -- iptables-save
# Generated by iptables-save v1.4.21 on Wed May 17 14:58:47 2017
*filter
:INPUT ACCEPT [119617:7776321]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [91105:11477721]
COMMIT
# Completed on Wed May 17 14:58:47 2017
# Generated by iptables-save v1.4.21 on Wed May 17 14:58:47 2017
*nat
:PREROUTING ACCEPT [28634:1718328]
:INPUT ACCEPT [28634:1718328]
:OUTPUT ACCEPT [5:346]
:POSTROUTING ACCEPT [5:346]
COMMIT
# Completed on Wed May 17 14:58:47 2017
 
 
# nsenter -t 85201 -n -- ip neigh
10.1.12.1 dev eth0 lladdr 32:2d:c2:20:d6:50 STALE
10.1.11.249 dev eth0 lladdr 72:d4:e1:07:93:95 STALE
10.1.0.1 dev eth0 lladdr be:85:76:59:2c:94 REACHABLE
10.1.8.1 dev eth0 lladdr fa:da:95:b0:fe:d9 STALE
 
 
# nsenter -t 85201 -n -- ping 10.1.11.1
PING 10.1.11.1 (10.1.11.1) 56(84) bytes of data.
64 bytes from 10.1.11.1: icmp_seq=1 ttl=64 time=1.05 ms
64 bytes from 10.1.11.1: icmp_seq=2 ttl=64 time=0.274 ms
64 bytes from 10.1.11.1: icmp_seq=3 ttl=64 time=0.315 ms
64 bytes from 10.1.11.1: icmp_seq=4 ttl=64 time=0.340 ms
^C
--- 10.1.11.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.274/0.496/1.056/0.324 ms
 
 
After I pinged from the "dest" pod to the source node, everything just magically started working, and we have no idea how to break it again.
 
Here is the 'dest' pod ip neigh table now that things are working.
# nsenter -t 85201 -n -- ip neigh
10.1.12.1 dev eth0 lladdr 32:2d:c2:20:d6:50 STALE
10.1.5.1 dev eth0 lladdr 62:1c:4e:c9:69:41 STALE
10.1.11.249 dev eth0  FAILED
10.1.13.1 dev eth0 lladdr 0a:04:0b:f0:8f:e4 STALE
10.1.6.1 dev eth0 lladdr 4e:ef:70:a7:42:8c STALE
10.1.14.1 dev eth0 lladdr 82:00:6b:93:27:75 STALE
10.1.7.1 dev eth0 lladdr 3e:b5:03:87:3a:18 STALE
10.1.11.1 dev eth0 lladdr 66:26:80:5e:9d:53 REACHABLE
10.1.0.1 dev eth0 lladdr be:85:76:59:2c:94 REACHABLE
10.1.8.1 dev eth0 lladdr fa:da:95:b0:fe:d9 STALE
 
Looking for pods with duplicate IP addresses:
# oc get pods --all-namespaces -o wide | awk '{print $7}' | sort | uniq -d | wc -l
0

Comment 40 errata-xmlrpc 2017-08-10 05:24:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Comment 43 Ben Bennett 2017-10-31 19:40:53 UTC
*** Bug 1496647 has been marked as a duplicate of this bug. ***

