Arp is definitely broken:

$ grep 10.1.12.236 broken/arp dest/arp good/arp
broken/arp:10.1.12.236 dev tun0 lladdr 6e:de:1c:a5:d7:76 REACHABLE
dest/arp:10.1.12.236 dev tun0 lladdr 56:f4:5a:9e:48:d7 STALE
good/arp:10.1.12.236 dev tun0 lladdr 56:f4:5a:9e:48:d7 REACHABLE

Not sure yet what is keeping that entry REACHABLE on the broken node... it should have timed out after a few minutes:

$ grep ipv4.neigh.eth0 sysctl
net.ipv4.neigh.eth0.anycast_delay = 100
net.ipv4.neigh.eth0.app_solicit = 0
net.ipv4.neigh.eth0.base_reachable_time_ms = 30000
net.ipv4.neigh.eth0.delay_first_probe_time = 5
net.ipv4.neigh.eth0.gc_stale_time = 60
net.ipv4.neigh.eth0.locktime = 100
net.ipv4.neigh.eth0.mcast_solicit = 3
net.ipv4.neigh.eth0.proxy_delay = 80
net.ipv4.neigh.eth0.proxy_qlen = 64
net.ipv4.neigh.eth0.retrans_time_ms = 1000
net.ipv4.neigh.eth0.ucast_solicit = 3
net.ipv4.neigh.eth0.unres_qlen = 31
net.ipv4.neigh.eth0.unres_qlen_bytes = 65536
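With base_reachable_time_ms = 30000 the kernel should normally let a confirmed entry fall back to STALE within roughly 15-45 seconds of the last confirmation (the reachable time is randomized around the base value), so something must keep re-confirming that entry. A minimal way to watch it on the broken node, reusing the tun0 entry above as a sketch:

  # sample the suspect entry every 5 seconds and watch its state and counters
  watch -n 5 'ip -s neigh | grep 10.1.12.236'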
Perhaps we should send a gratuitous ARP when the pod comes up. However, there's still something fishy here because I don't understand what was keeping the arp cache current on the broken node.
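If we did go the gratuitous ARP route, a minimal sketch (an assumption, not something the plugin does today) would be to run iputils arping from inside the new pod's network namespace right after it comes up; <pod-pid> is a placeholder for the pod's container PID and the IP is the pod's own:

  # hypothetical: announce the pod's IP/MAC so peers refresh their ARP caches
  nsenter -t <pod-pid> -n arping -U -c 3 -I eth0 10.1.12.236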
The issue returned for the customer over the weekend.
The issue over the weekend seems to have disappeared when Sten pinged every node from every other node (or else it disappeared shortly before he did that).
That's obviously not a long-term solution. Ben, DanW, what do we need/want collected the next time this goes nuts? I heard something about a tcpdump from the router pod? What exactly do we want/need?
I think I want a tcpdump from the source node filtered by the target pod IP. So, from the source node:

  tcpdump -i vxlan_sys_4789 host <your-pod-ip-here> -w /tmp/dump

[Please run that for a minute or two]

Also, from the source node:

  ip -s neigh

(Do that last one a few times 10 seconds or so apart)

I'd also like the target node logs to see how quickly it is cycling the pods on that node.
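A rough way to collect both at once from the source node (just a sketch; POD_IP, the two-minute duration, and the output paths are placeholders):

  POD_IP=10.1.12.236                                                   # placeholder: the target pod IP
  timeout 120 tcpdump -i vxlan_sys_4789 host "$POD_IP" -w /tmp/dump &  # capture for ~2 minutes
  for i in $(seq 1 12); do                                             # sample the neighbour table every 10s
      date
      ip -s neigh
      sleep 10
  done > /tmp/neigh.log 2>&1
  wait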
# ip -s neigh | grep 13\\.184
10.1.13.184 dev tun0 lladdr a2:cb:51:22:a4:f5 ref 1 used 2550/3/2548 probes 1 REACHABLE
If this happens again, running:

  ip neigh flush dev tun0

on the infra nodes should get it back to life. We could consider running this in a loop, every minute or so, to make sure it can never be broken for more than 1 minute...
Needless to say, this is just a stop-gap until we work out the real problem and a proper solution.
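For completeness, a minimal sketch of that stop-gap loop (an assumption about how it could be run on each infra node, not something already in place):

  # run on each infra node: flush the tun0 neighbour cache once a minute
  while true; do
      ip neigh flush dev tun0   # clear any stale/poisoned entries
      sleep 60
  done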
Did we put the loop into place yet? I think it would be a good idea just to avoid further outages, if we think we know what the issue is and don't need to replicate it again.
Maybe the problem has something to do with this (though that's against OVS 2.6...): https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327819.html
What loglevel is the cluster running at? If 3 or higher, can I get:

  journalctl -l -u openshift-node.service | egrep 'Calling network plugin|creating hosts mount'

from the router and a node with a broken pod?
They run at loglevel 2.
*** Bug 1457399 has been marked as a duplicate of this bug. ***
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/256ddc5164e8ef175b04f12d97dd44b64d5d1771

Change the MAC addresses to be generated based on IP

The ARP cache seems to misbehave in some clusters, and if an IP address is reused then the MAC address can get out of sync (the root cause has not been identified). When that happens, the target pod looks at the MAC address and drops the packets on the floor despite the IP address being correct.

This patch changes the code to generate the MAC address based on the IP address so that when an IP is reused, it will have the same MAC address and the packets will flow. This behavior is consistent with what Docker does when allocating MACs, and what Kubenet does too. It may change in the future, but this fixes our CNI plugin now.

Fixes bug 1451854 (https://bugzilla.redhat.com/show_bug.cgi?id=1451854)
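To illustrate the mapping (inferred from the verification output later in this bug, where pod IP 10.128.0.9 gets MAC 0a:58:0a:80:00:09, i.e. a fixed 0a:58 prefix followed by the four IPv4 bytes; this is a sketch, not the actual origin code):

  # assumed scheme: 0a:58 prefix + the pod's IPv4 address bytes
  ip=10.128.0.9
  printf '0a:58:%02x:%02x:%02x:%02x\n' $(echo "$ip" | tr '.' ' ')
  # -> 0a:58:0a:80:00:09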
*** Bug 1460328 has been marked as a duplicate of this bug. ***
PR https://github.com/openshift/origin/issues/14685
Verified in atomic-openshift-3.6.121-1.git.0.d763f6f.el7.x86_64: the pod MAC address generation has been changed and no other issue was found.

3: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8951 qdisc noqueue state UP
    link/ether 0a:58:0a:80:00:09 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.9/23 scope global eth0
Edited bug description (with sensitive info removed)
----------------------------------------------------

ping from a node (source) to a pod (dest) on another node failed.
ping from a pod on the source node to the dest failed.
tcpdump showed the echo request getting into the dest pod.
tcpdump did not show an echo reply at all.

There was some time between the pings and the collection of the ip neigh info, but ip neigh inside the dest did not show the source node ip address(es) at all. A ping from the dest to the source worked. Now everything seems to be working.

tcpdump filter used:

  tcpdump -n -i vxlan_sys_4789 icmp and host 10.1.12.236 and not host 10.1.0.1

The three commands I'm running (from the failing node):

  ping 10.1.12.236
  nsenter -t 76272 -n ping 10.1.12.236
  ping -I 172.31.49.159 10.1.12.236

tcpdump collected on source node:

14:42:47.792901 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 6, length 64
14:42:48.793909 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 7, length 64
14:42:49.795900 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 8, length 64
14:42:50.796885 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 54425, seq 9, length 64
14:42:57.097058 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 54665, seq 1, length 64
14:42:57.097629 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 54665, seq 1, length 64
14:42:58.098506 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 54665, seq 2, length 64
14:42:58.098861 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 54665, seq 2, length 64
14:51:19.493525 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 1, length 64
14:51:20.493506 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 2, length 64
14:51:21.492936 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 3, length 64
14:51:22.493891 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 2202, seq 4, length 64

tcpdump from the dest node:

14:44:27.521357 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 1, length 64
14:44:28.522166 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 2, length 64
14:44:29.523209 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 3, length 64
14:44:30.524163 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 56760, seq 4, length 64
14:44:34.577879 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 56933, seq 1, length 64
14:44:34.578166 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 56933, seq 1, length 64
14:44:35.578124 IP 10.1.11.249 > 10.1.12.236: ICMP echo request, id 56933, seq 2, length 64
14:44:35.578154 IP 10.1.12.236 > 10.1.11.249: ICMP echo reply, id 56933, seq 2, length 64
14:48:24.792790 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 1, length 64
14:48:25.792237 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 2, length 64
14:48:26.793152 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 3, length 64
14:48:27.793552 IP 172.31.49.159 > 10.1.12.236: ICMP echo request, id 62344, seq 4, length 64

tcpdump collected inside the target container net namespace:

# nsenter -t 85201 -n tcpdump -n -i any icmp and not host 10.1.0.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
14:54:00.592514 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 1, length 64
14:54:01.592382 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 2, length 64
14:54:02.593354 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 3, length 64
14:54:03.593350 IP 10.1.11.1 > 10.1.12.236: ICMP echo request, id 5834, seq 4, length 64

I don't know who 10.1.0.1 is, but they are pinging this container as well, and it is responding to them.

# nsenter -t 85201 -n tcpdump -n -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
14:55:28.097837 IP 10.1.0.1 > 10.1.12.236: ICMP echo request, id 43905, seq 30511, length 64
14:55:28.097868 IP 10.1.12.236 > 10.1.0.1: ICMP echo reply, id 43905, seq 30511, length 64
14:55:29.098831 IP 10.1.0.1 > 10.1.12.236: ICMP echo request, id 43905, seq 30512, length 64
14:55:29.098876 IP 10.1.12.236 > 10.1.0.1: ICMP echo reply, id 43905, seq 30512, length 64

# nsenter -t 85201 -n -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if19835: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP
    link/ether 56:f4:5a:9e:48:d7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.12.236/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::54f4:5aff:fe9e:48d7/64 scope link
       valid_lft forever preferred_lft forever

# nsenter -t 85201 -n -- ip route
default via 10.1.12.1 dev eth0
10.1.0.0/16 dev eth0
10.1.12.0/24 dev eth0 proto kernel scope link src 10.1.12.236

# nsenter -t 85201 -n -- iptables-save
# Generated by iptables-save v1.4.21 on Wed May 17 14:58:47 2017
*filter
:INPUT ACCEPT [119617:7776321]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [91105:11477721]
COMMIT
# Completed on Wed May 17 14:58:47 2017
# Generated by iptables-save v1.4.21 on Wed May 17 14:58:47 2017
*nat
:PREROUTING ACCEPT [28634:1718328]
:INPUT ACCEPT [28634:1718328]
:OUTPUT ACCEPT [5:346]
:POSTROUTING ACCEPT [5:346]
COMMIT
# Completed on Wed May 17 14:58:47 2017

# nsenter -t 85201 -n -- ip neigh
10.1.12.1 dev eth0 lladdr 32:2d:c2:20:d6:50 STALE
10.1.11.249 dev eth0 lladdr 72:d4:e1:07:93:95 STALE
10.1.0.1 dev eth0 lladdr be:85:76:59:2c:94 REACHABLE
10.1.8.1 dev eth0 lladdr fa:da:95:b0:fe:d9 STALE

# nsenter -t 85201 -n -- ping 10.1.11.1
PING 10.1.11.1 (10.1.11.1) 56(84) bytes of data.
64 bytes from 10.1.11.1: icmp_seq=1 ttl=64 time=1.05 ms
64 bytes from 10.1.11.1: icmp_seq=2 ttl=64 time=0.274 ms
64 bytes from 10.1.11.1: icmp_seq=3 ttl=64 time=0.315 ms
64 bytes from 10.1.11.1: icmp_seq=4 ttl=64 time=0.340 ms
^C
--- 10.1.11.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.274/0.496/1.056/0.324 ms

After I pinged from the "dest" pod to the source node, everything just magically started working and we have no idea how to break it again. Here is the "dest" pod ip neigh table now that things are working:

# nsenter -t 85201 -n -- ip neigh
10.1.12.1 dev eth0 lladdr 32:2d:c2:20:d6:50 STALE
10.1.5.1 dev eth0 lladdr 62:1c:4e:c9:69:41 STALE
10.1.11.249 dev eth0  FAILED
10.1.13.1 dev eth0 lladdr 0a:04:0b:f0:8f:e4 STALE
10.1.6.1 dev eth0 lladdr 4e:ef:70:a7:42:8c STALE
10.1.14.1 dev eth0 lladdr 82:00:6b:93:27:75 STALE
10.1.7.1 dev eth0 lladdr 3e:b5:03:87:3a:18 STALE
10.1.11.1 dev eth0 lladdr 66:26:80:5e:9d:53 REACHABLE
10.1.0.1 dev eth0 lladdr be:85:76:59:2c:94 REACHABLE
10.1.8.1 dev eth0 lladdr fa:da:95:b0:fe:d9 STALE

Looking for pods with duplicate IP addresses:

# oc get pods --all-namespaces -o wide | awk '{print $7}' | sort | uniq -d | wc -l
0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
*** Bug 1496647 has been marked as a duplicate of this bug. ***