Description of problem: When deploying ovn-kubernetes without configuring the normal allow-related ACL for mgmt traffic, deployments sometimes fail because the coredns pods cannot become ready. They cannot become ready because they are unable to contact the K8S API server (north/south traffic). From tcpdump it can be seen that the packet does make it from the pod to the API server and is SNATed and DNATed accordingly. However, the return traffic arrives back at the pod as a SYN-ACK that has not been un-DNATed. This causes the pod to send a TCP RST to the unknown endpoint IP:

[root@pod1 /]# tcpdump -i any -nn -vv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
15:57:38.161176 IP (tos 0x0, ttl 64, id 19562, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.0.4.45584 > 10.96.0.1.443: Flags [S], cksum 0x1587 (incorrect -> 0x0d60), seq 1779773811, win 65280, options [mss 1360,sackOK,TS val 2393262256 ecr 0,nop,wscale 7], length 0
15:57:38.163072 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.2.6443 > 10.244.0.4.45584: Flags [S.], cksum 0xc2ed (correct), seq 550974793, ack 1779773812, win 65160, options [mss 1460,sackOK,TS val 853084249 ecr 2393262256,nop,wscale 7], length 0
15:57:38.163102 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.0.4.45584 > 172.17.0.2.6443: Flags [R], cksum 0x9210 (correct), seq 1779773812, win 0, length 0

Simply adding an allow-related ACL that matches nothing fixes the problem (because it forces all traffic on the switch into conntrack):

ovn-nbctl acl-add ovn-control-plane from-lport 1001 ip4.dst=1.1.1.1 allow-related

17:57:24.431979 IP 10.244.0.5.59458 > 10.96.0.1.443: Flags [S], seq 3193775616, win 65280, options [mss 1360,sackOK,TS val 3432435366 ecr 0,nop,wscale 7], length 0
17:57:24.434159 IP 10.96.0.1.443 > 10.244.0.5.59458: Flags [S.], seq 1148510621, ack 3193775617, win 64704, options [mss 1360,sackOK,TS val 3640936395 ecr 3432435366,nop,wscale 7], length 0
17:57:24.434193 IP 10.244.0.5.59458 > 10.96.0.1.443: Flags [.], ack 1, win 510, options [nop,nop,TS val 3432435368 ecr 3640936395], length 0

More info can be found here: https://gist.github.com/trozet/d6e42b71f5d8cc3e04dc49a5111f789c
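To illustrate the mechanism, here is a minimal sketch in Python (not OVN code; the packet dicts and function names are hypothetical) of why the reply above arrives un-unDNATed: a load-balanced reply can only be rewritten back to the service VIP if the original connection was committed to conntrack, which the allow-related ACL forces. The addresses mirror the capture above.

```python
# Hypothetical model of DNAT state tracking. Not OVN's implementation;
# it only shows why replies need connection state to be un-DNATed.

conntrack = {}  # (client_ip, client_port, vip, vip_port) -> backend (ip, port)

def outbound(pkt, use_conntrack):
    """Client -> VIP: DNAT the packet to the backend; optionally commit state."""
    backend = ("172.17.0.2", 6443)  # backend behind VIP 10.96.0.1:443 in the capture
    if use_conntrack:
        conntrack[(pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])] = backend
    return {**pkt, "dst": backend[0], "dport": backend[1]}

def inbound_reply(pkt):
    """Backend -> client: un-DNAT only if a matching conntrack entry exists."""
    for (csrc, csport, vip, vport), be in conntrack.items():
        if (pkt["src"], pkt["sport"]) == be and (pkt["dst"], pkt["dport"]) == (csrc, csport):
            return {**pkt, "src": vip, "sport": vport}  # rewritten back to the VIP
    # No state: the reply leaks the backend address, so the client sends a RST.
    return pkt
```

With `use_conntrack=True` the SYN-ACK comes back "from" 10.96.0.1:443 as the pod expects; with `use_conntrack=False` it comes back from 172.17.0.2:6443, matching the failing capture.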
Created attachment 1698139 [details] logs, dbs
For some reason this problem does not happen on every deployment; I would say it happens around 50% of the time. I'll also attach all the logs and dbs from a working setup so the two can be compared.
Created attachment 1698172 [details] logs and dbs for when things work
This problem is hard to address without using conntrack. I'm working on an approach that sends traffic to conntrack only when necessary, as opposed to sending all traffic to conntrack whenever there is even one ACL with the allow-related action. I'm still not sure whether that approach will work out, but I'm giving it a try and working on a POC. I'll keep updating the status here.

There is another related BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1836804. So if this approach works, then ovn-k8s can continue to use allow-related (or a new type - allow-reply) ACLs.

Thanks
Numan
Found the issue. I've submitted the patch to fix it - https://patchwork.ozlabs.org/project/openvswitch/patch/20200707131622.581859-1-numans@ovn.org/
Steps to reproduce the issue
--------

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1 -- lsp-set-addresses ls1p1 "10:14:00:00:00:04 10.0.0.4"
ovn-nbctl lsp-add ls1 ls1p2 -- lsp-set-addresses ls1p2 "10:14:00:00:00:05 10.0.0.5"
ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:ff:01 10.0.0.1/24
ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-addresses ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lb-add lb1 "10.0.0.10" "10.0.0.5"
ovn-nbctl ls-lb-add ls1 lb1
ovn-nbctl lr-lb-add lr1 lb1
ovn-nbctl lb-add lb2 "10.0.0.20" "10.0.0.5"
ovn-nbctl ls-lb-add ls1 lb2
ovn-nbctl lr-lb-add lr1 lb2

# On any node where ovn-controller is running
ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set lo up
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip link set ls1p1 address 10:14:00:00:00:04
ip netns exec ls1p1 ip addr add 10.0.0.4/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 10.0.0.1 dev ls1p1
ovs-vsctl set Interface ls1p1 external_ids:iface-id=ls1p1

ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal
ip netns add ls1p2
ip link set ls1p2 netns ls1p2
ip netns exec ls1p2 ip link set lo up
ip netns exec ls1p2 ip link set ls1p2 up
ip netns exec ls1p2 ip link set ls1p2 address 10:14:00:00:00:05
ip netns exec ls1p2 ip addr add 10.0.0.5/24 dev ls1p2
ip netns exec ls1p2 ip route add default via 10.0.0.1 dev ls1p2
ovs-vsctl set Interface ls1p2 external_ids:iface-id=ls1p2

# Ping the VIPs. Should work fine.
ip netns exec ls1p1 ping 10.0.0.10 -c3
ip netns exec ls1p1 ping 10.0.0.20 -c3

# Clear the vips on one of the load balancers.
lb=$(ovn-nbctl --bare --columns load_balancer list logical_switch ls1 | cut -d ' ' -f2)
ovn-nbctl clear load_balancer $lb vips

# Now ping from ls1p1 to the load balancer vip which is still set.
lb1=$(ovn-nbctl --bare --columns load_balancer list logical_switch ls1 | cut -d ' ' -f1)
ovn-nbctl get load_balancer $lb1 vips

If the vip set on $lb1 is 10.0.0.20, then:

Actual
[root@ovn-chassis-1 ~]# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=1.13 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.126 ms

This is wrong. The reply should come from the VIP - 10.0.0.20.

Expected
[root@ovn-chassis-1 ~]# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.20: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 10.0.0.20: icmp_seq=2 ttl=64 time=1.30 ms
64 bytes from 10.0.0.20: icmp_seq=3 ttl=64 time=0.165 ms
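To avoid eyeballing the ping output, the actual-vs-expected check above can be scripted. This is a rough sketch that only parses the "64 bytes from <ip>:" line format shown in the reproducer; the `$ping_out` variable is a stand-in for the captured output of the ping command.

```shell
#!/bin/sh
# Extract the source IP of the first ICMP reply and compare it with the VIP.
# ping_out stands in for: ip netns exec ls1p1 ping 10.0.0.20 -c1
vip="10.0.0.20"
ping_out="64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=1.13 ms"

# Field 4 of the reply line is the source IP with a trailing colon.
reply_src=$(printf '%s\n' "$ping_out" | awk '/bytes from/ {gsub(":", "", $4); print $4; exit}')

if [ "$reply_src" = "$vip" ]; then
    echo "OK: reply came from the VIP ($reply_src)"
else
    echo "BUG: reply came from backend $reply_src, expected VIP $vip"
fi
```

With the "Actual" output above this prints the BUG line; on a fixed build it prints the OK line.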
I tagged the build into OCP 4.6 since we're still under development branch rules there.
Using the reproducer in comment 6, I can reproduce the issue on this version:

# rpm -qa|grep ovn
ovn2.13-host-2.13.0-37.el8fdp.x86_64
ovn2.13-2.13.0-37.el8fdp.x86_64
ovn2.13-central-2.13.0-37.el8fdp.x86_64

About half of the time, the ping gets the wrong reply IP:

# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=1.13 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.126 ms

On the latest version:

# rpm -qa|grep ovn
ovn2.13-host-2.13.0-39.el8fdp.x86_64
ovn2.13-2.13.0-39.el8fdp.x86_64
ovn2.13-central-2.13.0-39.el8fdp.x86_64

I ran it many times, and the ping got the right reply IP every time:

# ip netns exec ls1p1 ping 10.0.0.20
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.
64 bytes from 10.0.0.20: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 10.0.0.20: icmp_seq=2 ttl=64 time=1.30 ms
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3150