Description of problem:

After investigating the shared gateway "transport is closing" issues (https://bugzilla.redhat.com/show_bug.cgi?id=1872470) between kube-apiserver and etcd, I found that the issue is a source port collision between host-networked and pod-networked processes. Consider the following shared gateway mode topology:

          172.20.0.4:4444                   172.20.0.3 (snat)             10.244.1.5
etcd server (node 2)------------(node1)----br-ex----OVN GR----<OVN network>---pod 1
                                             |
                                             |----- host stack (kube-apiserver)

In this case both the host interface and the OVN GR share the same IP, 172.20.0.3. kube-apiserver makes many connections per second to the etcd server, choosing a random source port each time. At the same time, pod 1 may be openshift-apiserver, which also makes many connections to the etcd server and chooses random source ports. The openshift-apiserver source IP is SNAT'ed by the GR to the host IP, 172.20.0.3.

At some point both processes will try to connect using the same source port (say 8000 for this example). Assume there is already a connection from kube-apiserver to etcd using that port. Normally, Linux conntrack notices the conflict during SNAT (the port is already in use) and chooses a different source port. Local gateway mode output:

root@ovn-worker:/# conntrack -L | grep 4444
tcp 6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=53039 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=1
tcp 6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=25 use=1
tcp 6 86293 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1

We can see above that the pod's SNAT chose 53039 as the new source port instead of reusing 8000. However, in shared gateway mode:

root@ovn-worker:/# conntrack -L | grep 4444
tcp 6 86394 ESTABLISHED src=172.20.0.3 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=1
tcp 6 86322 ESTABLISHED src=172.20.0.3 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 86394 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=15 use=1
tcp 6 86394 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=25 use=1
conntrack v1.4.6 (conntrack-tools): 219 flow entries have been shown.

We can see that the source port remained 8000 during the SNAT, because OVN performs the SNAT in its own CT zone and therefore never sees the host's connection. This ends up causing etcd to receive packets from two different connections on node 1 that look like they belong to the same TCP connection.

There are two possible solutions here:

1. Split the kernel's local_port_range into two segments and give one slice to OVN, so that SNAT on the gateway uses a unique port range.
2. Make OVN configurable so that it can use the default CT zone on the GR, so that the SNAT will notice the conflict and choose another port.

I think option 2 is better.
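For reference, the collision can be reproduced by hand with pinned source ports (a sketch using the addresses from the example above; it assumes nc is available on the host and in the pod, and "some-pod" is a hypothetical pod name):

# On the host (the kube-apiserver path), hold a connection to etcd
# from a fixed source port:
nc -p 8000 172.20.0.4 4444 &

# From a pod behind the OVN gateway (the openshift-apiserver path),
# pin the same source port:
kubectl exec some-pod -- nc -p 8000 172.20.0.4 4444 &

# In shared gateway mode both flows are then SNAT'ed to the same tuple,
# 172.20.0.3:8000 -> 172.20.0.4:4444, as in the conntrack output above:
conntrack -L | grep 4444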
Hi Tim, I have a patch ready that does option 2. I'm currently working on making a scratch RPM so you can test it out. I'll give further instructions once I have the RPMs created.
Thanks Mark. Testing your patch locally with KIND, things are working correctly:

conntrack v1.4.6 (conntrack-tools): 211 flow entries have been shown.
tcp 6 86364 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=47078 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=2
tcp 6 86364 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8888 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=27 use=1
tcp 6 86315 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=8888 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 86364 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=47078 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=47078 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=1

The pod that also attempted to use source port 8888 was changed to use 47078.
I've sent an upstream version of the patch here:

https://patchwork.ozlabs.org/project/ovn/patch/20201112145621.155336-1-mmichels@redhat.com/

The main difference between it and the version I initially shared with you is that in this one, you set options:snat-ct-zone=<integer> on the logical router. If you want to use the default zone, use "0".
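For example (a sketch; the router name R1 is borrowed from the test script below):

# SNAT on R1 in the default CT zone, so port conflicts with the host are detected:
ovn-nbctl set Logical_Router R1 options:snat-ct-zone=0

# Or pin R1's SNAT to a specific zone:
ovn-nbctl set Logical_Router R1 options:snat-ct-zone=111

# Removing the option should return R1 to an automatically assigned zone
# (my assumption about the default behavior):
ovn-nbctl remove Logical_Router R1 options snat-ct-zone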
Tested with the following script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.161.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.161.25
systemctl restart ovn-controller

ovn-nbctl lr-add R1
ovn-nbctl lr-add R2
ovn-nbctl lr-add R3
ovn-nbctl set logical_router R1 options:chassis=hv1
ovn-nbctl set logical_router R2 options:chassis=hv1
ovn-nbctl set logical_router R3 options:chassis=hv1

ovn-nbctl ls-add foo
ovn-nbctl ls-add bar
ovn-nbctl ls-add alice
ovn-nbctl ls-add bob
ovn-nbctl ls-add join

ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
    type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"
ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
    type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
    type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \
    type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"

ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
    type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
    type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
    type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'

ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R2 2001::/64 4000::1
ovn-nbctl lr-route-add R2 2002::/64 4000::1
ovn-nbctl lr-route-add R3 2001::/64 4000::1
ovn-nbctl lr-route-add R3 2002::/64 4000::1
ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4
ovn-nbctl lr-route-add R2 1111::/64 3001::3
ovn-nbctl lr-route-add R3 1111::/64 3002::4

ovn-nbctl --wait=hv sync

R1_nb_uuid=$(ovn-nbctl get Logical_Router R1 _uuid)
R1_sb_uuid=$(ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=${R1_nb_uuid})
R1_snat_zone=$(ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat | tr -d \")
echo "R1:$R1_snat_zone"

R2_nb_uuid=$(ovn-nbctl get Logical_Router R2 _uuid)
R2_sb_uuid=$(ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=${R2_nb_uuid})
R2_snat_zone=$(ovs-vsctl get bridge br-int external-ids:ct-zone-${R2_sb_uuid}_snat | tr -d \")
echo "R2:$R2_snat_zone"

ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=111
ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat

ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=$R2_snat_zone
ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat
ovs-vsctl get bridge br-int external-ids:ct-zone-${R2_sb_uuid}_snat

Result on 20.12.0-1:
[root@wsfd-advnetlab17 bz1892311]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
ovn2.13-central-20.12.0-1.el8fdp.x86_64
python3-openvswitch2.13-2.13.0-77.el8fdp.x86_64
ovn2.13-host-20.12.0-1.el8fdp.x86_64
ovn2.13-20.12.0-1.el8fdp.x86_64
openvswitch2.13-2.13.0-77.el8fdp.x86_64

++ ovn-nbctl get Logical_Router R1 _uuid
+ R1_nb_uuid=d9d5dfab-b76d-4626-8c82-160056f5dc5a
++ ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=d9d5dfab-b76d-4626-8c82-160056f5dc5a
+ R1_sb_uuid=364abef1-03d8-436a-aae7-e3fddfdb0963
++ tr -d '"'
++ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
+ R1_snat_zone=7
+ echo R1:7
R1:7
++ ovn-nbctl get Logical_Router R2 _uuid
+ R2_nb_uuid=f2a48d27-c71c-44ed-ab8f-e40c75fff270
++ ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=f2a48d27-c71c-44ed-ab8f-e40c75fff270
+ R2_sb_uuid=0cdf4e5a-1fa8-4b92-8894-69ae791d603a
++ ovs-vsctl get bridge br-int external-ids:ct-zone-0cdf4e5a-1fa8-4b92-8894-69ae791d603a_snat
++ tr -d '"'
+ R2_snat_zone=6
+ echo R2:6
R2:6
+ ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=111
+ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
"111"    <=== changed to 111
+ ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=6
+ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
"6"      <=== changed to zone id for R2
+ ovs-vsctl get bridge br-int external-ids:ct-zone-0cdf4e5a-1fa8-4b92-8894-69ae791d603a_snat
"7"      <=== zone id for R2 is changed
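The zone lookup used above generalizes to a small helper (a sketch built from the same commands as the script; the function name snat_zone is mine):

# Print the SNAT CT zone that ovn-controller currently assigns to a logical router.
snat_zone() {
    nb_uuid=$(ovn-nbctl get Logical_Router "$1" _uuid)
    sb_uuid=$(ovn-sbctl --bare --columns=_uuid find Datapath_Binding \
        external-ids:logical-router="$nb_uuid")
    ovs-vsctl get bridge br-int "external-ids:ct-zone-${sb_uuid}_snat" | tr -d '"'
}

snat_zone R1    # prints 6 after the final set command above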
Verified on rhel7 version:

:: [ 03:26:18 ] :: [ BEGIN ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat'
"1"
:: [ 03:26:18 ] :: [ PASS ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat' (Expected 0, got 0)
:: [ 03:26:18 ] :: [ BEGIN ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-50957a04-115d-46b2-8d71-3434ded93ded_snat'
"17"
:: [ 03:26:18 ] :: [ PASS ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-50957a04-115d-46b2-8d71-3434ded93ded_snat' (Expected 0, got 0)
:: [ 03:26:18 ] :: [ BEGIN ] :: Running 'ovn-nbctl --wait=hv set Logical_Router R2 options:snat-ct-zone=123'
:: [ 03:26:18 ] :: [ PASS ] :: Command 'ovn-nbctl --wait=hv set Logical_Router R2 options:snat-ct-zone=123' (Expected 0, got 0)
:: [ 03:26:18 ] :: [ BEGIN ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat | grep 123'
"123"
:: [ 03:26:18 ] :: [ PASS ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat | grep 123' (Expected 0, got 0)
:: [ 03:26:18 ] :: [ BEGIN ] :: Running 'ip netns exec alice1 ping -q 30.0.0.1 -c 1'
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.017/2.017/2.017/0.000 ms
:: [ 03:26:18 ] :: [ PASS ] :: Command 'ip netns exec alice1 ping -q 30.0.0.1 -c 1' (Expected 0, got 0)
:: [ 03:26:19 ] :: [ BEGIN ] :: Running 'ssh -q wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ip netns exec bob1 ping -q 30.0.0.1 -c 1'
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.399/0.399/0.399/0.000 ms
:: [ 03:26:19 ] :: [ PASS ] :: Command 'ssh -q wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ip netns exec bob1 ping -q 30.0.0.1 -c 1' (Expected 0, got 0)
:: [ 03:26:19 ] :: [ BEGIN ] :: Running 'ip netns exec alice1 ping6 -q 6010::1 -c 1'
PING 6010::1(6010::1) 56 data bytes

--- 6010::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.596/2.596/2.596/0.000 ms

[root@wsfd-advnetlab16 nat]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-70.el7fdp.x86_64
openvswitch2.13-2.13.0-70.el7fdp.x86_64
ovn2.13-20.12.0-1.el7fdp.x86_64
ovn2.13-host-20.12.0-1.el7fdp.x86_64
ovn2.13-central-20.12.0-1.el7fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0407