Bug 1892311 - Need a way to avoid host and OVN GR source port collision when doing SNAT
Summary: Need a way to avoid host and OVN GR source port collision when doing SNAT
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Mark Michelson
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-28 13:15 UTC by Tim Rozet
Modified: 2021-02-03 21:55 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-03 21:55:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:0407 0 None None None 2021-02-03 21:55:27 UTC

Description Tim Rozet 2020-10-28 13:15:08 UTC
Description of problem:
After investigating the shared gateway "transport is closing" issues (https://bugzilla.redhat.com/show_bug.cgi?id=1872470) between kube-apiserver and etcd, I found that the root cause is a source port collision between host-networked and pod-networked processes.

Consider the following shared gateway mode topology:

172.20.0.4:4444                  172.20.0.3         (snat)                  10.244.1.5
etcd server (node 2)------------(node1)----br-ex----OVN GR----<OVN network>---pod 1
                                              |
                                              | 
                                              |----- host stack (kube-apiserver)

In this case both the host interface and the OVN GR share the same IP, 172.20.0.3. kube-apiserver makes many connections per second to the etcd server, choosing a random source port. At the same time, pod 1 may be openshift-apiserver, which also makes many connections to the etcd server with random source ports. The openshift-apiserver source IP gets SNAT'ed by the GR to the host IP, 172.20.0.3. At some point both processes will try to connect using the same source port (let's use 8000 for this example). Assume there is already a connection from kube-apiserver to etcd. Normally Linux would choose a different source port during the SNAT, because it notices the port is already in use:

(local gateway mode output):
root@ovn-worker:/# conntrack -L | grep 4444
tcp      6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=53039 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=1
tcp      6 86379 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=25 use=1
tcp      6 86293 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1

We can see in the above that the pod's SNAT chose 53039 as the new source port instead of reusing 8000. However, in shared gateway mode:

root@ovn-worker:/# conntrack -L | grep 4444
tcp      6 86394 ESTABLISHED src=172.20.0.3 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=1
tcp      6 86322 ESTABLISHED src=172.20.0.3 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86394 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=172.20.0.3 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=15 use=1
tcp      6 86394 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8000 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=25 use=1
conntrack v1.4.6 (conntrack-tools): 219 flow entries have been shown.

We can see that the source port remained 8000 through the SNAT, because OVN performs it in a separate CT zone where the existing host connection is not visible. As a result, etcd receives packets from two distinct connections on node1 that look like they belong to the same TCP connection.
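The per-router SNAT zone assignments that ovn-controller records can be inspected on the integration bridge; the `ct-zone-<datapath-uuid>_snat` external-ids keys are the ones queried in the verification script later in this bug, and the UUIDs vary per deployment:

```shell
# Dump the external_ids of br-int, which include one ct-zone-* entry per
# datapath plus a ct-zone-<uuid>_snat entry per gateway router's SNAT zone.
ovs-vsctl --columns=external_ids list Bridge br-int
```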

There are two possible solutions here:
1. Split the kernel's local_port_range into two segments and dedicate one slice to OVN for SNAT on the gateway, so the host stack and the GR allocate from disjoint port ranges.
2. Make OVN configurable so the GR can use the default CT zone, in which case the SNAT will detect the conflict and choose another port.

I think option 2 is better.
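For completeness, option 1 would amount to something like the following sketch. The split point (49151) is only illustrative, and OVN would additionally need a (currently nonexistent) knob to confine its SNAT allocations to the reserved slice:

```shell
# Sketch of option 1 (illustrative values): shrink the host's ephemeral
# range to 32768-49151, leaving 49152-60999 free for the GR's SNAT.
sysctl -w net.ipv4.ip_local_port_range="32768 49151"
```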

Comment 1 Mark Michelson 2020-10-29 13:09:47 UTC
Hi Tim, I have a patch ready that does option 2. I'm currently working on making a scratch RPM so you can test it out. I'll give further instructions once I have the RPMs created.

Comment 4 Tim Rozet 2020-11-03 16:19:33 UTC
Thanks Mark. Locally testing your patch with KIND, things are working correctly:
conntrack v1.4.6 (conntrack-tools): 211 flow entries have been shown.
tcp      6 86364 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=47078 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=2
tcp      6 86364 ESTABLISHED src=10.244.1.5 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=10.244.1.5 sport=4444 dport=8888 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=27 use=1
tcp      6 86315 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=8888 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=8888 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 86364 ESTABLISHED src=172.20.0.2 dst=172.20.0.4 sport=47078 dport=4444 src=172.20.0.4 dst=172.20.0.2 sport=4444 dport=47078 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=1

The pod that also attempted to use 8888 was changed to use 47078.

Comment 5 Mark Michelson 2020-11-12 15:39:02 UTC
I've sent an upstream version of the patch here: https://patchwork.ozlabs.org/project/ovn/patch/20201112145621.155336-1-mmichels@redhat.com/

The main difference from the version I initially shared with you is that in this one you set options:snat-ct-zone=<integer> on the logical router. If you want to use the default zone, use "0".
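As a concrete sketch of the setting described above (the router name GR_node1 is hypothetical; zone 0 is the kernel's default conntrack zone, which is what lets the SNAT see and avoid the host's existing connections):

```shell
# Make the gateway router SNAT in the default CT zone
# (router name is illustrative; any valid zone integer may be used)
ovn-nbctl set Logical_Router GR_node1 options:snat-ct-zone=0
```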

Comment 8 Jianlin Shi 2020-12-24 07:39:21 UTC
Tested with the following script:

systemctl start openvswitch
systemctl start ovn-northd                                           
ovn-nbctl set-connection ptcp:6641                                    
ovn-sbctl set-connection ptcp:6642                                           
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.161.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.161.25
systemctl restart ovn-controller                                
                                                                           
ovn-nbctl lr-add R1
ovn-nbctl lr-add R2                                                  
ovn-nbctl lr-add R3                                                
                                                                               
ovn-nbctl set logical_router R1 options:chassis=hv1                  
ovn-nbctl set logical_router R2 options:chassis=hv1                
ovn-nbctl set logical_router R3 options:chassis=hv1                            
                                                                     
ovn-nbctl ls-add foo                                               
ovn-nbctl ls-add bar                                                           
ovn-nbctl ls-add alice
ovn-nbctl ls-add bob                             
ovn-nbctl ls-add join                            
                                           
ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24 2001::1/64
ovn-nbctl lsp-add foo rp-foo -- set logical_switch_port rp-foo \
        type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"

ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24 2002::1/64
ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
        type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
                                            
ovn-nbctl lrp-add R2 alice 00:00:02:01:02:03 172.16.1.1/24 3001::1/64
ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
        type=router options:router-port=alice addresses=\"00:00:02:01:02:03\"
ovn-nbctl lrp-add R3 bob 00:00:03:01:02:03 172.17.1.1/24 3002::1/64
ovn-nbctl lsp-add bob rp-bob -- set Logical_Switch_Port rp-bob \                                                                                                                                           
        type=router options:router-port=bob addresses=\"00:00:03:01:02:03\"                   
                       
ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24 4000::1/64
ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \                                                                                                                                        
        type=router options:router-port=R1_join addresses='"00:00:04:01:02:03"'               
ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24 4000::2/64
ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
        type=router options:router-port=R2_join addresses='"00:00:04:01:02:04"'
ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24 4000::3/64
ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \         
        type=router options:router-port=R3_join addresses='"00:00:04:01:02:05"'
ovn-nbctl lr-route-add R2 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R3 192.168.0.0/16 20.0.0.1
ovn-nbctl lr-route-add R2 2001::/64 4000::1 
ovn-nbctl lr-route-add R2 2002::/64 4000::1
ovn-nbctl lr-route-add R3 2001::/64 4000::1
ovn-nbctl lr-route-add R3 2002::/64 4000::1

ovn-nbctl lr-route-add R2 1.1.1.0/24 172.16.1.3
ovn-nbctl lr-route-add R3 1.1.1.0/24 172.17.1.4
ovn-nbctl lr-route-add R2 1111::/64 3001::3
ovn-nbctl lr-route-add R3 1111::/64 3002::4

ovn-nbctl --wait=hv sync

R1_nb_uuid=$(ovn-nbctl get Logical_Router R1 _uuid)
R1_sb_uuid=$(ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=${R1_nb_uuid})                                                                                             
R1_snat_zone=$(ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat | tr -d \")
echo "R1:$R1_snat_zone"
R2_nb_uuid=$(ovn-nbctl get Logical_Router R2 _uuid)
R2_sb_uuid=$(ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=${R2_nb_uuid})                                                                                             
R2_snat_zone=$(ovs-vsctl get bridge br-int external-ids:ct-zone-${R2_sb_uuid}_snat | tr -d \")
echo "R2:$R2_snat_zone"

ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=111
ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat
ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=$R2_snat_zone
ovs-vsctl get bridge br-int external-ids:ct-zone-${R1_sb_uuid}_snat
ovs-vsctl get bridge br-int external-ids:ct-zone-${R2_sb_uuid}_snat

result on 20.12.0-1:

[root@wsfd-advnetlab17 bz1892311]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
ovn2.13-central-20.12.0-1.el8fdp.x86_64
python3-openvswitch2.13-2.13.0-77.el8fdp.x86_64
ovn2.13-host-20.12.0-1.el8fdp.x86_64
ovn2.13-20.12.0-1.el8fdp.x86_64
openvswitch2.13-2.13.0-77.el8fdp.x86_64

++ ovn-nbctl get Logical_Router R1 _uuid
+ R1_nb_uuid=d9d5dfab-b76d-4626-8c82-160056f5dc5a
++ ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=d9d5dfab-b76d-4626-8c82-160056f5dc5a
+ R1_sb_uuid=364abef1-03d8-436a-aae7-e3fddfdb0963
++ tr -d '"'
++ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
+ R1_snat_zone=7
+ echo R1:7
R1:7
++ ovn-nbctl get Logical_Router R2 _uuid
+ R2_nb_uuid=f2a48d27-c71c-44ed-ab8f-e40c75fff270
++ ovn-sbctl --bare --columns=_uuid find Datapath_Binding external-ids:logical-router=f2a48d27-c71c-44ed-ab8f-e40c75fff270
+ R2_sb_uuid=0cdf4e5a-1fa8-4b92-8894-69ae791d603a
++ ovs-vsctl get bridge br-int external-ids:ct-zone-0cdf4e5a-1fa8-4b92-8894-69ae791d603a_snat
++ tr -d '"'
+ R2_snat_zone=6
+ echo R2:6
R2:6
+ ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=111
+ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
"111"

<=== changed to 111

+ ovn-nbctl --wait=hv set Logical_Router R1 options:snat-ct-zone=6
+ ovs-vsctl get bridge br-int external-ids:ct-zone-364abef1-03d8-436a-aae7-e3fddfdb0963_snat
"6"

<=== changed to zone id for R2

+ ovs-vsctl get bridge br-int external-ids:ct-zone-0cdf4e5a-1fa8-4b92-8894-69ae791d603a_snat
"7"

<=== R2's zone changed to 7, since R1 now claims R2's former zone (6)

Comment 9 Jianlin Shi 2020-12-24 08:29:38 UTC
Verified on rhel7 version:

:: [ 03:26:18 ] :: [  BEGIN   ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat'                                 
"1"                                                        
:: [ 03:26:18 ] :: [   PASS   ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat' (Expected 0, got 0)
:: [ 03:26:18 ] :: [  BEGIN   ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-50957a04-115d-46b2-8d71-3434ded93ded_snat'                     
"17"                                                                                                                                     
:: [ 03:26:18 ] :: [   PASS   ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-50957a04-115d-46b2-8d71-3434ded93ded_snat' (Expected 0, got 0)
:: [ 03:26:18 ] :: [  BEGIN   ] :: Running 'ovn-nbctl --wait=hv set Logical_Router R2 options:snat-ct-zone=123'                      
:: [ 03:26:18 ] :: [   PASS   ] :: Command 'ovn-nbctl --wait=hv set Logical_Router R2 options:snat-ct-zone=123' (Expected 0, got 0)
:: [ 03:26:18 ] :: [  BEGIN   ] :: Running 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat | grep 123'
"123"                                            
:: [ 03:26:18 ] :: [   PASS   ] :: Command 'ovs-vsctl get bridge br-int external-ids:ct-zone-0ab55c34-4b22-4e84-b927-f4aa7b8a7566_snat | grep 123' (Expected 0, got 0)
:: [ 03:26:18 ] :: [  BEGIN   ] :: Running 'ip netns exec alice1 ping -q 30.0.0.1 -c 1'                                                  
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.                                                                                     
                                                                                                                                                         
--- 30.0.0.1 ping statistics ---                                                                                                     
1 packets transmitted, 1 received, 0% packet loss, time 0ms                                                                                                          
rtt min/avg/max/mdev = 2.017/2.017/2.017/0.000 ms                                                                                                 
:: [ 03:26:18 ] :: [   PASS   ] :: Command 'ip netns exec alice1 ping -q 30.0.0.1 -c 1' (Expected 0, got 0)                                                          
:: [ 03:26:19 ] :: [  BEGIN   ] :: Running 'ssh -q wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ip netns exec bob1 ping -q 30.0.0.1 -c 1' 
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.                                      
                                                 
--- 30.0.0.1 ping statistics ---                                                                                                                         
1 packets transmitted, 1 received, 0% packet loss, time 0ms                                                               
rtt min/avg/max/mdev = 0.399/0.399/0.399/0.000 ms                                                                                             
:: [ 03:26:19 ] :: [   PASS   ] :: Command 'ssh -q wsfd-advnetlab19.anl.lab.eng.bos.redhat.com ip netns exec bob1 ping -q 30.0.0.1 -c 1' (Expected 0, got 0)
:: [ 03:26:19 ] :: [  BEGIN   ] :: Running 'ip netns exec alice1 ping6 -q 6010::1 -c 1'                                                       
PING 6010::1(6010::1) 56 data bytes                                                                         
                                                                                     
--- 6010::1 ping statistics ---                
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.596/2.596/2.596/0.000 ms

[root@wsfd-advnetlab16 nat]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-70.el7fdp.x86_64
openvswitch2.13-2.13.0-70.el7fdp.x86_64
ovn2.13-20.12.0-1.el7fdp.x86_64 
ovn2.13-host-20.12.0-1.el7fdp.x86_64
ovn2.13-central-20.12.0-1.el7fdp.x86_64

Comment 11 errata-xmlrpc 2021-02-03 21:55:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0407

