Description of problem: SCTP traffic in OVS is not leading to an established connection ct_state, even though in conntrack the connection is established. See https://bugzilla.redhat.com/show_bug.cgi?id=1718372 for more context. Version-Release number of selected component (if applicable): openvswitch-2.12.0-1.fc31.x86_64 While sending SCTP traffic from a client to a server, using ncat to open the sctp connection and keep it open I see via tcpdump the connection stays open with heart beats and heart beat ACKs. In conntrack I can see: sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=38982,dport=31769),reply=(src=10.244.0.5,dst=169.254.33.1,sport=62324,dport=38982),zone=9,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550) sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=38982,dport=31769),reply=(src=169.254.33.2,dst=169.254.33.1,sport=31769,dport=38982),protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550) sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=38982,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=38982),zone=8,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550) sctp,orig=(src=100.64.0.1,dst=10.244.0.5,sport=38982,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=38982),zone=15,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550) However in OVS dpctl the only flow I see matching sctp hit is: recirc_id(0x1c),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=169.254.33.2,proto=132,frag=no),sctp(dst=31769), packets:1, bytes:98, used:3.885s, actions:hash(l4(0)),recirc(0xfd) Notice the +new state match there, even though there is no new connection. Additionally I can see that with my flows in table 14: [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int |grep sctp |grep table=14 cookie=0x3bfb3d5b, duration=81976.388s, table=14, n_packets=0, n_bytes=0, idle_age=65535, priority=120,ct_state=+est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) cookie=0xd84870e, duration=81976.384s, table=14, n_packets=694, n_bytes=67940, idle_age=0, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,group:1 cookie=0x0, duration=62809.899s, table=14, n_packets=26, n_bytes=3520, idle_age=10326, priority=120,ct_state=-new-est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) The only flow incrementing is the +new flow: cookie=0xd84870e, duration=82037.396s, table=14, n_packets=697, n_bytes=68234, idle_age=3, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,group:1 However, the flow with -new-est+trk is required for the COOKIE and COOKIE ACK to work: cookie=0x0, duration=62870.911s, table=14, n_packets=26, n_bytes=3520, idle_age=10387, priority=120,ct_state=-new-est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) Something is wrong with how OVS is determining ct_state for SCTP conntrack sessions.
I tested with the following setup (no modifications to kernel or ovs): # using kernel 5.6.0-rc6+, ovs master built using make rpm-fedora and installed ip netns add left ip netns add right ip link add center-left type veth peer name left0 ip link add center-right type veth peer name right0 ip link set center-left up ip link set center-right up ip link set left0 netns left ip link set right0 netns right ip netns exec left ip addr add 172.31.110.1/30 dev left0 ip netns exec right ip addr add 172.31.110.2/30 dev right0 ip netns exec left ip link set left0 up ip netns exec right ip link set right0 up # just to ignore any possible selinux issues... setenforce Permissive systemctl start openvswitch systemctl start openvswitch ovs-vsctl add-br br0 -- set Bridge br0 fail-mode=secure ovs-vsctl add-port br0 center-left ovs-vsctl add-port br0 center-right ovs-ofctl add-flow br0 table=0,arp,action=NORMAL ovs-ofctl add-flow br0 'table=0,sctp,actions=ct(table=1)' ovs-ofctl add-flow br0 \ 'table=1,sctp,in_port=center-left,ct_state=+trk+new,actions=ct(commit),center-right' ovs-ofctl add-flow br0 \ 'table=1,sctp,in_port=center-right,ct_state=+rpl+trk,actions=center-left' ovs-ofctl add-flow br0 \ 'table=1,sctp,in_port=center-left,ct_state=+trk+est,actions=center-right' ovs-ofctl add-flow br0 \ 'table=1,sctp,in_port=center-right,ct_state=+trk+est,actions=center-left' # ensure arp is following action normal ip netns exec left arping 172.31.110.2 -I left0 # in one terminal ip netns exec right ncat --listen --sctp -vv # in another terminal ip netns exec left ncat --sctp 172.31.110.2 31337 Result: [root@wsfd-netdev92 ~]# ip netns exec right ncat --listen --sctp -vv Ncat: Version 7.70 ( https://nmap.org/ncat ) Ncat: Listening on :::31337 Ncat: Listening on 0.0.0.0:31337 Ncat: Connection from 172.31.110.1. Ncat: Connection from 172.31.110.1:34461. asdf fdsa fasdfsadf asdfasdfasdfda Seems I have bidirectional communications... It looks like you need the +rpl flow to match replies (which is what I would expect). Looking at the dpctl flows, I see the following display (periodically): recirc_id(0x2b),in_port(3),ct_state(-new+rpl+trk),eth(),eth_type(0x0800),ipv4(proto=132,frag=no), packets:1, bytes:98, used:4.310s, actions:2 recirc_id(0x2c),in_port(2),ct_state(-new+est-rpl+trk),eth(),eth_type(0x0800),ipv4(proto=132,frag=no), packets:1, bytes:98, used:4.314s, actions:3 And from dump-conntrack: sctp,orig=(src=172.31.110.1,dst=172.31.110.2,sport=34461,dport=31337),reply=(src=172.31.110.2,dst=172.31.110.1,sport=31337,dport=34461),protoinfo=(state=ESTABLISHED,vtag_orig=2708668805,vtag_reply=1149194430) I will retry with the rhel8 kernel, but I think this is a flow programming issue on OVN side.
I've retried with the rhel8 kernel, and the openvswitch2.12-2.12.0-22, still don't get any issues. I thought maybe the +rpl is what is enabling it, but even if I omit that, it works. I also thought maybe +inv+trk is catching things, but even with that, I can establish connection and pass data. cookie=0x0, duration=184.091s, table=0, n_packets=4, n_bytes=168, arp actions=NORMAL cookie=0x0, duration=180.106s, table=0, n_packets=28, n_bytes=2648, sctp actions=ct(table=1) cookie=0x0, duration=173.104s, table=1, n_packets=2, n_bytes=164, ct_state=+new+trk,sctp,in_port="center-left" actions=ct(commit),output:"center-right" cookie=0x0, duration=166.284s, table=1, n_packets=11, n_bytes=1106, ct_state=+est+trk,sctp,in_port="center-left" actions=output:"center-right" cookie=0x0, duration=140.508s, table=1, n_packets=15, n_bytes=1378, ct_state=+est+trk,sctp,in_port="center-right" actions=output:"center-left" cookie=0x0, duration=41.354s, table=1, n_packets=0, n_bytes=0, ct_state=+inv+trk,sctp actions=drop Not sure what to say at this point. There's definitely a problem with the openflow rules that OVN is installing, I think.
Finally, it might be good to check whether there is an SCTP offload or something else hardware-wise which is causing problems.
Changing ownership and component to Mark / OVN. Feel free to reassign if I missed something and you have a minimum set of flows that reproduces the issue.
Hi Aaron, thanks for the tests. The confusing thing from OVN's perspective is that the flows we install right now work fine for TCP, but they are foiled by SCTP. That's what gives the impression from our side that SCTP has some issues that TCP doesn't. Or at least, SCTP has differences that need to be accounted for in our installed flows. It's just not obvious what those differences are. We presumably could take the flows OVN produces and distill them down to just the set of flows that exhibit the problem. If we can provide those for you, then hopefully that would help.
I tried manually diverting traffic in the broken setup to new tables where I manually programmed flows. Everything seems to work fine. est matches and sctp client/server connection works: [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int | egrep 'table=(79|80|81|82|83)' cookie=0x0, duration=1931.902s, table=0, n_packets=70, n_bytes=9452, idle_age=15, priority=20000,sctp actions=ct(table=79) cookie=0x0, duration=4772.004s, table=79, n_packets=71, n_bytes=5822, idle_age=275, priority=30000,ct_state=+new+trk,sctp,nw_src=169.254.33.1,nw_dst=169.254.33.2 actions=ct(commit,table=80,zone=1337,nat(src=100.64.0.1)) cookie=0x0, duration=83.512s, table=79, n_packets=6, n_bytes=588, idle_age=15, priority=30000,ct_state=+new+trk,sctp,nw_src=10.244.0.5,nw_dst=100.64.0.1 actions=ct(table=81,zone=1338,nat) cookie=0x0, duration=358.013s, table=79, n_packets=16, n_bytes=1748, idle_age=15, priority=30000,ct_state=+est+trk,sctp,nw_src=169.254.33.1,nw_dst=169.254.33.2 actions=ct(table=81,zone=1337,nat) cookie=0x0, duration=10.401s, table=79, n_packets=0, n_bytes=0, idle_age=10, priority=30000,ct_state=+trk,sctp,nw_src=10.244.0.5,nw_dst=100.64.0.1 actions=ct(table=81,zone=1338,nat) cookie=0x0, duration=2204.114s, table=80, n_packets=29, n_bytes=2378, idle_age=275, ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=81,zone=1337,nat) cookie=0x0, duration=1444.205s, table=81, n_packets=11, n_bytes=902, idle_age=275, priority=10,ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=group:82 cookie=0x0, duration=627.910s, table=81, n_packets=17, n_bytes=2034, idle_age=15, priority=10,sctp,nw_dst=100.64.0.1 actions=ct(table=83,zone=1337,nat) cookie=0x0, duration=287.405s, table=81, n_packets=16, n_bytes=1748, idle_age=15, priority=10,ct_state=+est+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=83,zone=1338,nat) cookie=0x0, duration=1752.007s, table=82, n_packets=0, n_bytes=0, idle_age=1752, priority=20000,ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=83,zone=1338,nat) cookie=0x0, duration=848.908s, table=82, n_packets=2, n_bytes=164, idle_age=275, priority=1000,ct_state=+new+trk,sctp actions=ct(table=83,zone=1338,nat) cookie=0x0, duration=4209.814s, table=83, n_packets=60, n_bytes=5356, idle_age=15, priority=30000,sctp,nw_dst=10.244.0.5 actions=set_field:72:93:2f:f4:00:06->eth_dst,set_field:0a:58:64:40:00:01->eth_src,output:5 cookie=0x0, duration=3675.708s, table=83, n_packets=53, n_bytes=11706, idle_age=15, priority=10,sctp actions=set_field:00:00:a9:fe:21:01->eth_dst,set_field:9e:4d:9a:a7:67:46->eth_src,output:2 [root@ovn-control-plane ~]# [root@ovn-control-plane ~]# [root@ovn-control-plane ~]# [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-groups br-int |grep 82 group_id=82,type=select,bucket=bucket_id:0,weight:100,actions=ct(commit,table=82,zone=1338,nat(dst=10.244.0.5:62324)) So I think it must be something with OVN flows.
But then again if I add an inv ct match flow, everything works... [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 add-flow br-int 'table=14,priority=120,ct_state=-new+inv+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291,actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)' [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int |grep table=14 |grep sctp cookie=0x7ee7a259, duration=80.031s, table=14, n_packets=0, n_bytes=0, idle_age=80, priority=120,ct_state=+est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) cookie=0xee91972c, duration=80.031s, table=14, n_packets=3, n_bytes=278, idle_age=8, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,group:1 cookie=0x0, duration=49.400s, table=14, n_packets=4, n_bytes=464, idle_age=43, priority=120,ct_state=-new+inv+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)
(In reply to Tim Rozet from comment #7) > But then again if I add an inv ct match flow, everything works... Can you capture these packets that match +inv? Maybe by mirroring them to another port or so. Another idea is to enable netfilter debugs, net.netfilter.nf_conntrack_log_invalid. how-to: https://access.redhat.com/solutions/796763
Here is the capture: https://www.cloudshark.org/captures/db7fb831dab2 I think I may see what is wrong. The ct matching invalid means Invalid (inv): The connection tracker couldn't identify the connection. The flow before table 14 is in table 13: Flow: recirc_id=0x185,ct_state=est|trk,ct_zone=9,eth,sctp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31790 bridge("br-int") ---------------- thaw Resuming from table 13 13. ip,metadata=0x3,nw_dst=169.254.33.2, priority 90, cookie 0x85a8499 ct(table=14,zone=NXM_NX_REG12[0..15],nat) nat This is sending to zone in reg12, which is 6. In OVN I can see that CT zone is SNAT: "ct-zone-d13ee111-a5a1-4bfd-87ce-23bc1dac34c1_snat"="6" This means that packets from 169.254.33.1 -> 169.254.33.2 should be SNAT to the router ip 100.64.0.1. However I see in the conntrack table: [root@ovn-control-plane ~]# ovs-appctl dpctl/dump-conntrack |grep sctp sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=43357,dport=31291),reply=(src=10.244.0.5,dst=169.254.33.1,sport=62324,dport=43357),zone=9,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573) sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=43357,dport=31291),reply=(src=169.254.33.2,dst=169.254.33.1,sport=31291,dport=43357),protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573) sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=6,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573) sctp,orig=(src=100.64.0.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=15,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573) This entry does not look right: sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=6,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573) This looks wrong to me because this session indicates that the DNAT has already occurred, which it has not. That will happen in table 14. So I don't see how CT matches here, and thus may now consider the packet invalid. I'll also try enabling the logging.
(In reply to Tim Rozet from comment #9) > Here is the capture: > https://www.cloudshark.org/captures/db7fb831dab2 All these matched +inv!? O.O Btw, seems the original packets from server always have sctp crc32 zeroed out, but then the NATed one has it filled in, with the exception of COOKIE_ACK (frame 8) and HEARTBEAT_ACK (frame 16). This flea is still itching behind my ears...
So I think this is definitely the problem. The zones are inverse as I mentioned in comment 9. If I add my own flows to fix it: [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=13 |grep sctp cookie=0x0, duration=1013.896s, table=13, n_packets=33, n_bytes=4490, idle_age=9, priority=95,sctp,metadata=0x3,nw_dst=169.254.33.2 actions=ct(table=14,zone=NXM_NX_REG11[0..15],nat) [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=14 |grep REG cookie=0x0, duration=44.396s, table=14, n_packets=3, n_bytes=474, idle_age=4, priority=125,ct_state=+est+trk,sctp,metadata=0x3,nw_src=169.254.33.1 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG12[0..15],nat) Notice now zone=REG11 (zone 9, DNAT) is happening first in table 13, and then in table 14 +est now matches, and then I do nat on zone=REG12 (zone 6, SNAT). The connection then works fine. I think we can close this bug and I'll update: https://bugzilla.redhat.com/show_bug.cgi?id=1718372
Actually I think we should keep this bug open and track this separately in OVN. It looks to be a bug with TCP as well, just somehow TCP traffic still works. Look at this trace: =============================================================================== recirc(0x867) - resume conntrack with ct_state=est|trk =============================================================================== Flow: recirc_id=0x867,ct_state=est|trk,ct_zone=9,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0 bridge("br-int") ---------------- thaw Resuming from table 13 13. ip,metadata=0x3,nw_dst=169.254.33.2, priority 90, cookie 0x85a8499 ct(table=14,zone=NXM_NX_REG12[0..15],nat) nat -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 14. -> Sets the packet to an untracked state, and clears all the conntrack fields. Final flow: recirc_id=0x867,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0 Megaflow: recirc_id=0x867,eth,tcp,in_port=2,nw_dst=169.254.33.2,nw_frag=no Datapath actions: ct(zone=6,nat),recirc(0x868) =============================================================================== recirc(0x868) - resume conntrack with ct_state=est|trk =============================================================================== Flow: recirc_id=0x868,ct_state=est|trk,ct_zone=6,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0 bridge("br-int") ---------------- thaw Resuming from table 14 14. ct_state=+est+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586, priority 120, cookie 0xaa1ed9ed load:0x1->NXM_NX_REG10[3] ct(table=15,zone=NXM_NX_REG11[0..15],nat) nat -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 15. -> Sets the packet to an untracked state, and clears all the conntrack fields. Final flow: recirc_id=0x868,eth,tcp,reg9=0x8,reg10=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0 Megaflow: recirc_id=0x868,ct_state=-new+est-rel-rpl-inv+trk,ct_label=0/0x1,eth,tcp,in_port=2,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_frag=no,tp_dst=31586 Datapath actions: ct(zone=9,nat),recirc(0x869) I can see the flow in table 14 never gets hit for est traffic either with TCP: [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=13 |grep 0x85a8499 cookie=0x85a8499, duration=9084.977s, table=13, n_packets=22838, n_bytes=11380232, idle_age=1, priority=90,ip,metadata=0x3,nw_dst=169.254.33.2 actions=ct(table=14,zone=NXM_NX_REG12[0..15],nat) [root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=14 |grep 0xaa1ed9ed cookie=0xaa1ed9ed, duration=1282.030s, table=14, n_packets=0, n_bytes=0, idle_age=1282, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) Then adding an invalid flow it matches... cookie=0x0, duration=13.198s, table=14, n_packets=1, n_bytes=66, idle_age=8, priority=120,ct_state=-new+inv+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)
BTW, here's a relevant comment from the OVN northd code: /* Ingress UNSNAT table: It is for already established connections' * reverse traffic. i.e., SNAT has already been done in egress * pipeline and now the packet has entered the ingress pipeline as * part of a reply. We undo the SNAT here. * * Undoing SNAT has to happen before DNAT processing. This is * because when the packet was DNATed in ingress pipeline, it did * not know about the possibility of eventual additional SNAT in * egress pipeline. */ To translate, UNSNAT is performed in table 13, and DNAT is performed in table 14. The comment makes sense. Consider that SNAT and DNAT are configured on a gateway router. When initial external traffic comes into the router, it gets DNATted. Then the reply traffic gets UNDNATted (table 40) and then SNATted (table 41) on the way out. Now as future external traffic arrives at the router, it needs to get UNSNATted first because the DNAT was initially created with no knowledge of the SNAT that would later be applied. In practice, something is fishy here. You're saying you have to DNAT first in order for conntrack to be happy. My hunch is that the reply traffic is not getting UNDNATted in table 40. Tim, could you dump your OpenFlow tables 40 and 41 so we can see what's happening there?
Sure the relevant table 41 flows are: cookie=0x54b52682, duration=73327.780s, table=41, n_packets=412, n_bytes=40462, priority=100,ip,reg10=0x8/0x8,metadata=0x3 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=100.64.0.1)) cookie=0x54b52682, duration=73327.776s, table=41, n_packets=0, n_bytes=0, priority=100,ipv6,reg10=0x8/0x8,metadata=0x3 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=100.64.0.1)) The above is effectively the same result. The key part of that comment from northd is the 2nd paragraph. The SNAT would happen in the egress pipeline. If you ignore table 13 (incorrectly matching), table 14 would do DNAT, then table 41 in egress pipeline would do SNAT, which is the the order I changed things so that it would work. I just did it in table 13, when table 13 flow should really not get hit at all (UNSNAT). The problem is, nothing should be being UNSNAT, this traffic is destined to be NAT'ed. I think the lflow Numan said corresponded is just matching the packet on accident: table=5 (lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 169.254.33.2), action=(ct_snat;) The traffic flow here is: 169.254.33.1 -------> 169.254.33.3 - router - 100.64.0.1 ------------->10.244.0.5 client server packet destined from client -> server is not NAT'ed yet, and accidentally matches the UNSNAT flow in table 13, except it matches no session in CT, and gets marked as invalid (because the faulty match is trying to SNAT before DNAT)
I meant in the last comment router ip is 169.254.33.2*
I found the fix for this issue. The reason this happens is because - First dnat happens - Then snat happens because of lb_force_snat_ip is set on the gateway router. The first SCTP INIT and SCTP ACK goes fine without any issues. When SCTP COOKIE ECHO is sent, the below lflow is hit, since in this issue, the dnat_and_snat entry is also used as LB VIP table=5 (lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 169.254.33.2), action=(ct_snat;) The above lflow causes, the destination ip to be natted to 10.244.0.5. And this causes the below lflow to miss. table=6 (lr_in_dnat ), priority=120 , match=(ct.est && ip && ip4.dst == 169.254.33.2 && sctp && sctp.dst == 31291), action=(flags.force_snat_for_lb = 1; ct_dnat;) Since lags.force_snat_for_lb is not set, the snat of the src ip doesn't happen. Adding the below lflows fixes this issue table=5 (lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 169.254.33.2 && sctp.dst == 31291), action=(next;) https://github.com/numansiddique/ovn/commit/81b0474606e3ef66f4cba903e0a84ec710f89e38
The fix works. Can we get that and Mark's patch merged upstream with a new F31 build? Then I can get it to pass in ovn-k8s CI. Thanks.
Thanks Numan! I think we should get your patch added to my patch and make it a single patch series for review upstream.
Mark - I think the issue is seen even for TCP. So I submitted the fix here - https://patchwork.ozlabs.org/patch/1261948/ Can you please rebase your patch on top of this and test it out ? To summarize, I don't think this is an OVN issue. But we have a way to avoid this and the patch I submitted does that. But looks like there is an issue in ovs-vswitchd. I've pasted the commit message below which has some details ***** Suppose there is below NAT entry with external_ip = 172.168.0.100 nat <UUID> external ip: "172.168.0.100" logical ip: "10.0.0.0/24" type: "snat" And a load balancer with the VIP - 172.168.0.100 _uuid : <UUID> external_ids : {} name : lb1 protocol : tcp vips : {"172.168.0.100:8080"="10.0.0.4:8080"} And if these are associated to a gateway logical router Then we will see the below lflows in the router pipeline ... table=5 (lr_in_unsnat ), priority=90 , match=(ip && ip4.dst == 172.168.0.100), action=(ct_snat;) ... table=6 (lr_in_dnat ), priority=120 , match=(ct.new && ip && ip4.dst == 172.168.0.100 && tcp && tcp.dst == 8080), action=(ct_lb(10.0.0.4:8080);) table=6 (lr_in_dnat ), priority=120 , match=(ct.est && ip && ip4.dst == 172.168.0.100 && tcp && tcp.dst == 8080), action=(ct_dnat;) When a new connection packet destinated for the lb vip 172.168.0.100 and tcp.dst = 8080 is received, the ct.new flow in the lr_in_dnat is hit and the packet's ip4.dst is dnatted to 10.0.0.4 in the dnat conntrack zone. But for the subsequent packet destined to the vip, the ct.est lflow in the lr_in_dnat stage doesn't get hit. In this case, the packet first hits the lr_in_unsnat pri 90 flow as mentioned above with the action ct_snat. Even though ct_snat should have no effect, but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags are not set properly. In the case of tcp, the ct.new flow is hit instead of ct.est. In the the case of sctp, neither of the above lflows in lr_in_dnat stage hit. This needs to be investigated further. But we can avoid this scenario in OVN by adding the below lflow. table=5 (lr_in_unsnat ), priority=120 , match=(ip4 && ip4.dst == 172.168.0.100 && tcp.dst == 8080), action=(next;) ****
"Even though ct_snat should have no effect, but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags are not set properly." I think it's debatable about whether that should have no effect. However, it definitely should have a consistent effect. In TCP's case, it appears to reset the ct flags. In SCTP's case, it sets the ct state to invalid. I think the SCTP behavior makes sense. If we try to ct_snat but there is no conntrack entry, then we're trying something invalid. I'll save patch comments for the ML. Thanks Numan!
IMO this is incorrect: "Even though ct_snat should have no effect, but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags are not set properly." It most certainly has an effect. Look at comment https://bugzilla.redhat.com/show_bug.cgi?id=1815217#c9 By sending to CT where there is no match, the packet is marked as CT invalid, which will not match the +est flow. This is proper behavior of OVS.
"In TCP's case, it appears to reset the ct flags." It also sets the ct state to invalid, see: cookie=0x0, duration=13.198s, table=14, n_packets=1, n_bytes=66, idle_age=8, priority=120,ct_state=-new+inv+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat) and comment: https://bugzilla.redhat.com/show_bug.cgi?id=1815217#c12
Ah, OK, I misinterpreted this part of the commit message: "In the case of tcp, the ct.new flow is hit instead of ct.est" I suppose both are setting the invalid flag, but TCP is also setting/keeping the new flag as well.
Wait, but your matching flow for table 14 had -new+inv, which indicates that TCP doesn't have the new flag set. Is the commit message just incorrect?
It's the same issue for both protocols. TCP just happens to still make it to the endpoint because of some other flows in the pipeline handling it accidentally. I didn't bother to figure out what that path is, as I don't see a need to.
Do we need this in downstream OVN 2.12 ? How about we backport this to only OVN 2.13 ?
set the env: client: ovs-vsctl add-br br-provider ovs-vsctl add-port br-provider $nic_test2 ip link set $nic_test2 up ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider ip link set br-provider up ip netns add client1 ip link add veth0_c1_p type veth peer name veth0_c1 netns client1 ip link set veth0_c1_p up ovs-vsctl add-port br-provider veth0_c1_p ip netns exec client1 ip link set lo up ip netns exec client1 ip link set veth0_c1 up ip netns exec client1 ip addr add 172.16.1.2/24 dev veth0_c1 ip netns exec client1 ip addr add 2002::2/64 dev veth0_c1 server: ovs-vsctl add-br br-provider ovs-vsctl add-port br-provider $nic_test2 ip link set $nic_test2 up ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider ip link set br-provider up ip netns add client0 ip link add veth0_c0_p type veth peer name veth0_c0 netns client0 ip link set veth0_c0_p up ovs-vsctl add-port br-provider veth0_c0_p ip netns exec client0 ip link set lo up ip netns exec client0 ip link set veth0_c0 up ip netns exec client0 ip addr add 172.16.1.1/24 dev veth0_c0 ip netns exec client0 ip addr add 2002::1/64 dev veth0_c0 ovn-nbctl ls-add ls1 ovn-nbctl lsp-add ls1 ls1p1 ovn-nbctl lsp-set-addresses ls1p1 00:01:02:01:01:01 ovn-nbctl lsp-add ls1 ls1p2 ovn-nbctl lsp-set-addresses ls1p2 00:01:02:01:01:02 ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal ip netns add server0 ip link set vm1 netns server0 ip netns exec server0 ip link set lo up ip netns exec server0 ip link set vm1 up ip netns exec server0 ip link set vm1 address 00:01:02:01:01:01 ip netns exec server0 ip addr add 192.168.0.1/24 dev vm1 ip netns exec server0 ip addr add 3001::1/64 dev vm1 ip netns exec server0 ip route add default via 192.168.0.254 dev vm1 ip netns exec server0 ip -6 route add default via 3001::a dev vm1 ovs-vsctl set Interface vm1 external_ids:iface-id=ls1p1 ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal ip netns add server1 ip link set vm2 netns server1 ip netns exec server1 ip link set lo up ip netns exec server1 ip link set vm2 up ip netns exec server1 ip link set vm2 address 00:01:02:01:01:02 ip netns exec server1 ip addr add 192.168.0.2/24 dev vm2 ip netns exec server1 ip addr add 3001::2/64 dev vm2 ip netns exec server1 ip route add default via 192.168.0.254 dev vm2 ip netns exec server1 ip -6 route add default via 3001::a dev vm2 ovs-vsctl set Interface vm2 external_ids:iface-id=ls1p2 ovn-nbctl lr-add lr1 ovn-nbctl lrp-add lr1 lr1ls1 00:01:02:0d:01:01 192.168.0.254/24 3001::a/64 ovn-nbctl lsp-add ls1 ls1lr1 ovn-nbctl lsp-set-type ls1lr1 router ovn-nbctl lsp-set-options ls1lr1 router-port=lr1ls1 ovn-nbctl lsp-set-addresses ls1lr1 00:01:02:0d:01:01 192.168.0.254 3001::a #add sctp lb with fip ovn-nbctl ls-add public ovn-nbctl lsp-add public plr1 ovn-nbctl lsp-set-type plr1 router ovn-nbctl lsp-set-options plr1 router-port=lr1p ovn-nbctl lsp-set-addresses plr1 00:01:02:0d:0f:01 172.16.1.254 2002::a ovn-nbctl lsp-add public ln_public ovn-nbctl lsp-set-type ln_public localnet ovn-nbctl lsp-set-addresses ln_public unknown ovn-nbctl lsp-set-options ln_public network_name=provider ovn-nbctl lrp-set-gateway-chassis lr1p hv1 20 ovn-nbctl lrp-set-gateway-chassis lr1p hv0 10 ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.10 192.168.2.1 ovn-nbctl lr-nat-add lr1 dnat_and_snat 2002::100 3000::100 ip netns exec server0 ping 192.168.0.2 -c 3 ovn-nbctl show ovn-sbctl show sleep 10 ovn-nbctl lb-add lb1 192.168.2.1:80 192.168.0.1:80,192.168.0.2:80 ovn-nbctl lb-add lb1 172.16.1.10:80 192.168.0.1:80,192.168.0.2:80 ovn-nbctl lb-add lb1 [3000::100]:80 [3001::1]:80,[3001::2]:80 ovn-nbctl lb-add lb1 [2002::100]:80 [3001::1]:80,[3001::2]:80 ovn-nbctl lr-lb-add lr1 lb1 uuid=`ovn-nbctl list Load_Balancer |grep uuid|awk '{printf $3}'` ovn-nbctl set Load_Balancer $uuid protocol=tcp I reproduced this bug on version: # rpm -qa|grep ovn ovn2.13-host-2.13.0-7.el8fdp.x86_64 ovn2.13-central-2.13.0-7.el8fdp.x86_64 ovn2.13-2.13.0-7.el8fdp.x86_64 start ncat server on server: ip netns exec server0 nc -l 80 -k & ip netns exec server1 nc -l 80 -k & start ncat client on client: ip netns exec client1 nc --sctp 172.16.1.10 80& then,on server, ovs-ofctl -O openflow15 dump-flows br-int |grep tcp |grep table=14 |grep 172.16.1.10 cookie=0x8824cf34, duration=2727.250s, table=14, n_packets=0, n_bytes=0, idle_age=2727, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=ct(table=15,zone=NXM_NX_REG11[0..15],nat) cookie=0xccb79a1d, duration=2727.251s, table=14, n_packets=5, n_bytes=346, idle_age=459, priority=120,ct_state=+new+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=group:1 in the first flow,ct_state=+est+trk,and the n_packets=0,the packet didn't increase correctly. all packets increase to the second flow verified on version: # rpm -qa|grep ovn ovn2.13-host-2.13.0-11.el8fdp.x86_64 ovn2.13-central-2.13.0-11.el8fdp.x86_64 ovn2.13-2.13.0-11.el8fdp.x86_64 # ovs-ofctl -O openflow15 dump-flows br-int |grep tcp |grep table=14 |grep 172.16.1.10 cookie=0x4ac41557, duration=366.568s, table=14, n_packets=3, n_bytes=198, idle_age=20, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=ct(table=15,zone=NXM_NX_REG11[0..15],nat) cookie=0x68475f08, duration=366.568s, table=14, n_packets=2, n_bytes=148, idle_age=20, priority=120,ct_state=+new+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=group:2 here we can see that in the first flow ct_state=+est+trk,the packets counts increase correctly.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1501