Bug 1815217 - SCTP/TCP traffic in OVS never gets "est" ct_state if the load balancer VIP has an entry in NAT table
Summary: SCTP/TCP traffic in OVS never gets "est" ct_state if the load balancer VIP ha...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
: ---
Assignee: Numan Siddique
QA Contact: ying xu
URL:
Whiteboard:
Depends On:
Blocks: 1771572 1718372
TreeView+ depends on / blocked
 
Reported: 2020-03-19 19:10 UTC by Tim Rozet
Modified: 2020-04-20 19:43 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-20 19:43:23 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:1501 None None None 2020-04-20 19:43:43 UTC

Description Tim Rozet 2020-03-19 19:10:56 UTC
Description of problem:
SCTP traffic in OVS is not leading to an established connection ct_state, even though in conntrack the connection is established. See
https://bugzilla.redhat.com/show_bug.cgi?id=1718372

for more context.


Version-Release number of selected component (if applicable):
openvswitch-2.12.0-1.fc31.x86_64


While sending SCTP traffic from a client to a server, using ncat to open the sctp connection and keep it open I see via tcpdump the connection stays open with heart beats and heart beat ACKs. In conntrack I can see:

sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=38982,dport=31769),reply=(src=10.244.0.5,dst=169.254.33.1,sport=62324,dport=38982),zone=9,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550)
sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=38982,dport=31769),reply=(src=169.254.33.2,dst=169.254.33.1,sport=31769,dport=38982),protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550)
sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=38982,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=38982),zone=8,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550)
sctp,orig=(src=100.64.0.1,dst=10.244.0.5,sport=38982,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=38982),zone=15,protoinfo=(state=ESTABLISHED,vtag_orig=3615038536,vtag_reply=554870550)

However in OVS dpctl the only flow I see matching sctp hit is:
recirc_id(0x1c),in_port(3),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(),eth_type(0x0800),ipv4(dst=169.254.33.2,proto=132,frag=no),sctp(dst=31769), packets:1, bytes:98, used:3.885s, actions:hash(l4(0)),recirc(0xfd) 

Notice the +new state match there, even though there is no new connection.

Additionally I can see that with my flows in table 14:

[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int |grep sctp |grep table=14
 cookie=0x3bfb3d5b, duration=81976.388s, table=14, n_packets=0, n_bytes=0, idle_age=65535, priority=120,ct_state=+est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)
 cookie=0xd84870e, duration=81976.384s, table=14, n_packets=694, n_bytes=67940, idle_age=0, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,group:1
 cookie=0x0, duration=62809.899s, table=14, n_packets=26, n_bytes=3520, idle_age=10326, priority=120,ct_state=-new-est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)

The only flow incrementing is the +new flow:

cookie=0xd84870e, duration=82037.396s, table=14, n_packets=697, n_bytes=68234, idle_age=3, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,group:1

However, the flow with -new-est+trk is required for the COOKIE and COOKIE ACK to work:
cookie=0x0, duration=62870.911s, table=14, n_packets=26, n_bytes=3520, idle_age=10387, priority=120,ct_state=-new-est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31769 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)

Something is wrong with how OVS is determining ct_state for SCTP conntrack sessions.

Comment 1 Aaron Conole 2020-03-20 18:40:42 UTC
I tested with the following setup (no modifications to kernel or ovs):

# using kernel 5.6.0-rc6+, ovs master built using make rpm-fedora and installed

ip netns add left
ip netns add right
ip link add center-left type veth peer name left0
ip link add center-right type veth peer name right0
ip link set center-left up
ip link set center-right up
ip link set left0 netns left
ip link set right0 netns right
ip netns exec left ip addr add 172.31.110.1/30 dev left0
ip netns exec right ip addr add 172.31.110.2/30 dev right0
ip netns exec left ip link set left0 up
ip netns exec right ip link set right0 up

# just to ignore any possible selinux issues...
setenforce Permissive
systemctl start openvswitch

systemctl start openvswitch
ovs-vsctl add-br br0 -- set Bridge br0 fail-mode=secure
ovs-vsctl add-port br0 center-left
ovs-vsctl add-port br0 center-right
ovs-ofctl add-flow br0 table=0,arp,action=NORMAL

ovs-ofctl add-flow br0 'table=0,sctp,actions=ct(table=1)'
ovs-ofctl add-flow br0 \
  'table=1,sctp,in_port=center-left,ct_state=+trk+new,actions=ct(commit),center-right'
ovs-ofctl add-flow br0 \
  'table=1,sctp,in_port=center-right,ct_state=+rpl+trk,actions=center-left'
ovs-ofctl add-flow br0 \
  'table=1,sctp,in_port=center-left,ct_state=+trk+est,actions=center-right'
ovs-ofctl add-flow br0 \
  'table=1,sctp,in_port=center-right,ct_state=+trk+est,actions=center-left'

# ensure arp is following action normal
ip netns exec left arping 172.31.110.2 -I left0

# in one terminal
ip netns exec right ncat --listen --sctp -vv

# in another terminal
ip netns exec left ncat --sctp 172.31.110.2 31337

Result:

[root@wsfd-netdev92 ~]# ip netns exec right ncat --listen --sctp -vv
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Listening on :::31337
Ncat: Listening on 0.0.0.0:31337
Ncat: Connection from 172.31.110.1.
Ncat: Connection from 172.31.110.1:34461.
asdf
fdsa
fasdfsadf
asdfasdfasdfda


Seems I have bidirectional communications... It looks like you need the
+rpl flow to match replies (which is what I would expect).

Looking at the dpctl flows, I see the following display (periodically):
recirc_id(0x2b),in_port(3),ct_state(-new+rpl+trk),eth(),eth_type(0x0800),ipv4(proto=132,frag=no),
packets:1, bytes:98, used:4.310s, actions:2
recirc_id(0x2c),in_port(2),ct_state(-new+est-rpl+trk),eth(),eth_type(0x0800),ipv4(proto=132,frag=no),
packets:1, bytes:98, used:4.314s, actions:3

And from dump-conntrack:
sctp,orig=(src=172.31.110.1,dst=172.31.110.2,sport=34461,dport=31337),reply=(src=172.31.110.2,dst=172.31.110.1,sport=31337,dport=34461),protoinfo=(state=ESTABLISHED,vtag_orig=2708668805,vtag_reply=1149194430)


I will retry with the rhel8 kernel, but I think this is a flow programming issue on OVN side.

Comment 2 Aaron Conole 2020-03-20 19:16:16 UTC
I've retried with the rhel8 kernel, and the openvswitch2.12-2.12.0-22, still don't get any issues.  

I thought maybe the +rpl is what is enabling it, but even if I omit that, it works.
I also thought maybe +inv+trk is catching things, but even with that, I can establish connection and pass data.

   cookie=0x0, duration=184.091s, table=0, n_packets=4, n_bytes=168, arp actions=NORMAL
   cookie=0x0, duration=180.106s, table=0, n_packets=28, n_bytes=2648, sctp actions=ct(table=1)
   cookie=0x0, duration=173.104s, table=1, n_packets=2, n_bytes=164, ct_state=+new+trk,sctp,in_port="center-left" actions=ct(commit),output:"center-right"
   cookie=0x0, duration=166.284s, table=1, n_packets=11, n_bytes=1106, ct_state=+est+trk,sctp,in_port="center-left" actions=output:"center-right"
   cookie=0x0, duration=140.508s, table=1, n_packets=15, n_bytes=1378, ct_state=+est+trk,sctp,in_port="center-right" actions=output:"center-left"
   cookie=0x0, duration=41.354s, table=1, n_packets=0, n_bytes=0, ct_state=+inv+trk,sctp actions=drop

Not sure what to say at this point.  There's definitely a problem with the openflow rules that OVN is installing, I think.

Comment 3 Aaron Conole 2020-03-20 19:29:50 UTC
Finally, it might be good to check whether there is an SCTP offload or something else hardware-wise which is causing problems.

Comment 4 Aaron Conole 2020-03-20 19:35:19 UTC
Changing ownership and component to Mark / OVN.  Feel free to reassign if I missed something and you have a minimum set of flows that reproduces the issue.

Comment 5 Mark Michelson 2020-03-24 13:54:43 UTC
Hi Aaron,

thanks for the tests. The confusing thing from OVN's perspective is that the flows we install right now work fine for TCP, but they are foiled by SCTP. That's what gives the impression from our side that SCTP has some issues that TCP doesn't. Or at least, SCTP has differences that need to be accounted for in our installed flows. It's just not obvious what those differences are.

We presumably could take the flows OVN produces and distill them down to just the set of flows that exhibit the problem. If we can provide those for you, then hopefully that would help.

Comment 6 Tim Rozet 2020-03-24 16:45:38 UTC
I tried manually diverting traffic in the broken setup to new tables where I manually programmed flows. Everything seems to work fine. est matches and sctp client/server connection works:

[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int | egrep 'table=(79|80|81|82|83)'
 cookie=0x0, duration=1931.902s, table=0, n_packets=70, n_bytes=9452, idle_age=15, priority=20000,sctp actions=ct(table=79)
 cookie=0x0, duration=4772.004s, table=79, n_packets=71, n_bytes=5822, idle_age=275, priority=30000,ct_state=+new+trk,sctp,nw_src=169.254.33.1,nw_dst=169.254.33.2 actions=ct(commit,table=80,zone=1337,nat(src=100.64.0.1))
 cookie=0x0, duration=83.512s, table=79, n_packets=6, n_bytes=588, idle_age=15, priority=30000,ct_state=+new+trk,sctp,nw_src=10.244.0.5,nw_dst=100.64.0.1 actions=ct(table=81,zone=1338,nat)
 cookie=0x0, duration=358.013s, table=79, n_packets=16, n_bytes=1748, idle_age=15, priority=30000,ct_state=+est+trk,sctp,nw_src=169.254.33.1,nw_dst=169.254.33.2 actions=ct(table=81,zone=1337,nat)
 cookie=0x0, duration=10.401s, table=79, n_packets=0, n_bytes=0, idle_age=10, priority=30000,ct_state=+trk,sctp,nw_src=10.244.0.5,nw_dst=100.64.0.1 actions=ct(table=81,zone=1338,nat)
 cookie=0x0, duration=2204.114s, table=80, n_packets=29, n_bytes=2378, idle_age=275, ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=81,zone=1337,nat)
 cookie=0x0, duration=1444.205s, table=81, n_packets=11, n_bytes=902, idle_age=275, priority=10,ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=group:82
 cookie=0x0, duration=627.910s, table=81, n_packets=17, n_bytes=2034, idle_age=15, priority=10,sctp,nw_dst=100.64.0.1 actions=ct(table=83,zone=1337,nat)
 cookie=0x0, duration=287.405s, table=81, n_packets=16, n_bytes=1748, idle_age=15, priority=10,ct_state=+est+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=83,zone=1338,nat)
 cookie=0x0, duration=1752.007s, table=82, n_packets=0, n_bytes=0, idle_age=1752, priority=20000,ct_state=+new+trk,sctp,nw_dst=169.254.33.2 actions=ct(table=83,zone=1338,nat)
 cookie=0x0, duration=848.908s, table=82, n_packets=2, n_bytes=164, idle_age=275, priority=1000,ct_state=+new+trk,sctp actions=ct(table=83,zone=1338,nat)
 cookie=0x0, duration=4209.814s, table=83, n_packets=60, n_bytes=5356, idle_age=15, priority=30000,sctp,nw_dst=10.244.0.5 actions=set_field:72:93:2f:f4:00:06->eth_dst,set_field:0a:58:64:40:00:01->eth_src,output:5
 cookie=0x0, duration=3675.708s, table=83, n_packets=53, n_bytes=11706, idle_age=15, priority=10,sctp actions=set_field:00:00:a9:fe:21:01->eth_dst,set_field:9e:4d:9a:a7:67:46->eth_src,output:2
[root@ovn-control-plane ~]# 
[root@ovn-control-plane ~]# 
[root@ovn-control-plane ~]# 
[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-groups br-int |grep 82
 group_id=82,type=select,bucket=bucket_id:0,weight:100,actions=ct(commit,table=82,zone=1338,nat(dst=10.244.0.5:62324))


So I think it must be something with OVN flows.

Comment 7 Tim Rozet 2020-03-24 17:29:26 UTC
But then again if I add an inv ct match flow, everything works...

[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 add-flow br-int 'table=14,priority=120,ct_state=-new+inv+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291,actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)'


[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int |grep table=14 |grep sctp
 cookie=0x7ee7a259, duration=80.031s, table=14, n_packets=0, n_bytes=0, idle_age=80, priority=120,ct_state=+est+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)
 cookie=0xee91972c, duration=80.031s, table=14, n_packets=3, n_bytes=278, idle_age=8, priority=120,ct_state=+new+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,group:1
 cookie=0x0, duration=49.400s, table=14, n_packets=4, n_bytes=464, idle_age=43, priority=120,ct_state=-new+inv+trk,sctp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31291 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)

Comment 8 Marcelo Ricardo Leitner 2020-03-24 17:36:48 UTC
(In reply to Tim Rozet from comment #7)
> But then again if I add an inv ct match flow, everything works...

Can you capture these packets that match +inv? Maybe by mirroring them to another port or so.
Another idea is to enable netfilter debugs, net.netfilter.nf_conntrack_log_invalid.
  how-to: https://access.redhat.com/solutions/796763

Comment 9 Tim Rozet 2020-03-24 18:04:06 UTC
Here is the capture:
https://www.cloudshark.org/captures/db7fb831dab2

I think I may see what is wrong. The ct matching invalid means  Invalid (inv): The connection tracker couldn't identify the connection. The flow before table 14 is in table 13:


Flow: recirc_id=0x185,ct_state=est|trk,ct_zone=9,eth,sctp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31790

bridge("br-int")
----------------
    thaw
        Resuming from table 13
13. ip,metadata=0x3,nw_dst=169.254.33.2, priority 90, cookie 0x85a8499
    ct(table=14,zone=NXM_NX_REG12[0..15],nat)
    nat


This is sending to zone in reg12, which is 6. In OVN I can see that CT zone is SNAT: "ct-zone-d13ee111-a5a1-4bfd-87ce-23bc1dac34c1_snat"="6"

This means that packets from 169.254.33.1 -> 169.254.33.2 should be SNAT to the router ip 100.64.0.1. However I see in the conntrack table:

[root@ovn-control-plane ~]# ovs-appctl dpctl/dump-conntrack |grep sctp
sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=43357,dport=31291),reply=(src=10.244.0.5,dst=169.254.33.1,sport=62324,dport=43357),zone=9,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573)
sctp,orig=(src=169.254.33.1,dst=169.254.33.2,sport=43357,dport=31291),reply=(src=169.254.33.2,dst=169.254.33.1,sport=31291,dport=43357),protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573)
sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=6,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573)
sctp,orig=(src=100.64.0.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=15,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573)


This entry does not look right:
sctp,orig=(src=169.254.33.1,dst=10.244.0.5,sport=43357,dport=62324),reply=(src=10.244.0.5,dst=100.64.0.1,sport=62324,dport=43357),zone=6,protoinfo=(state=ESTABLISHED,vtag_orig=924887357,vtag_reply=544223573)

This looks wrong to me because this session indicates that the DNAT has already occurred, which it has not. That will happen in table 14. So I don't see how CT matches here, and thus may now consider the packet invalid.

I'll also try enabling the logging.

Comment 10 Marcelo Ricardo Leitner 2020-03-24 18:36:02 UTC
(In reply to Tim Rozet from comment #9)
> Here is the capture:
> https://www.cloudshark.org/captures/db7fb831dab2

All these matched +inv!? O.O

Btw, seems the original packets from server always have sctp crc32 zeroed out, but then the NATed one has it filled in, with the exception of COOKIE_ACK (frame 8) and HEARTBEAT_ACK (frame 16). This flea is still itching behind my ears...

Comment 11 Tim Rozet 2020-03-24 18:48:27 UTC
So I think this is definitely the problem. The zones are inverse as I mentioned in comment 9. If I add my own flows to fix it:


[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=13 |grep sctp
 cookie=0x0, duration=1013.896s, table=13, n_packets=33, n_bytes=4490, idle_age=9, priority=95,sctp,metadata=0x3,nw_dst=169.254.33.2 actions=ct(table=14,zone=NXM_NX_REG11[0..15],nat)


[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=14 |grep REG
 cookie=0x0, duration=44.396s, table=14, n_packets=3, n_bytes=474, idle_age=4, priority=125,ct_state=+est+trk,sctp,metadata=0x3,nw_src=169.254.33.1 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG12[0..15],nat)

Notice now zone=REG11 (zone 9, DNAT) is happening first in table 13, and then in table 14 +est now matches, and then I do nat on zone=REG12 (zone 6, SNAT). The connection then works fine.

I think we can close this bug and I'll update:
https://bugzilla.redhat.com/show_bug.cgi?id=1718372

Comment 12 Tim Rozet 2020-03-24 19:47:08 UTC
Actually I think we should keep this bug open and track this separately in OVN. It looks to be a bug with TCP as well, just somehow TCP traffic still works. Look at this trace:
===============================================================================
recirc(0x867) - resume conntrack with ct_state=est|trk
===============================================================================

Flow: recirc_id=0x867,ct_state=est|trk,ct_zone=9,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0

bridge("br-int")
----------------
    thaw
        Resuming from table 13
13. ip,metadata=0x3,nw_dst=169.254.33.2, priority 90, cookie 0x85a8499
    ct(table=14,zone=NXM_NX_REG12[0..15],nat)
    nat
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 14.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: recirc_id=0x867,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0
Megaflow: recirc_id=0x867,eth,tcp,in_port=2,nw_dst=169.254.33.2,nw_frag=no
Datapath actions: ct(zone=6,nat),recirc(0x868)

===============================================================================
recirc(0x868) - resume conntrack with ct_state=est|trk
===============================================================================
Flow: recirc_id=0x868,ct_state=est|trk,ct_zone=6,eth,tcp,reg9=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0

bridge("br-int")
----------------
    thaw
        Resuming from table 14
14. ct_state=+est+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586, priority 120, cookie 0xaa1ed9ed
    load:0x1->NXM_NX_REG10[3]
    ct(table=15,zone=NXM_NX_REG11[0..15],nat)
    nat
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 15.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: recirc_id=0x868,eth,tcp,reg9=0x8,reg10=0x8,reg11=0x9,reg12=0x6,reg14=0x2,metadata=0x3,in_port=2,vlan_tci=0x0000,dl_src=00:00:a9:fe:21:01,dl_dst=9e:4d:9a:a7:67:46,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=0,tp_dst=31586,tcp_flags=0
Megaflow: recirc_id=0x868,ct_state=-new+est-rel-rpl-inv+trk,ct_label=0/0x1,eth,tcp,in_port=2,nw_src=169.254.33.1,nw_dst=169.254.33.2,nw_frag=no,tp_dst=31586
Datapath actions: ct(zone=9,nat),recirc(0x869)


I can see the flow in table 14 never gets hit for est traffic either with TCP:
[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=13 |grep 0x85a8499
 cookie=0x85a8499, duration=9084.977s, table=13, n_packets=22838, n_bytes=11380232, idle_age=1, priority=90,ip,metadata=0x3,nw_dst=169.254.33.2 actions=ct(table=14,zone=NXM_NX_REG12[0..15],nat)
[root@ovn-control-plane ~]# ovs-ofctl -O openflow15 dump-flows br-int table=14 |grep 0xaa1ed9ed
 cookie=0xaa1ed9ed, duration=1282.030s, table=14, n_packets=0, n_bytes=0, idle_age=1282, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)


Then adding an invalid flow it matches...
 cookie=0x0, duration=13.198s, table=14, n_packets=1, n_bytes=66, idle_age=8, priority=120,ct_state=-new+inv+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)

Comment 13 Mark Michelson 2020-03-25 13:22:17 UTC
BTW, here's a relevant comment from the OVN northd code:

            /* Ingress UNSNAT table: It is for already established connections'
             * reverse traffic. i.e., SNAT has already been done in egress
             * pipeline and now the packet has entered the ingress pipeline as
             * part of a reply. We undo the SNAT here.
             *
             * Undoing SNAT has to happen before DNAT processing.  This is
             * because when the packet was DNATed in ingress pipeline, it did
             * not know about the possibility of eventual additional SNAT in
             * egress pipeline. */

To translate, UNSNAT is performed in table 13, and DNAT is performed in table 14.

The comment makes sense. Consider that SNAT and DNAT are configured on a gateway router. When initial external traffic comes into the router, it gets DNATted. Then the reply traffic gets UNDNATted (table 40) and then SNATted (table 41) on the way out. Now as future external traffic arrives at the router, it needs to get UNSNATted first because the DNAT was initially created with no knowledge of the SNAT that would later be applied.

In practice, something is fishy here. You're saying you have to DNAT first in order for conntrack to be happy. My hunch is that the reply traffic is not getting UNDNATted in table 40. Tim, could you dump your OpenFlow tables 40 and 41 so we can see what's happening there?

Comment 14 Tim Rozet 2020-03-25 13:43:40 UTC
Sure the relevant table 41 flows are:
 cookie=0x54b52682, duration=73327.780s, table=41, n_packets=412, n_bytes=40462, priority=100,ip,reg10=0x8/0x8,metadata=0x3 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=100.64.0.1))
 cookie=0x54b52682, duration=73327.776s, table=41, n_packets=0, n_bytes=0, priority=100,ipv6,reg10=0x8/0x8,metadata=0x3 actions=ct(commit,table=42,zone=NXM_NX_REG12[0..15],nat(src=100.64.0.1))

The above is effectively the same result. The key part of that comment from northd is the 2nd paragraph. The SNAT would happen in the egress pipeline. If you ignore table 13 (incorrectly matching), table 14 would do DNAT, then table 41 in egress pipeline would do SNAT, which is the the order I changed things so that it would work. I just did it in table 13, when table 13 flow should really not get hit at all (UNSNAT).

The problem is, nothing should be being UNSNAT, this traffic is destined to be NAT'ed. I think the lflow Numan said corresponded is just matching the packet on accident:

 table=5 (lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 169.254.33.2), action=(ct_snat;)

The traffic flow here is:

169.254.33.1  -------> 169.254.33.3 - router - 100.64.0.1 ------------->10.244.0.5
client                                                                     server

packet destined from client -> server is not NAT'ed yet, and accidentally matches the UNSNAT flow in table 13, except it matches no session in CT, and gets marked as invalid (because the faulty match is trying to SNAT before DNAT)

Comment 15 Tim Rozet 2020-03-25 13:44:52 UTC
I meant in the last comment router ip is 169.254.33.2*

Comment 16 Numan Siddique 2020-03-25 16:41:59 UTC
I found the fix for this issue.

The reason this happens is because 
 - First dnat happens
  - Then snat happens because of lb_force_snat_ip is set on the gateway router.

The first SCTP INIT and SCTP ACK goes fine without any issues.

When SCTP COOKIE ECHO is sent, the below lflow is hit,  since in this issue, the dnat_and_snat entry is also used as LB VIP
 table=5 (lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 169.254.33.2), action=(ct_snat;)

The above lflow causes, the destination ip to be natted to 10.244.0.5. And this causes the  below lflow to miss.

 table=6 (lr_in_dnat         ), priority=120  , match=(ct.est && ip && ip4.dst == 169.254.33.2 && sctp && sctp.dst == 31291), action=(flags.force_snat_for_lb = 1; ct_dnat;)

Since lags.force_snat_for_lb is not set, the snat of the src ip doesn't happen.



Adding the below lflows fixes this issue

 table=5 (lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 169.254.33.2 && sctp.dst == 31291), action=(next;)

https://github.com/numansiddique/ovn/commit/81b0474606e3ef66f4cba903e0a84ec710f89e38

Comment 17 Tim Rozet 2020-03-25 19:42:19 UTC
The fix works. Can we get that and Mark's patch merged upstream with a new F31 build? Then I can get it to pass in ovn-k8s CI. Thanks.

Comment 18 Mark Michelson 2020-03-26 00:14:19 UTC
Thanks Numan! I think we should get your patch added to my patch and make it a single patch series for review upstream.

Comment 19 Numan Siddique 2020-03-26 11:39:42 UTC
Mark - I think the issue is seen even for TCP. So I submitted the fix here - https://patchwork.ozlabs.org/patch/1261948/

Can you please rebase your patch on top of this and test it out ?

To summarize, I don't think this is an OVN issue. But we have a way to avoid this and the patch I submitted does that.

But looks like there is an issue in ovs-vswitchd.

I've pasted the commit message below which has some details

*****
Suppose there is below NAT entry with external_ip = 172.168.0.100

nat <UUID>
    external ip: "172.168.0.100"
    logical ip: "10.0.0.0/24"
    type: "snat"

And a load balancer with the VIP - 172.168.0.100

_uuid               : <UUID>
external_ids        : {}
name                : lb1
protocol            : tcp
vips                : {"172.168.0.100:8080"="10.0.0.4:8080"}

And if these are associated to a gateway logical router

Then we will see the below lflows in the router pipeline

...
table=5 (lr_in_unsnat       ), priority=90   , match=(ip && ip4.dst == 172.168.0.100), action=(ct_snat;)
...
table=6 (lr_in_dnat         ), priority=120  , match=(ct.new && ip && ip4.dst == 172.168.0.100 && tcp && tcp.dst == 8080), action=(ct_lb(10.0.0.4:8080);)
table=6 (lr_in_dnat         ), priority=120  , match=(ct.est && ip && ip4.dst == 172.168.0.100 && tcp && tcp.dst == 8080), action=(ct_dnat;)

When a new connection packet destinated for the lb vip 172.168.0.100 and tcp.dst = 8080
is received, the ct.new flow in the lr_in_dnat is hit and the packet's ip4.dst is
dnatted to 10.0.0.4 in the dnat conntrack zone.

But for the subsequent packet destined to the vip, the ct.est lflow in the lr_in_dnat
stage doesn't get hit. In this case, the packet first hits the lr_in_unsnat pri 90 flow
as mentioned above with the action ct_snat. Even though ct_snat should have no effect,
but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags
are not set properly.

In the case of tcp, the ct.new flow is hit instead of ct.est. In the the case of sctp, neither of the above
lflows in lr_in_dnat stage hit.

This needs to be investigated further. But we can avoid this scenario in OVN
by adding the below lflow.

table=5 (lr_in_unsnat       ), priority=120  , match=(ip4 && ip4.dst == 172.168.0.100 && tcp.dst == 8080), action=(next;)
****

Comment 20 Mark Michelson 2020-03-26 12:24:59 UTC
"Even though ct_snat should have no effect,
but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags
are not set properly."

I think it's debatable about whether that should have no effect. However, it definitely should have a consistent effect. In TCP's case, it appears to reset the ct flags. In SCTP's case, it sets the ct state to invalid. I think the SCTP behavior makes sense. If we try to ct_snat but there is no conntrack entry, then we're trying something invalid.

I'll save patch comments for the ML. Thanks Numan!

Comment 21 Tim Rozet 2020-03-26 13:43:01 UTC
IMO this is incorrect:
"Even though ct_snat should have no effect,
but looks like there is some issue in either ovs-vswitchd or in datapath, and the ct flags
are not set properly."

It most certainly has an effect. Look at comment https://bugzilla.redhat.com/show_bug.cgi?id=1815217#c9

By sending to CT where there is no match, the packet is marked as CT invalid, which will not match the +est flow. This is proper behavior of OVS.

Comment 22 Tim Rozet 2020-03-26 13:45:40 UTC
"In TCP's case, it appears to reset the ct flags." It also sets the ct state to invalid, see:

cookie=0x0, duration=13.198s, table=14, n_packets=1, n_bytes=66, idle_age=8, priority=120,ct_state=-new+inv+trk,tcp,metadata=0x3,nw_dst=169.254.33.2,tp_dst=31586 actions=set_field:0x8/0x8->reg10,ct(table=15,zone=NXM_NX_REG11[0..15],nat)

and comment: https://bugzilla.redhat.com/show_bug.cgi?id=1815217#c12

Comment 23 Mark Michelson 2020-03-26 13:54:32 UTC
Ah, OK, I misinterpreted this part of the commit message: 

"In the case of tcp, the ct.new flow is hit instead of ct.est"

I suppose both are setting the invalid flag, but TCP is also setting/keeping the new flag as well.

Comment 24 Mark Michelson 2020-03-26 13:56:14 UTC
Wait, but your matching flow for table 14 had -new+inv, which indicates that TCP doesn't have the new flag set. Is the commit message just incorrect?

Comment 25 Tim Rozet 2020-03-26 14:07:46 UTC
It's the same issue for both protocols. TCP just happens to still make it to the endpoint because of some other flows in the pipeline handling it accidentally. I didn't bother to figure out what that path is, as I don't see a need to.

Comment 27 Numan Siddique 2020-03-30 11:32:54 UTC
Do we need this in downstream OVN 2.12 ? How about we backport this to only OVN 2.13 ?

Comment 31 ying xu 2020-04-10 08:46:57 UTC
set the env:
client:

		ovs-vsctl add-br br-provider
		ovs-vsctl add-port br-provider $nic_test2
		ip link set $nic_test2 up
		ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
		ip link set br-provider up
		ip netns add client1
		ip link add veth0_c1_p type veth peer name veth0_c1 netns client1
		ip link set veth0_c1_p up
		ovs-vsctl add-port br-provider veth0_c1_p
		ip netns exec client1 ip link set lo up
		ip netns exec client1 ip link set veth0_c1 up
		ip netns exec client1 ip addr add 172.16.1.2/24 dev veth0_c1
		ip netns exec client1 ip addr add 2002::2/64 dev veth0_c1

		
server:
	
		ovs-vsctl add-br br-provider
		ovs-vsctl add-port br-provider $nic_test2
		ip link set $nic_test2 up
		ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
		ip link set br-provider up

		ip netns add client0
		ip link add veth0_c0_p type veth peer name veth0_c0 netns client0
		ip link set veth0_c0_p up
		ovs-vsctl add-port br-provider veth0_c0_p
		ip netns exec client0 ip link set lo up
		ip netns exec client0 ip link set veth0_c0 up
		ip netns exec client0 ip addr add 172.16.1.1/24 dev veth0_c0
		ip netns exec client0 ip addr add 2002::1/64 dev veth0_c0

		ovn-nbctl ls-add ls1
		ovn-nbctl lsp-add ls1 ls1p1
		ovn-nbctl lsp-set-addresses ls1p1 00:01:02:01:01:01

		ovn-nbctl lsp-add ls1 ls1p2
		ovn-nbctl lsp-set-addresses ls1p2 00:01:02:01:01:02

		ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
		ip netns add server0
		ip link set vm1 netns server0
		ip netns exec server0 ip link set lo up
		ip netns exec server0 ip link set vm1 up
		ip netns exec server0 ip link set vm1 address 00:01:02:01:01:01
		ip netns exec server0 ip addr add 192.168.0.1/24 dev vm1
		ip netns exec server0 ip addr add 3001::1/64 dev vm1
		ip netns exec server0 ip route add default via 192.168.0.254 dev vm1
		ip netns exec server0 ip -6 route add default via 3001::a dev vm1
		ovs-vsctl set Interface vm1 external_ids:iface-id=ls1p1

		ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
		ip netns add server1
		ip link set vm2 netns server1
		ip netns exec server1 ip link set lo up
		ip netns exec server1 ip link set vm2 up
		ip netns exec server1 ip link set vm2 address 00:01:02:01:01:02
		ip netns exec server1 ip addr add 192.168.0.2/24 dev vm2
		ip netns exec server1 ip addr add 3001::2/64 dev vm2
		ip netns exec server1 ip route add default via 192.168.0.254 dev vm2
		ip netns exec server1 ip -6 route add default via 3001::a dev vm2
		ovs-vsctl set Interface vm2 external_ids:iface-id=ls1p2

		ovn-nbctl lr-add lr1
		ovn-nbctl lrp-add lr1 lr1ls1 00:01:02:0d:01:01 192.168.0.254/24 3001::a/64

		ovn-nbctl lsp-add ls1 ls1lr1
		ovn-nbctl lsp-set-type ls1lr1 router
		ovn-nbctl lsp-set-options ls1lr1 router-port=lr1ls1
		ovn-nbctl lsp-set-addresses ls1lr1 00:01:02:0d:01:01 192.168.0.254 3001::a

		#add sctp lb with fip
		ovn-nbctl ls-add public
		ovn-nbctl lsp-add public plr1
		ovn-nbctl lsp-set-type plr1 router
		ovn-nbctl lsp-set-options plr1 router-port=lr1p
		ovn-nbctl lsp-set-addresses plr1 00:01:02:0d:0f:01 172.16.1.254 2002::a
		ovn-nbctl lsp-add public ln_public
		ovn-nbctl lsp-set-type ln_public localnet
		ovn-nbctl lsp-set-addresses ln_public unknown
		ovn-nbctl lsp-set-options ln_public network_name=provider

		ovn-nbctl lrp-set-gateway-chassis lr1p hv1 20
		ovn-nbctl lrp-set-gateway-chassis lr1p hv0 10

		ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.10 192.168.2.1
		ovn-nbctl lr-nat-add lr1 dnat_and_snat 2002::100 3000::100
		ip netns exec server0 ping 192.168.0.2 -c 3
		ovn-nbctl show
		ovn-sbctl show
		sleep 10
		ovn-nbctl lb-add lb1 192.168.2.1:80 192.168.0.1:80,192.168.0.2:80
                ovn-nbctl lb-add lb1 172.16.1.10:80 192.168.0.1:80,192.168.0.2:80
                ovn-nbctl lb-add lb1 [3000::100]:80 [3001::1]:80,[3001::2]:80
                ovn-nbctl lb-add lb1 [2002::100]:80 [3001::1]:80,[3001::2]:80
                ovn-nbctl lr-lb-add lr1 lb1
		uuid=`ovn-nbctl list Load_Balancer |grep uuid|awk '{printf $3}'`
                ovn-nbctl set Load_Balancer $uuid protocol=tcp

I reproduced this bug on version:
# rpm -qa|grep ovn
ovn2.13-host-2.13.0-7.el8fdp.x86_64
ovn2.13-central-2.13.0-7.el8fdp.x86_64
ovn2.13-2.13.0-7.el8fdp.x86_64

start ncat server on server:
ip netns exec server0 nc -l 80 -k &
ip netns exec server1 nc -l 80 -k &

start ncat client on client:
ip netns exec client1 nc --sctp 172.16.1.10 80&

then,on server,
ovs-ofctl -O openflow15 dump-flows br-int |grep tcp |grep table=14 |grep 172.16.1.10
 cookie=0x8824cf34, duration=2727.250s, table=14, n_packets=0, n_bytes=0, idle_age=2727, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=ct(table=15,zone=NXM_NX_REG11[0..15],nat)
 cookie=0xccb79a1d, duration=2727.251s, table=14, n_packets=5, n_bytes=346, idle_age=459, priority=120,ct_state=+new+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=group:1

in the first flow,ct_state=+est+trk,and the n_packets=0,the packet didn't increase correctly. all packets increase to the second flow

verified on version:
# rpm -qa|grep ovn
ovn2.13-host-2.13.0-11.el8fdp.x86_64
ovn2.13-central-2.13.0-11.el8fdp.x86_64
ovn2.13-2.13.0-11.el8fdp.x86_64

# ovs-ofctl -O openflow15 dump-flows br-int |grep tcp |grep table=14 |grep 172.16.1.10
 cookie=0x4ac41557, duration=366.568s, table=14, n_packets=3, n_bytes=198, idle_age=20, priority=120,ct_state=+est+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=ct(table=15,zone=NXM_NX_REG11[0..15],nat)
 cookie=0x68475f08, duration=366.568s, table=14, n_packets=2, n_bytes=148, idle_age=20, priority=120,ct_state=+new+trk,tcp,metadata=0x3,nw_dst=172.16.1.10,tp_dst=80 actions=group:2

here we can see that in the first flow ct_state=+est+trk,the packets counts increase correctly.

Comment 33 errata-xmlrpc 2020-04-20 19:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1501


Note You need to log in before you can comment on or make changes to this bug.