Bug 2004995 - OVN: use stateful snat_and_dnat for FIPs
Summary: OVN: use stateful snat_and_dnat for FIPs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: z16
: ---
Assignee: Ihar Hrachyshka
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-16 14:30 UTC by Haresh Khandelwal
Modified: 2022-06-02 06:51 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-21 12:13:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 838776 0 None NEW ovn: revert to stateful dnat_and_snat 2022-06-02 06:51:53 UTC
Red Hat Issue Tracker FD-1546 0 None None None 2021-09-18 19:08:40 UTC
Red Hat Issue Tracker NFV-2340 0 None None None 2021-11-30 14:46:11 UTC
Red Hat Issue Tracker OSP-11104 0 None None None 2021-11-29 20:36:50 UTC

Description Haresh Khandelwal 2021-09-16 14:30:39 UTC
Description of problem:

Since statefull NAT is not being offloaded, i tried to explore stateless. 

[root@hareshcontrollersriov-0 /]# ovn-nbctl lr-nat-list 473faace-478c-4841-b438-84c3ebaaa528
TYPE             EXTERNAL_IP        EXTERNAL_PORT    LOGICAL_IP            EXTERNAL_MAC         LOGICAL_PORT
dnat_and_snat    10.10.54.129                        7.7.7.68              fa:16:3e:fd:50:3c    4f408c64-7474-4f58-81ec-3f8abba72562
snat             10.10.54.140                        7.7.7.0/24

ovn-nbctl lr-nat-del 473faace-478c-4841-b438-84c3ebaaa528 dnat_and_snat 10.10.54.129
ovn-nbctl --stateless lr-nat-add 473faace-478c-4841-b438-84c3ebaaa528 dnat_and_snat 10.10.54.129 7.7.7.68 4f408c64-7474-4f58-81ec-3f8abba72562 fa:16:3e:fd:50:3c

Flow table:

ufid:9a58f50a-66c0-4983-849d-d98a4e889af6, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0_2),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f2,dst=fa:16:3e:84:64:93),eth_type(0x0800),ipv4(src=7.7.7.68,dst=0.0.0.0/0.0.0.0,proto=1,tos=0/0,ttl=0/0,frag=no),icmp(type=0/0,code=0/0), packets:21, bytes:1764, used:0.810s, offloaded:yes, dp:tc, actions:ct(zone=9),recirc(0x81)

ufid:907831fd-eb36-450c-8c43-9d17aab6b351, skb_priority(0/0),skb_mark(0/0),ct_state(0x2a/0x3e),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),recirc_id(0x81),dp_hash(0/0),in_port(enp4s0f0_2),packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f2,dst=fa:16:3e:84:64:93),eth_type(0x0800),ipv4(src=7.7.7.68,dst=10.10.54.100,proto=1,tos=0/0,ttl=64,frag=no),icmp(type=0/0,code=0/0), packets:21, bytes:1764, used:0.810s, offloaded:yes, dp:tc, actions:ct_clear,ct_clear,set(eth(src=fa:16:3e:2b:0f:5d,dst=72:f9:61:09:b9:79)),set(ipv4(src=10.10.54.129,ttl=63)),push_vlan(vid=405,pcp=0),mx-bond

ufid:844ea973-5817-47f9-a94d-a45d746e0b9e, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=72:f9:61:09:b9:79,dst=fa:16:3e:2b:0f:5d),eth_type(0x8100),vlan(vid=405,pcp=0),encap(eth_type(0x0800),ipv4(src=10.10.54.0/255.255.255.128,dst=10.10.54.129,proto=1,tos=0/0,ttl=64,frag=no),icmp(type=0/0,code=0/0)), packets:21, bytes:1764, used:0.810s, dp:tc, actions:ct_clear,ct_clear,set(eth(src=fa:16:3e:84:64:93,dst=f8:f2:1e:03:bf:f2)),set(ipv4(dst=7.7.7.68,ttl=63)),pop_vlan,ct(zone=9),recirc(0x80)

ufid:3e0a0eb7-c1bb-4dba-ae58-0cde3e6ed6d6, skb_priority(0/0),skb_mark(0/0),ct_state(0x22/0x3e),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),recirc_id(0x80),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:84:64:93,dst=f8:f2:1e:03:bf:f2),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=7.7.7.68,proto=1,tos=0/0,ttl=0/0,frag=no),icmp(type=0/0,code=0/0), packets:20, bytes:1680, used:0.810s, offloaded:yes, dp:tc, actions:enp4s0f0_2

Version-Release number of selected component (if applicable):
ovn-2021-21.03.0-40.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. configure stateless dnat_snat rule and ping IP.

Actual results:
I believe ct_clear actions called 2 times in the flow and may be the reason for not getting this offlaoded. 

Expected results:
Should have all flows offloaded.

Additional info:

Comment 1 Marcelo Ricardo Leitner 2021-09-16 17:54:21 UTC
(In reply to Haresh Khandelwal from comment #0)
> Description of problem:
> 
> Since statefull NAT is not being offloaded, i tried to explore stateless. 
...
> Flow table:
> 
> ufid:9a58f50a-66c0-4983-849d-d98a4e889af6,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),
> ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp4s0f0_2),packet_type(ns=0/
> 0,id=0/0),eth(src=f8:f2:1e:03:bf:f2,dst=fa:16:3e:84:64:93),eth_type(0x0800),
> ipv4(src=7.7.7.68,dst=0.0.0.0/0.0.0.0,proto=1,tos=0/0,ttl=0/0,frag=no),
> icmp(type=0/0,code=0/0), packets:21, bytes:1764, used:0.810s, offloaded:yes,
> dp:tc, actions:ct(zone=9),recirc(0x81)
                ^^^^

I'm confused. If this should be stateless, why is CT being used?
I was thinking this would go full stateless, but then seems not.

> 
> ufid:907831fd-eb36-450c-8c43-9d17aab6b351,
> skb_priority(0/0),skb_mark(0/0),ct_state(0x2a/0x3e),ct_zone(0/0),ct_mark(0/
> 0),ct_label(0/0x1),recirc_id(0x81),dp_hash(0/0),in_port(enp4s0f0_2),
> packet_type(ns=0/0,id=0/0),eth(src=f8:f2:1e:03:bf:f2,dst=fa:16:3e:84:64:93),
> eth_type(0x0800),ipv4(src=7.7.7.68,dst=10.10.54.100,proto=1,tos=0/0,ttl=64,
> frag=no),icmp(type=0/0,code=0/0), packets:21, bytes:1764, used:0.810s,
> offloaded:yes, dp:tc,
> actions:ct_clear,ct_clear,set(eth(src=fa:16:3e:2b:0f:5d,dst=72:f9:61:09:b9:
> 79)),set(ipv4(src=10.10.54.129,ttl=63)),push_vlan(vid=405,pcp=0),mx-bond

Btw, the NIC has a limitation that it can't set the 5-tuple like this after a ct() action on the same flow.
"btw", because here it's using ct_clear and not ct, so it shouldn't be a problem.

> 
> ufid:844ea973-5817-47f9-a94d-a45d746e0b9e,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),
> ct_label(0/0x1),recirc_id(0),dp_hash(0/0),in_port(mx-bond),packet_type(ns=0/
> 0,id=0/0),eth(src=72:f9:61:09:b9:79,dst=fa:16:3e:2b:0f:5d),eth_type(0x8100),
> vlan(vid=405,pcp=0),encap(eth_type(0x0800),ipv4(src=10.10.54.0/255.255.255.
> 128,dst=10.10.54.129,proto=1,tos=0/0,ttl=64,frag=no),icmp(type=0/0,code=0/
> 0)), packets:21, bytes:1764, used:0.810s, dp:tc,
> actions:ct_clear,ct_clear,set(eth(src=fa:16:3e:84:64:93,dst=f8:f2:1e:03:bf:
> f2)),set(ipv4(dst=7.7.7.68,ttl=63)),pop_vlan,ct(zone=9),recirc(0x80)
> 
> ufid:3e0a0eb7-c1bb-4dba-ae58-0cde3e6ed6d6,
> skb_priority(0/0),skb_mark(0/0),ct_state(0x22/0x3e),ct_zone(0/0),ct_mark(0/
> 0),ct_label(0/0x1),recirc_id(0x80),dp_hash(0/0),in_port(mx-bond),
> packet_type(ns=0/0,id=0/0),eth(src=fa:16:3e:84:64:93,dst=f8:f2:1e:03:bf:f2),
> eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=7.7.7.68,proto=1,tos=0/0,ttl=0/
> 0,frag=no),icmp(type=0/0,code=0/0), packets:20, bytes:1680, used:0.810s,
> offloaded:yes, dp:tc, actions:enp4s0f0_2
> 
> Version-Release number of selected component (if applicable):
> ovn-2021-21.03.0-40.el8fdp.x86_64
> 
> How reproducible:
> Always
> 
> Steps to Reproduce:
> 1. configure stateless dnat_snat rule and ping IP.
> 
> Actual results:
> I believe ct_clear actions called 2 times in the flow and may be the reason
> for not getting this offlaoded. 

They are harmless, actually, and it is getting fixed in:
https://bugzilla.redhat.com/show_bug.cgi?id=1941027

> 
> Expected results:
> Should have all flows offloaded.
> 
> Additional info:

Can you please try that perf probe to capture the extack msg? Even though dp:tc accepted the flow,
the driver probably logged its reasoning on it.

Comment 11 Marcelo Ricardo Leitner 2021-09-24 22:18:44 UTC
To our Nvidia friends:

AFAICT ufid:844ea973-5817-47f9-a94d-a45d746e0b9e on comment #0 is not offloaded because of:

# perf script
handler2  1789 [000] 24404.475681: netlink:netlink_extack: msg=mlx5_core: can't offload re-write of ipv4 address with action ct
        ffffffff9fe037e4 do_trace_netlink_extack+0x44 ([kernel.kallsyms])
        ffffffff9fe037e4 do_trace_netlink_extack+0x44 ([kernel.kallsyms])
        ffffffffc04ee98f actions_match_supported+0x18f ([kernel.kallsyms])
        ffffffffc04f3750 parse_tc_fdb_actions+0x2f0 ([kernel.kallsyms])
        ffffffffc04f4492 __mlx5e_add_fdb_flow+0x192 ([kernel.kallsyms])
        ffffffffc04f4bbe mlx5e_configure_flower+0x64e ([kernel.kallsyms])
        ffffffff9fdf8d7a tc_setup_cb_add+0xca ([kernel.kallsyms])
        ffffffffc0d5abc9 fl_hw_replace_filter+0x139 ([kernel.kallsyms])
        ffffffffc0d60523 fl_change+0xb53 ([kernel.kallsyms])
        ffffffff9fdfccf6 tc_new_tfilter+0x336 ([kernel.kallsyms])
        ffffffff9fda944b rtnetlink_rcv_msg+0x17b ([kernel.kallsyms])
        ffffffff9fe0806c netlink_rcv_skb+0x4c ([kernel.kallsyms])
        ffffffff9fe0788e netlink_unicast+0x19e ([kernel.kallsyms])
        ffffffff9fe07b54 netlink_sendmsg+0x204 ([kernel.kallsyms])
        ffffffff9fd741fc sock_sendmsg+0x4c ([kernel.kallsyms])
        ffffffff9fd7451b ____sys_sendmsg+0x1eb ([kernel.kallsyms])
        ffffffff9fd75b9c ___sys_sendmsg+0x7c ([kernel.kallsyms])
        ffffffff9fd75c67 __sys_sendmsg+0x57 ([kernel.kallsyms])
        ffffffff9f60420b do_syscall_64+0x5b ([kernel.kallsyms])
        ffffffffa00000ad entry_SYSCALL_64_after_hwframe+0x65 ([kernel.kallsyms])
            7f810674c857 __libc_sendmsg+0x47 (/usr/lib64/libpthread-2.28.so)
                       0 [unknown] ([unknown])

The flow has this action set:
actions:ct_clear,ct_clear,set(eth(src=fa:16:3e:84:64:93,dst=f8:f2:1e:03:bf:f2)),set(ipv4(dst=7.7.7.68,ttl=63)),pop_vlan,ct(zone=9),recirc(0x80)

The set() is done *before* the ct(), but driver complained. Reading its code, it checks if ct() is being done ([1]), and then reject the offload when it detects the set() in function is_action_keys_supported() [2].

For now at least I'm just trying to understand the set/ct restriction here. Is it just how the driver was coded or that's expected? I had understood that the order matters and doing the set() before ct() would be okay.

For the record, as I discussed with Haresh, I'm willing to be educated but I don't see value in such stateless NAT while using CT().
This was attempted as an alternative to https://bugzilla.redhat.com/show_bug.cgi?id=1984953 and unless CT usage can be avoided altogether, it's better to fix the original issue in there.

Comment 13 Moshe Levi 2021-09-29 13:52:53 UTC
stateless NAT avoid using the DNAT and SNAT zones in ovn. (which caused not committing the connecting)  
The VM zone is still being used for security groups ( I believe ). I think we need to check if in ovn we can change the order that fist we will do ct and than the header rewrite.

Comment 14 Marcelo Ricardo Leitner 2021-10-27 15:54:48 UTC
Folks, I'm setting priority and impact as Low here per what we discussed on the last call.

Stateless NAT will hit an issue with CT HWOL design and NIC limitations that are much tougher to solve than fixing the original bug.

Considering that stateless NAT does make sense for the DVR use case here, it was attempted mostly as a workaround for the original bug and its usage in there likely doesn't bring noticeable gains because CT HWOL is still used due to Security Groups.


The issue:
                         vv--B
egress: ct() -> match ct -> change 5-tuple -> output
             ^^--- rule break

ingress: change 5-tuple -> ct() -> match ct -> output
                        ^^--A   ^^--- rule break

The order of the actions and matches.
As we need the CT zone to be consistent, in one of the flows it needs a 5-tuple change before the ct() action. But if ct() misses, the packet that will go to the host is already changed, and will trigger a false miss in SW.

The break between ct() and match ct is ok for OVS to do, but there is no easy way to force that in [A] without re-inserting the issue in [B] (like if we just swap the order).

I'm leaving this bug open mostly so we can think about possibly documenting this incompatibility instead.

Comment 15 Moshe Levi 2021-10-28 08:52:26 UTC
so I understand it more default to solve the rewrite ct order. But I just want to raise that openstack ovn already merge [1] to use stateless NAT rules for FIPs.

If stateful nat is the only way, we should have config option to be able to switch between the modes


[1] - https://review.opendev.org/c/openstack/neutron/+/804807

Comment 16 Marcelo Ricardo Leitner 2021-10-28 14:45:01 UTC
Right. Haresh mentioned on the mtg today that he will discuss a way forward for OSP.

Haresh, maybe you wanna take this bug for now?

Comment 17 Ihar Hrachyshka 2021-11-01 17:56:09 UTC
@moshele could you please clarify what's the implication for OSP here? I admit I don't know much about CT design (and what HWOL stands for), but please describe the next steps for OSP here. Was the patch in neutron not needed / detrimental?

Comment 18 Moshe Levi 2021-11-01 20:37:54 UTC
Yes, 

As you remember I suggested to move from stateful NAT FIP to  stateless NT FIP in the community. 
The reason was because we thought ovs hardware offload works with stateless NT FIP. It seem that we did a mistake and the offload only on one direction. 
Now there is a fix in ovn that stateful NAT FIP we can do ovs offload on both direction and the stateless NT FIP is problematic to solve (we have limitation in the design that prevent it to do full offload)

Beacase you PR in neutron change the stateful NAT FIP to stateless NT FIP, we need a way to make it configurable in  networking-ovn. So OSP will deployed hardware offload with stateful NAT FIP

make sense?

Comment 19 Ihar Hrachyshka 2021-11-01 22:15:58 UTC
@moshele we can of course introduce a switch between two values. But before we do that, are we really dealing with two different non-overlapping situations where each value makes sense, but not the other? If we can offload stateful in both directions, should we make it default again (reverting the neutron change)? Or is it that stateless is useful when no hw offloading is enabled? Please elaborate when it still makes sense to use stateless dnat_and_snat. Thanks.

Comment 20 Moshe Levi 2021-11-02 07:24:14 UTC
so I think Sean Mooney(RH) said that it may improve performance for no hw offload case. Maybe it better to get his input on this.

Comment 21 Ihar Hrachyshka 2021-11-04 01:02:40 UTC
At this point, considering we already introduced a neutron patch that has debatable value in real life based on common sense assumptions, it would be good to have actual measurements for both options with and without hw offload, so that we can make a better call on whether both should be supported. Just a thought.

Comment 22 Marcelo Ricardo Leitner 2021-11-09 20:13:19 UTC
Folks, seems this bz is now best assigned to OSP, to track this config option change. Thoughts?

Comment 23 Ihar Hrachyshka 2021-11-29 20:36:20 UTC
Turned out that it was a mistake to switch to stateless snat_and_dnat rules for FIPs in Neutron OVN driver, at least for hw offloaded scenarios. It's not clear if there is any benefit from using stateless for non-offloaded scenarios, though there are some indications / opinions that there should be. We have two options on Neutron side:

1) just revert https://review.opendev.org/c/openstack/neutron/+/804807 (this can't be a clean revert because we will have to migrate OVN db back to stateful);
2) introduce a config option to switch between stateless and stateful depending on offload status.

Since we'd like to avoid adding config options if possible, (2) will need some hard data to justify the complexity. (We did implement the original switch to stateless under assumptions / opinions and not data, and we should probably avoid doing the same here.)

Regardless, the fix would belong to Neutron OVN driver, not OVN. Switching components.

Comment 25 Haresh Khandelwal 2021-11-30 10:33:08 UTC
Thanks Ihar,

(In reply to Ihar Hrachyshka from comment #23)
> Turned out that it was a mistake to switch to stateless snat_and_dnat rules
> for FIPs in Neutron OVN driver, at least for hw offloaded scenarios. It's
> not clear if there is any benefit from using stateless for non-offloaded
> scenarios, though there are some indications / opinions that there should
> be. We have two options on Neutron side:
> 
> 1) just revert https://review.opendev.org/c/openstack/neutron/+/804807 (this
> can't be a clean revert because we will have to migrate OVN db back to
> stateful);
> 2) introduce a config option to switch between stateless and stateful
> depending on offload status.
> 
> Since we'd like to avoid adding config options if possible, (2) will need
> some hard data to justify the complexity. (We did implement the original
> switch to stateless under assumptions / opinions and not data, and we should
> probably avoid doing the same here.)
> 
> Regardless, the fix would belong to Neutron OVN driver, not OVN. Switching
> components.

In ideal scenario, we will have stack further up from Layer 3. That means, FIP is stateless but L4-L7 traffic may need stateful ACLs. I think we don't want to try this combination (FIP stateless, TCP stateful). 
So, it would be ideal to revert the patch as giving option may introduce confusion to customer with permutations and combinations. 

This Bz is originally create to explain stateless NAT offload issue and fix of this is no more on the table as explained in comment#14. 

Thanks

Comment 26 Miro Tomaska 2021-11-30 14:38:03 UTC
Haresh, we would like to see more data if possible. Have we done performance benchmarks on stateless vs stateful?

Comment 27 Haresh Khandelwal 2021-12-01 12:31:43 UTC
(In reply to Miro Tomaska from comment #26)
> Haresh, we would like to see more data if possible. Have we done performance
> benchmarks on stateless vs stateful?

Hi Miro,

We have not tried stateless vs stateful with perf data yet. 
May i know why you need this data? Is there any decision impacting it? 

Thanks
-Haresh

Comment 28 Ihar Hrachyshka 2021-12-01 18:10:00 UTC
Considering comment 25, I believe reverting to stateful with no option is the best approach since it's confirmed that permutations would complicate the matter for unknown benefit.

Comment 31 Ihar Hrachyshka 2022-04-20 19:19:00 UTC
Patch to revert the upstream patch here: https://review.opendev.org/c/openstack/neutron/+/838776


Note You need to log in before you can comment on or make changes to this bug.