This issue is highlighted by the system test "SNAT in separate zone from DNAT":

S1 == R1 == public == R2 == S2

There are three issues related to this test case.

1) Different behavior on Ubuntu/Fedora

The test succeeds on GitHub, Ubuntu 20.04 and Fedora 35. It fails on Fedora 34. On all those OSes, the OVN behavior is the same: an echo request is sent to a load balancer, and the echo reply is received with a wrong source address (the packet is not unDNATted and arrives with the LB backend IP). The different behavior across the OSes is due to different ping versions and how ping behaves when it receives an echo reply with a wrong source address: it fails on Fedora 34, but succeeds with a warning on Fedora 35. The test should fail on all OSes, as the unDNAT has not been done properly; this is a test-related issue and it can be fixed by properly checking the conntrack entries.

2) The unDNAT is not done properly for any icmp reply

The unDNAT flows in lr_out_undnat expect (for the return traffic in R1) the outport to be a l3dgw_port. This is not the case in this scenario (where the l3dgw_port is an inport). This causes the unDNAT to never happen.

3) Even if the previous issue is fixed (i.e. the unDNAT flows do get hit), the first icmp reply is still not unDNATted properly. For the first echo request:
- The packet is DNATted (in the dnat zone) in table lr_in_dnat, and "ct_mark=2" is set.
- The packet is sent to the controller (pinctrl) and is resubmitted to OVS in table 37; the ct_mark is lost, and SNAT and DNAT end up trying to use the same zone.
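The consequence of issue 3 can be illustrated with a grossly simplified toy model (this is not OVN or kernel conntrack code; the "one NAT entry per zone, later commit clobbers the earlier one" rule is an assumption standing in for the real conflict between the two translations sharing a zone):

```python
# Toy model: why SNAT and DNAT colliding in one conntrack zone breaks
# un-DNAT of the echo reply. Addresses are taken from the test scenario.
CLIENT, SNAT_IP = "173.0.1.2", "172.16.0.101"
VIP, BACKEND = "30.0.0.1", "172.16.0.102"

def simulate(dnat_zone, snat_zone):
    zones = {}

    def commit(zone, kind, orig, new):
        # Simplification: one NAT entry per connection per zone; a later
        # commit in the same zone clobbers the earlier one.
        zones[zone] = (kind, orig, new)

    # Echo request CLIENT -> VIP through the router:
    commit(dnat_zone, "dnat", VIP, BACKEND)     # lr_in_dnat: dst VIP -> backend
    commit(snat_zone, "snat", CLIENT, SNAT_IP)  # snat: src CLIENT -> SNAT IP

    # Echo reply BACKEND -> SNAT_IP: reverse whatever each zone remembers.
    src, dst = BACKEND, SNAT_IP
    for kind, orig, new in zones.values():
        if kind == "snat" and dst == new:
            dst = orig                          # un-SNAT
        elif kind == "dnat" and src == new:
            src = orig                          # un-DNAT
    return src, dst
```

With separate zones, `simulate(1, 2)` restores the reply to VIP -> CLIENT; with a single shared zone, `simulate(1, 1)` returns the reply with the backend IP as source, which is exactly the observed failure.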
After discussion with Xavier and poking around with the test, there do indeed seem to be two issues.

1) The first issue is with first-packet traffic. Because of the ARP resolution we lose the ct_mark/label, which results in SNAT happening in the common zone. This has the following consequences:
- The original traffic arrives at the destination, and a CT entry for the SNAT is created in the common zone.
- The reply traffic goes through the LR pipeline and hits bug 2 (described below): the unSNAT happens in the common zone, but because of the conflict the unDNAT cannot be done properly (they are in the same zone).
- The traffic arrives with a wrong source address.

2) In the ingress router pipeline, the unSNAT does not have proper state matching to decide whether it should unSNAT in the separate or the common zone.
- There are logical flows that differentiate between the common and the separate SNAT zone:

table=4 (lr_in_unsnat ), priority=100 , match=(ip && ip4.dst == 172.16.0.101 && inport == "r1_public" && flags.loopback == 0 && is_chassis_resident("cr-r1_public")), action=(ct_snat_in_czone;)
table=4 (lr_in_unsnat ), priority=100 , match=(ip && ip4.dst == 172.16.0.101 && inport == "r1_public" && flags.loopback == 1 && flags.use_snat_zone == 1 && is_chassis_resident("cr-r1_public")), action=(ct_snat;)

- In order to do the unSNAT in the separate zone we need flags.loopback == 1 and flags.use_snat_zone == 1; those two conditions are set only if the traffic is local (e.g. hairpin) and is sent back to the same port via "lr_out_egr_loop".
- This also has the consequence that once the MAC binding is learned and the CT entry in the common zone expires, all traffic is dropped because it is not properly unSNATted.

The outcome is that SNAT and LB done on distributed router ports suffer from this issue. One possible fix is to use the separate zone for SNAT every time; however, I'm not sure whether that would have any other consequences.
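The zone choice encoded by the two lr_in_unsnat flows quoted above can be sketched as a single predicate (this is a sketch, not northd code; the string return values just name the two actions):

```python
# The dedicated SNAT zone is chosen only when both flags are set, i.e.
# only for hairpinned traffic that went through lr_out_egr_loop;
# everything else, including replies from external networks, falls back
# to the common zone.
def unsnat_zone(loopback: bool, use_snat_zone: bool) -> str:
    if loopback and use_snat_zone:
        return "ct_snat"            # separate SNAT zone (second flow)
    return "ct_snat_in_czone"       # common zone (first flow)
```

A reply coming back from an external network has flags.loopback == 0, so it is always un-SNATted in the common zone, which is the missing-state-matching problem described above.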
The behavior when the CT entry related to the 1st ping (SNAT in the common zone) has expired has changed a few times recently:
- Before commit "northd: Add logical flow to defrag ICMP traffic", the return packet had a wrong src address (the return packet was not unDNATted).
- Then, before commit "northd: Drop packets destined to router owned NAT IP for DGP", it ... worked (correct reply packet).
- Then, as indicated above, after that commit, there is no reply packet anymore.

The reason why it "worked" (once the initial CT entry had been cleared) is the following:
- The echo request is DNATted in the dnat zone and SNATted in the snat zone.
- For the reply packet, the unSNAT fails (we try to unSNAT in the common/dnat zone, hitting the rule with loopback == 0).
- The dst of the reply packet remains 172.16.0.102.
- The packet is re-routed to the same router (r1), but this time with the loopback bit set.
- The unSNAT is done in the correct zone (hitting the rule with flags.loopback == 1).
- It does not hit the unDNAT rule, as the outport is wrong:

table=1 (lr_out_undnat ), priority=120 , match=(ip4 && ((ip4.src == 172.16.0.102)) && outport == "r1_public" && is_chassis_resident("cr-r1_public")), action=(ct_dnat_in_czone;)

- But it hits:

table=5 (lr_in_defrag ), priority=50 , match=(icmp || icmp6), action=(ct_dnat;)
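The two flows quoted above can be written out as predicates to make the accidental fix-up visible (a sketch, not OVN code; the literals come straight from the quoted matches):

```python
# The priority-120 lr_out_undnat flow only un-DNATs when the reply
# leaves via the distributed gateway port on the resident chassis...
def undnat_applies(src: str, outport: str, chassis_resident: bool) -> bool:
    return src == "172.16.0.102" and outport == "r1_public" and chassis_resident

# ...while the priority-50 lr_in_defrag flow applies ct_dnat to any
# ICMP packet unconditionally, which is the only reason the reply got
# un-DNATted at all in the "worked" window between the two commits.
def defrag_ct_dnat_applies(proto: str) -> bool:
    return proto in ("icmp", "icmp6")
```

For the reply in this scenario the outport is not "r1_public", so only the defrag flow fires.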
To sum up what the solution should look like: add a config knob that lets the user choose between always using separate zones for SNAT and DNAT, or using the common zone when possible. The reason for the knob is that the common zone was needed for HWOL, and we should still allow that behavior. Also, the default behavior should be the correct one, i.e. separate zones, while users that need HWOL can switch back to the "old" behavior.
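The knob logic boils down to a one-line decision (a sketch only; "use_common_zone" is a hypothetical stand-in for whatever NB option the posted patch actually introduces, so check ovn-nb(5) for the real name; the two action names are the ones from the flows quoted earlier):

```python
# Default is the correct behavior (separate zones); the common zone is
# only used when the user explicitly opts back in for HWOL.
def snat_ct_action(use_common_zone: bool = False) -> str:
    # use_common_zone: hypothetical config knob, not a confirmed option name
    return "ct_snat_in_czone" if use_common_zone else "ct_snat"
```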
Patch posted: https://patchwork.ozlabs.org/project/ovn/patch/20230210092049.603012-1-amusil@redhat.com/
Accepted patchset is https://patchwork.ozlabs.org/project/ovn/list/?series=350439&archive=both&state=*
ovn23.06 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2203012 ovn23.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2203013
*** Bug 2203012 has been marked as a duplicate of this bug. ***
use the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=2203013#c3.

reproduced on ovn23.03-23.03.0-101.el8:

[root@kvm-02-guest29 bz2161281]# rpm -qa | grep -E "ovn23.03|openvswitch3.1"
openvswitch3.1-3.1.0-70.el8fdp.x86_64
ovn23.03-central-23.03.0-101.el8fdp.x86_64
ovn23.03-23.03.0-101.el8fdp.x86_64
ovn23.03-host-23.03.0-101.el8fdp.x86_64

[root@kvm-02-guest29 bz2161281]# ip netns exec vm1 ping 30.0.0.1 -c 1 -w 2
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=36.3 ms

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 36.344/36.344/36.344/0.000 ms

[root@kvm-02-guest29 ~]# ip netns exec vm1 tcpdump -i vm1 -nnle -v
dropped privs to tcpdump
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
22:33:26.769369 00:de:ad:01:00:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 173.0.1.1 tell 173.0.1.2, length 28
22:33:26.769631 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 173.0.1.1 is-at 00:de:ad:fe:00:01, length 28
22:33:26.769638 00:de:ad:01:00:01 > 00:de:ad:fe:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 33918, offset 0, flags [DF], proto ICMP (1), length 84)
    173.0.1.2 > 30.0.0.1: ICMP echo request, id 18110, seq 1, length 64
22:33:26.805694 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 31610, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.0.102 > 173.0.1.2: ICMP echo reply, id 18110, seq 1, length 64    <=== src ip is not un-dnated

Verified on ovn23.06-23.06.1-60.el8:

[root@kvm-02-guest29 bz2161281]# ip netns exec vm1 ping 30.0.0.1 -c 1 -w 2
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.
64 bytes from 30.0.0.1: icmp_seq=1 ttl=62 time=31.5 ms

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 31.542/31.542/31.542/0.000 ms

[root@kvm-02-guest29 bz2161281]# rpm -qa | grep -E "ovn23.06|openvswitch3.1"
openvswitch3.1-3.1.0-70.el8fdp.x86_64
ovn23.06-23.06.1-60.el8fdp.x86_64
ovn23.06-central-23.06.1-60.el8fdp.x86_64
ovn23.06-host-23.06.1-60.el8fdp.x86_64

[root@kvm-02-guest29 ~]# ip netns exec vm1 tcpdump -i vm1 -nnle -v not ip6
dropped privs to tcpdump
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
22:36:56.063825 00:de:ad:01:00:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 173.0.1.1 tell 173.0.1.2, length 28
22:36:56.064642 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 173.0.1.1 is-at 00:de:ad:fe:00:01, length 28
22:36:56.064651 00:de:ad:01:00:01 > 00:de:ad:fe:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 36021, offset 0, flags [DF], proto ICMP (1), length 84)
    173.0.1.2 > 30.0.0.1: ICMP echo request, id 19434, seq 1, length 64
22:36:56.095345 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 7443, offset 0, flags [none], proto ICMP (1), length 84)
    30.0.0.1 > 173.0.1.2: ICMP echo reply, id 19434, seq 1, length 64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn23.06 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:0388