This issue is highlighted by the system test "SNAT in separate zone from DNAT":

S1 == R1 == public == R2 == S2

There are three issues related to this test case.

1) Different behavior on Ubuntu/Fedora

The test succeeds on GitHub, Ubuntu 20.04 and Fedora 35. It fails on Fedora 34. On all those OSes, the OVN behavior is the same: an echo request is sent to a load balancer, and the echo reply is received with a wrong source address (the packet is not unDNATted and arrives with the LB backend IP). The different behavior across the OSes is due to different ping versions and how ping behaves when it receives an echo reply with a wrong source address: it fails on Fedora 34, but succeeds with a warning on Fedora 35. The test should fail on all OSes, as the unDNAT has not been done properly; this is a test-related issue and it can be fixed by properly checking the conntrack entries.

2) The unDNAT is not done properly for any icmp reply

The unDNAT flows in lr_out_undnat expect (for the return traffic in R1) the outport to be a l3dgw_port. This is not the case in this scenario (where the l3dgw_port is an inport). This causes the unDNAT to never happen.

3) Even if the previous issue is fixed (i.e. the unDNAT flows do get hit), the first icmp reply is still not unDNATted properly. For the first echo request:
- The packet is DNATted (in the dnat zone) in table lr_in_dnat, and "ct_mark=2" is set.
- The packet is sent to the controller (pinctrl) and is resubmitted to OVS in table 37; the ct_mark is lost, and SNAT and DNAT end up trying to use the same zone.
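The consequence of issue 3 can be illustrated with a grossly simplified toy model (this is not OVN or kernel conntrack code; the "one NAT entry per zone, later commit clobbers the earlier one" rule is an assumption standing in for the real conflict between the two translations sharing a zone):

```python
# Toy model: why SNAT and DNAT colliding in one conntrack zone breaks
# un-DNAT of the echo reply. Addresses are taken from the test scenario.
CLIENT, SNAT_IP = "173.0.1.2", "172.16.0.101"
VIP, BACKEND = "30.0.0.1", "172.16.0.102"

def simulate(dnat_zone, snat_zone):
    zones = {}

    def commit(zone, kind, orig, new):
        # Simplification: one NAT entry per connection per zone; a later
        # commit in the same zone clobbers the earlier one.
        zones[zone] = (kind, orig, new)

    # Echo request CLIENT -> VIP through the router:
    commit(dnat_zone, "dnat", VIP, BACKEND)     # lr_in_dnat: dst VIP -> backend
    commit(snat_zone, "snat", CLIENT, SNAT_IP)  # snat: src CLIENT -> SNAT IP

    # Echo reply BACKEND -> SNAT_IP: reverse whatever each zone remembers.
    src, dst = BACKEND, SNAT_IP
    for kind, orig, new in zones.values():
        if kind == "snat" and dst == new:
            dst = orig                          # un-SNAT
        elif kind == "dnat" and src == new:
            src = orig                          # un-DNAT
    return src, dst
```

With separate zones, `simulate(1, 2)` restores the reply to VIP -> CLIENT; with a single shared zone, `simulate(1, 1)` returns the reply with the backend IP as source, which is exactly the observed failure.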
After discussion with Xavier and poking around with the test, there do indeed seem to be two issues.

1) The first issue is with first-packet traffic. Because of the ARP resolution we lose the ct_mark/label, which results in SNAT happening in the common zone. This has the following consequences:
- The original traffic arrives at the destination, and a CT entry for the SNAT is created in the common zone.
- The reply traffic goes through the LR pipeline and hits bug 2 (described below): the unSNAT happens in the common zone, but because of the conflict the unDNAT cannot be done properly (they are in the same zone).
- The traffic arrives with a wrong source address.

2) In the ingress router pipeline, the unSNAT does not have proper state matching to decide whether it should unSNAT in the separate or the common zone.
- There are logical flows that differentiate between the common and the separate SNAT zone:

table=4 (lr_in_unsnat ), priority=100 , match=(ip && ip4.dst == 172.16.0.101 && inport == "r1_public" && flags.loopback == 0 && is_chassis_resident("cr-r1_public")), action=(ct_snat_in_czone;)
table=4 (lr_in_unsnat ), priority=100 , match=(ip && ip4.dst == 172.16.0.101 && inport == "r1_public" && flags.loopback == 1 && flags.use_snat_zone == 1 && is_chassis_resident("cr-r1_public")), action=(ct_snat;)

- In order to do the unSNAT in the separate zone we need flags.loopback == 1 and flags.use_snat_zone == 1; those two conditions are set only if the traffic is local (e.g. hairpin) and is sent back to the same port via "lr_out_egr_loop".
- This also has the consequence that once the MAC binding is learned and the CT entry in the common zone expires, all traffic is dropped because it is not properly unSNATted.

The outcome is that SNAT and LB done on distributed router ports suffer from this issue. One possible fix is to use the separate zone for SNAT every time; however, I'm not sure whether that would have any other consequences.
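The zone choice encoded by the two lr_in_unsnat flows quoted above can be sketched as a single predicate (this is a sketch, not northd code; the string return values just name the two actions):

```python
# The dedicated SNAT zone is chosen only when both flags are set, i.e.
# only for hairpinned traffic that went through lr_out_egr_loop;
# everything else, including replies from external networks, falls back
# to the common zone.
def unsnat_zone(loopback: bool, use_snat_zone: bool) -> str:
    if loopback and use_snat_zone:
        return "ct_snat"            # separate SNAT zone (second flow)
    return "ct_snat_in_czone"       # common zone (first flow)
```

A reply coming back from an external network has flags.loopback == 0, so it is always un-SNATted in the common zone, which is the missing-state-matching problem described above.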
The behavior when the CT entry related to the 1st ping (SNAT in the common zone) has expired has changed a few times recently:
- Before commit "northd: Add logical flow to defrag ICMP traffic", the return packet had a wrong src address (the return packet was not unDNATted).
- Then, before commit "northd: Drop packets destined to router owned NAT IP for DGP", it ... worked (correct reply packet).
- Then, as indicated above, after that commit, there is no reply packet anymore.

The reason why it "worked" (once the initial CT entry had been cleared) is the following:
- The echo request is DNATted in the dnat zone and SNATted in the snat zone.
- For the reply packet, the unSNAT fails (we try to unSNAT in the common/dnat zone, hitting the rule with loopback == 0).
- The dst of the reply packet remains 172.16.0.102.
- The packet is re-routed to the same router (r1), but this time with the loopback bit set.
- The unSNAT is done in the correct zone (hitting the rule with flags.loopback == 1).
- It does not hit the unDNAT rule, as the outport is wrong:

table=1 (lr_out_undnat ), priority=120 , match=(ip4 && ((ip4.src == 172.16.0.102)) && outport == "r1_public" && is_chassis_resident("cr-r1_public")), action=(ct_dnat_in_czone;)

- But it hits:

table=5 (lr_in_defrag ), priority=50 , match=(icmp || icmp6), action=(ct_dnat;)
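The two flows quoted above can be written out as predicates to make the accidental fix-up visible (a sketch, not OVN code; the literals come straight from the quoted matches):

```python
# The priority-120 lr_out_undnat flow only un-DNATs when the reply
# leaves via the distributed gateway port on the resident chassis...
def undnat_applies(src: str, outport: str, chassis_resident: bool) -> bool:
    return src == "172.16.0.102" and outport == "r1_public" and chassis_resident

# ...while the priority-50 lr_in_defrag flow applies ct_dnat to any
# ICMP packet unconditionally, which is the only reason the reply got
# un-DNATted at all in the "worked" window between the two commits.
def defrag_ct_dnat_applies(proto: str) -> bool:
    return proto in ("icmp", "icmp6")
```

For the reply in this scenario the outport is not "r1_public", so only the defrag flow fires.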
To sum up what the solution should look like: add a config knob that lets the user choose between always using separate zones for SNAT and DNAT, or using the common zone when possible. The reason for the knob is that the common zone was needed for HWOL, and we should still allow that behavior. Also, the default behavior should be the correct one, i.e. separate zones, while users that need HWOL can switch back to the "old" behavior.
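The knob logic boils down to a one-line decision (a sketch only; "use_common_zone" is a hypothetical stand-in for whatever NB option the posted patch actually introduces, so check ovn-nb(5) for the real name; the two action names are the ones from the flows quoted earlier):

```python
# Default is the correct behavior (separate zones); the common zone is
# only used when the user explicitly opts back in for HWOL.
def snat_ct_action(use_common_zone: bool = False) -> str:
    # use_common_zone: hypothetical config knob, not a confirmed option name
    return "ct_snat_in_czone" if use_common_zone else "ct_snat"
```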
Patch posted: https://patchwork.ozlabs.org/project/ovn/patch/20230210092049.603012-1-amusil@redhat.com/
Accepted patchset is https://patchwork.ozlabs.org/project/ovn/list/?series=350439&archive=both&state=*
ovn23.06 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2203012 ovn23.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2203013
*** Bug 2203012 has been marked as a duplicate of this bug. ***
use the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=2203013#c3.

reproduced on ovn23.03-23.03.0-101.el8:

[root@kvm-02-guest29 bz2161281]# rpm -qa | grep -E "ovn23.03|openvswitch3.1"
openvswitch3.1-3.1.0-70.el8fdp.x86_64
ovn23.03-central-23.03.0-101.el8fdp.x86_64
ovn23.03-23.03.0-101.el8fdp.x86_64
ovn23.03-host-23.03.0-101.el8fdp.x86_64

[root@kvm-02-guest29 bz2161281]# ip netns exec vm1 ping 30.0.0.1 -c 1 -w 2
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=36.3 ms

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 36.344/36.344/36.344/0.000 ms

[root@kvm-02-guest29 ~]# ip netns exec vm1 tcpdump -i vm1 -nnle -v
dropped privs to tcpdump
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
22:33:26.769369 00:de:ad:01:00:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 173.0.1.1 tell 173.0.1.2, length 28
22:33:26.769631 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 173.0.1.1 is-at 00:de:ad:fe:00:01, length 28
22:33:26.769638 00:de:ad:01:00:01 > 00:de:ad:fe:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 33918, offset 0, flags [DF], proto ICMP (1), length 84)
    173.0.1.2 > 30.0.0.1: ICMP echo request, id 18110, seq 1, length 64
22:33:26.805694 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 31610, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.0.102 > 173.0.1.2: ICMP echo reply, id 18110, seq 1, length 64    <=== src ip is not un-dnated

Verified on ovn23.06-23.06.1-60.el8:

[root@kvm-02-guest29 bz2161281]# ip netns exec vm1 ping 30.0.0.1 -c 1 -w 2
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.
64 bytes from 30.0.0.1: icmp_seq=1 ttl=62 time=31.5 ms

--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 31.542/31.542/31.542/0.000 ms

[root@kvm-02-guest29 bz2161281]# rpm -qa | grep -E "ovn23.06|openvswitch3.1"
openvswitch3.1-3.1.0-70.el8fdp.x86_64
ovn23.06-23.06.1-60.el8fdp.x86_64
ovn23.06-central-23.06.1-60.el8fdp.x86_64
ovn23.06-host-23.06.1-60.el8fdp.x86_64

[root@kvm-02-guest29 ~]# ip netns exec vm1 tcpdump -i vm1 -nnle -v not ip6
dropped privs to tcpdump
tcpdump: listening on vm1, link-type EN10MB (Ethernet), capture size 262144 bytes
22:36:56.063825 00:de:ad:01:00:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 173.0.1.1 tell 173.0.1.2, length 28
22:36:56.064642 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 173.0.1.1 is-at 00:de:ad:fe:00:01, length 28
22:36:56.064651 00:de:ad:01:00:01 > 00:de:ad:fe:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 36021, offset 0, flags [DF], proto ICMP (1), length 84)
    173.0.1.2 > 30.0.0.1: ICMP echo request, id 19434, seq 1, length 64
22:36:56.095345 00:de:ad:fe:00:01 > 00:de:ad:01:00:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 7443, offset 0, flags [none], proto ICMP (1), length 84)
    30.0.0.1 > 173.0.1.2: ICMP echo reply, id 19434, seq 1, length 64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn23.06 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:0388