Bug 2172048

Summary: lsp failed to ping its own floating IP with special configuration
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn23.03
Version: FDP 23.B
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Jianlin Shi <jishi>
Assignee: Ales Musil <amusil>
QA Contact: Jianlin Shi <jishi>
CC: amusil, ctrautma, jiji, mmichels
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovn23.03-23.03.0-11.el8fdp
Last Closed: 2023-05-24 07:02:57 UTC
Type: Bug

Description Jianlin Shi 2023-02-21 10:03:41 UTC
Description of problem:
lsp failed to ping its own floating IP under special configuration

Version-Release number of selected component (if applicable):
ovn22.09-22.09.0-59.el8

How reproducible:
Always

Steps to Reproduce:
1. setup on server
systemctl start openvswitch                          
systemctl start ovn-northd                                                                            
ovn-nbctl set-connection ptcp:6641                                                                    
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.203.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.203.25
systemctl restart ovn-controller
systemctl restart openvswitch

ovs-vsctl add-br br-ext
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phynet:br-ext
ovs-vsctl add-port br-ext ens1f1
ip link set ens1f1 up

ovn-nbctl lr-add lr1
ovn-nbctl lrp-add lr1 lr1-ls1 00:00:01:ff:02:03 192.168.1.254/24

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:01:01:01:01 192.168.1.1"

ovn-nbctl lsp-add ls1 ls1p2
ovn-nbctl lsp-add ls1 ls1p2.11 ls1p2 11
ovn-nbctl lsp-set-addresses ls1p2.11 "00:00:01:01:01:11 192.168.1.11"

ovn-nbctl lsp-add ls1 ls1-lr1
ovn-nbctl lsp-set-type ls1-lr1 router
ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
ovn-nbctl lsp-set-addresses ls1-lr1 router

ovn-nbctl ls-add pub
ovn-nbctl lrp-add lr1 lr1-pub 00:00:01:ff:01:03 172.16.1.1/24

ovn-nbctl lsp-add pub pub-lr1
ovn-nbctl lsp-set-type pub-lr1 router
ovn-nbctl lsp-set-addresses pub-lr1 router
ovn-nbctl lsp-set-options pub-lr1 router-port=lr1-pub

ovn-nbctl lsp-add pub pub-ln
ovn-nbctl lsp-set-type pub-ln localnet
ovn-nbctl lsp-set-addresses pub-ln unknown
ovn-nbctl lsp-set-options pub-ln network_name=phynet

ovn-nbctl lb-add lb_r1_tcp 172.16.1.101:50001 192.168.1.11:50001,192.168.1.1:50001 tcp
ovn-nbctl lr-lb-add lr1 lb_r1_tcp

ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:01:01:01:01
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
ip netns exec ls1p1 ip route add default via 192.168.1.254


ovs-vsctl add-port br-int ls1p2 -- set interface ls1p2 type=internal external_ids:iface-id=ls1p2
ip link add link ls1p2 name ls1p2.11 type vlan id 11
ip link set ls1p2 up

ip netns add ls1p2.11
ip link set ls1p2.11 netns ls1p2.11
ip netns exec ls1p2.11 ip link set ls1p2.11 address 00:00:01:01:01:11
ip netns exec ls1p2.11 ip link set ls1p2.11 up
ip netns exec ls1p2.11 ip addr add 192.168.1.11/24 dev ls1p2.11
ip netns exec ls1p2.11 ip route add default via 192.168.1.254

2. setup on client

systemctl start openvswitch                          
systemctl start ovn-northd                                                                            
ovn-nbctl set-connection ptcp:6641                                                                    
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv0 external_ids:ovn-remote=tcp:20.0.203.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.203.26
systemctl restart ovn-controller
systemctl restart openvswitch

ovs-vsctl add-br br-ext
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phynet:br-ext
ovs-vsctl add-port br-ext ens1f1
ip link set ens1f1 up


3. set gateway chassis on server

client1=wsfd-advnetlab17.anl.lab.eng.bos.redhat.com                                                   
#hv1_system_id=$(ovn-sbctl find chassis hostname=$(hostname) | awk '/^name/{print $3}' | sed 's/"//g')
hv0_system_id=$(ovn-sbctl find chassis hostname=$client1 | awk '/^name/{print $3}' | sed 's/"//g')    
ovn-nbctl ha-chassis-group-add hagrp1                                                                 
ovn-nbctl ha-chassis-group-add-chassis hagrp1 $hv0_system_id 100                                                                           
group1_id=$(ovn-nbctl get ha_chassis_group hagrp1 _uuid)                                              
                                                                                                      
ovn-nbctl set logical_router_port lr1-pub ha_chassis_group=$group1_id                                 
                                                                                                      
ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.11 192.168.1.1 ls1p1 00:00:00:0a:0a:11                
ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.21 192.168.1.11 ls1p2.11 00:00:00:0a:0a:21

4. ip netns exec ls1p1 ping 172.16.1.11 -c 1

Actual results:
The ping fails: no echo reply is received.

Expected results:
The ping succeeds.

Additional info:

[root@wsfd-advnetlab16 test]# rpm -qa | grep -E "openvswitch2.17|ovn22.09"                            
openvswitch2.17-2.17.0-74.el8fdp.x86_64                                                               
ovn22.09-host-22.09.0-59.el8fdp.x86_64                                                                
ovn22.09-22.09.0-59.el8fdp.x86_64                                                                     
ovn22.09-central-22.09.0-59.el8fdp.x86_64                                                             
python3-openvswitch2.17-2.17.0-74.el8fdp.x86_64

[root@wsfd-advnetlab16 scenario]# ip netns exec ls1p1 tcpdump -i ls1p1 -nnle -v                       
dropped privs to tcpdump                                                                              
tcpdump: listening on ls1p1, link-type EN10MB (Ethernet), capture size 262144 bytes                   
05:00:07.340215 00:00:01:01:01:01 > 00:00:01:ff:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 2865, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.1.1 > 172.16.1.11: ICMP echo request, id 32760, seq 1, length 64                          
05:00:07.343209 00:00:01:ff:02:03 > 00:00:01:01:01:01, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 2865, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.1.11 > 192.168.1.1: ICMP echo request, id 32760, seq 1, length 64                          
05:00:07.343272 00:00:01:01:01:01 > 00:00:01:ff:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 2866, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.1 > 172.16.1.11: ICMP echo reply, id 32760, seq 1, length 64
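
The capture above shows the echo request leaving ls1p1, the same request coming back from the router with the floating IP as its source, and ls1p1 sending a reply that is never answered. A minimal sketch for counting such packets in saved tcpdump text, assuming the capture was taken with -v as above (so each IP line is an indented continuation line); the function name count_icmp is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count ICMP echo packets of a given kind
# ("request" or "reply") sent by src_ip, reading tcpdump -v text
# output from stdin. It relies on the indented
# "SRC > DST: ICMP echo KIND, ..." continuation lines.
count_icmp() {
    kind=$1
    src=$2
    awk -v kind="$kind" -v src="$src" '
        $1 == src && $2 == ">" && $4 == "ICMP" && $5 == "echo" && $6 == (kind ",") { n++ }
        END { print n + 0 }
    '
}
```

For example, `count_icmp reply 172.16.1.11 < capture.txt` prints 0 for the capture above, while `count_icmp reply 192.168.1.1 < capture.txt` prints 1: the reply leaves the port but nothing comes back from the floating IP.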

If I remove the LB, the ping passes.
If hagrp1 is set to hv1, the ping also passes.
TCP traffic passes.

The issue did not exist on ovn22.09-22.09.0-47.el8.

Comment 1 Mark Michelson 2023-02-28 14:47:44 UTC
List of commits between ovn22.09-22.09.0-47 and ovn22.09-22.09.0-59

6b7d2d835 actions: Add new action called ct_commit_nat
ec4474b0f northd: Allow related traffic through LB
17134a2f5 northd: Add logical flow to defrag ICMP traffic
33020b776 northd: Store skip_snat and force_snat in ct_label/mark
545989afd northd: Add flag for CT related
26bb3fa8f northd: Add logical flows to allow rpl/rel traffic in acl_after_lb stage.
3a9aabb09 ovn-trace: Use the original ovnact for execute_load
3d9a8ca40 Add the metalLB install flag for CI actions
d0c83ca47 ovn-macros: support ipv6 in ovn_attach
f90f3961b pinctrl: Send RARPs for external ipv6 interfaces
c538e6bc3 tests: Fix flaky test "IPv6 Neighbor Solicitation for unknown MAC"
6ba977581 northd.c: Validate port type to avoid unexpected behavior.

Of those, the bottom 6 should have nothing to do with this regression. It is likely one of the top 6 that resulted in this issue. This should make it relatively easy for an OVN dev to bisect and find the offending commit.
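
A bisect like this could be driven by `git bisect run`, which treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad. Below is a minimal sketch of the verdict step only. It assumes the OVN build for the commit under test has already been installed and the services restarted (that part is omitted), and that the ls1p1 namespace from the reproduction steps exists. The PING_CMD override and the function name bisect_verdict are hypothetical.

```shell
#!/bin/sh
# Hypothetical verdict helper for use from a `git bisect run` script.
# PING_CMD can be overridden; by default it is the failing ping from
# step 4 of the reproduction steps.
PING_CMD=${PING_CMD:-"ip netns exec ls1p1 ping 172.16.1.11 -c 1"}

bisect_verdict() {
    # $PING_CMD is deliberately unquoted so it splits into words.
    if $PING_CMD >/dev/null 2>&1; then
        return 0    # ping passed: commit is good
    fi
    return 1        # ping failed: commit is bad
}
```

A wrapper script would end with `bisect_verdict; exit $?`, and could return 125 instead when the build itself fails, so that git skips that commit.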

My uneducated guess is that since removing the LB causes the problem to go away, it is likely

17134a2f5 northd: Add logical flow to defrag ICMP traffic

that is causing the problem. This commit causes ICMP traffic to go through a "ct_dnat" action when a load balancer with a port is configured, which may result in "ct.inv" for this packet.

@jishi Can you re-run the test, but add the following line after northd is started:

    ovn-nbctl set NB_Global . options:use_ct_inv_match=false

Check if this causes the test to pass. If it does, then we at least know what is causing the issue to happen. It doesn't mean we can close this bug, but it at least gives us a root cause for the regression. Thanks.

Comment 2 Jianlin Shi 2023-03-01 01:59:57 UTC
(In reply to Mark Michelson from comment #1)
> @jishi Can you re-run the test, but add the following line after
> northd is started:
> 
>     ovn-nbctl set NB_Global . options:use_ct_inv_match=false

The ping still fails after adding the configuration:

[root@dell-per740-67 bz2172048]# ip netns exec ls1p1 ping 172.16.1.11 -c 1                            
PING 172.16.1.11 (172.16.1.11) 56(84) bytes of data.                                                  

--- 172.16.1.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms                                         

[root@dell-per740-67 bz2172048]# ovn-nbctl list nb_global
_uuid               : 6214443d-4e5b-4b1a-8da1-e8557f59161a
connections         : [4e07cafb-4b9c-4d7b-bb9d-a6d06e85f9cf]                                          
external_ids        : {}                                                                              
hv_cfg              : 0                                                                               
hv_cfg_timestamp    : 0
ipsec               : false                                                                           
name                : ""                                                                              
nb_cfg              : 0                                                                               
nb_cfg_timestamp    : 0
options             : {mac_prefix="7e:d8:05", max_tunid="16711680", northd_internal_version="22.09.2-20.25.0-69.4", svc_monitor_mac="26:90:ad:71:93:2f", use_ct_inv_match="false"}
sb_cfg              : 0                                                                               
sb_cfg_timestamp    : 0                                                                               
ssl                 : []
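
The listing above is how the setting was confirmed. As a rough sketch, the same check can be made on saved `ovn-nbctl list nb_global` output by pulling a single key out of the options row. The helper name nb_option is hypothetical, and the parsing assumes the one-line `options : {key="value", ...}` layout shown above with no commas inside values. On a live system, `ovn-nbctl get NB_Global . options:use_ct_inv_match` should give the same information directly.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from the "options"
# row of `ovn-nbctl list nb_global` output read from stdin.
# Prints nothing if the key is absent.
nb_option() {
    key=$1
    # 1. keep only the text between the braces of the options row
    # 2. split the key="value" pairs onto separate lines
    # 3. print the unquoted value of the requested key
    sed -n 's/^options[[:space:]]*: *{\(.*\)}[[:space:]]*$/\1/p' |
        tr ',' '\n' |
        sed -n "s/^ *${key}=\"\(.*\)\"\$/\1/p"
}
```

For the listing above, `nb_option use_ct_inv_match` prints false.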

Comment 3 Ales Musil 2023-03-01 12:45:26 UTC
As Mark pointed out, this is caused by commit 17134a2f5 (northd: Add logical flow to defrag ICMP traffic).

We need to limit this flow to the LB VIP; I'll create a fix for that.

Comment 4 OVN Bot 2023-03-28 04:06:54 UTC
ovn23.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2182255
ovn23.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2182256

Comment 5 Ales Musil 2023-05-24 07:02:57 UTC
This is fixed in 23.03 and later.