Bug 1952961 - [ovn] dnat_snat traffic becomes centralized during VIP failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: FDP 21.B
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
: ---
Assignee: lorenzo bianconi
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks: 2035079 2083527
 
Reported: 2021-04-23 16:20 UTC by Jakub Libosvar
Modified: 2023-09-18 00:26 UTC (History)

Fixed In Version: ovn2.13-20.12.0-150.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2035079 (view as bug list)
Environment:
Last Closed: 2022-12-15 00:21:16 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   FD-1276         0        None      None    None     2021-08-10 20:20:12 UTC
Red Hat Product Errata  RHBA-2022:9044  0        None      None    None     2022-12-15 00:21:52 UTC

Description Jakub Libosvar 2021-04-23 16:20:06 UTC
Description of problem:
We have an OCP environment with a VIP that moves between nodes when a failover happens. The VIP is associated with a FIP and the environment uses DVR. In a busy environment, it can take some time for ovn-controller to recompute the OpenFlow flows after the virtual parent changes. During that window, about 17 seconds, the traffic becomes centralized and goes through the gateway chassis.

When there is an established TCP connection, its packets start leaving the gateway chassis with the source MAC of the FIP, so the fabric learns that MAC on the gateway chassis, and there is a race between the GARPs and the TCP traffic.

As a result, the switch connected to the gateway node learns that the FIP is there and not on the compute node hosting the instance.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-104.el8fdp.x86_64

How reproducible:
Always, in a busy environment

Steps to Reproduce:
1. Establish TCP connection to the FIP
2. Run tcpdump with source mac of FIP on the gateway node
3. Do failover of VIP associated with the FIP
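
Step 2 could be run roughly as sketched below. The helper name fip_capture_cmd is hypothetical, and the example interface (p1p2) and FIP MAC (10:54:00:00:00:10) are borrowed from the test script in comment 8, so they are assumptions for any other environment. The helper only prints the tcpdump command so it can be reviewed before being run on the gateway node.

```shell
# Hypothetical helper: print a tcpdump command that captures frames whose
# source MAC is the FIP MAC on the given interface.
# $1: interface facing the provider network (assumption: p1p2 in comment 8)
# $2: MAC address of the FIP
fip_capture_cmd() {
    iface=$1
    fip_mac=$2
    # -n: no name resolution, -e: print link-level (MAC) headers
    printf 'tcpdump -n -e -i %s ether src %s\n' "$iface" "$fip_mac"
}

fip_capture_cmd p1p2 10:54:00:00:00:10
```

If frames show up in this capture during the failover, the traffic is being sent from the gateway node, which is the centralized behaviour reported here.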


Actual results:
Traffic goes through the gateway node

Expected results:
Traffic always stays distributed and moves to the new node once the failover has completed

Additional info:

Comment 1 Jakub Libosvar 2021-04-26 16:08:23 UTC
Just to emphasise the outcome - the FIP becomes unreachable for some time, until the switches learn the right port for the MAC. Is it possible that some flows are removed when OVN claims the virtual port and the virtual parents are updated, so that the traffic becomes centralized by mistake because of the way the flows are matched?

Comment 8 Jianlin Shi 2021-08-16 04:26:09 UTC
Tested with the following script:

systemctl start openvswitch                                                                           
systemctl start ovn-northd                                                                            
ovn-nbctl set-connection ptcp:6641                                                                    
ovn-sbctl set-connection ptcp:6642                                                                    
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.170.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.170.25
systemctl restart ovn-controller

ovs-vsctl add-br br-public
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-public
ovs-vsctl add-port br-public p1p2

ovn-nbctl ls-add sw0

ovn-nbctl lsp-add sw0 sw0-vir
ovn-nbctl lsp-set-addresses sw0-vir "50:54:00:00:00:10 10.0.0.10"                               
ovn-nbctl lsp-set-port-security sw0-vir "50:54:00:00:00:10 10.0.0.10"                           
ovn-nbctl lsp-set-type sw0-vir virtual                                                          
ovn-nbctl set logical_switch_port sw0-vir options:virtual-ip=10.0.0.10                          
ovn-nbctl set logical_switch_port sw0-vir options:virtual-parents=sw0-p1,sw0-p2

ovn-nbctl lsp-add sw0 sw0-p1
ovn-nbctl lsp-set-addresses sw0-p1 "50:54:00:00:00:03 10.0.0.3"

ovn-nbctl lsp-add sw0 sw0-p2
ovn-nbctl lsp-set-addresses sw0-p2 "50:54:00:00:00:04 10.0.0.4"

ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-sw0 00:00:00:00:ff:01 10.0.0.1/24
ovn-nbctl lsp-add sw0 sw0-lr0
ovn-nbctl lsp-set-type sw0-lr0 router
ovn-nbctl lsp-set-addresses sw0-lr0 00:00:00:00:ff:01
ovn-nbctl lsp-set-options sw0-lr0 router-port=lr0-sw0

ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public

ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public


ovn-nbctl --wait=hv lrp-set-gateway-chassis lr0-public hv1 20

ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.168.0.50 10.0.0.10 sw0-vir 10:54:00:00:00:10

ovn-sbctl list port_binding sw0-vir
ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect

on ovn2.13-20.12.0-149.el7:

[root@wsfd-advnetlab16 bz1952961]# ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect                                                                              
  table=17(lr_in_gw_redirect  ), priority=100  , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect  ), priority=50   , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect  ), priority=0    , match=(1), action=(next;) 

on ovn2.13-20.12.0-173.el7:

[root@wsfd-advnetlab16 bz1952961]# ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect  ), priority=100  , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect  ), priority=80   , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)

<=== one drop flow is added

  table=17(lr_in_gw_redirect  ), priority=50   , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect  ), priority=0    , match=(1), action=(next;)


We can verify that the drop rule is added in the latest OVN version, but we can't reproduce the initial issue described in the Description. jlibosva, could you help test with ovn2.13-20.12.0-173.el7, located at http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/173.el7fdp/? Thanks.
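
As a reading aid, the effect of the extra flow can be modelled with a small shell function. This is only a simplified sketch of how the priorities in the lr_in_gw_redirect listings above resolve for a packet whose outport is lr0-public. It is not how ovn-northd or ovn-controller are implemented, and the function name and its yes/no arguments are hypothetical.

```shell
# Simplified model of table lr_in_gw_redirect for a packet leaving via lr0-public.
# $1: source IP of the packet
# $2: "yes" if is_chassis_resident("sw0-vir") holds on this chassis
# $3: "yes" if the priority-80 drop flow is present (as in the -173 listing)
gw_redirect_action() {
    src=$1
    resident=$2
    has_drop=$3
    if [ "$src" = "10.0.0.10" ] && [ "$resident" = "yes" ]; then
        echo "distributed"   # priority 100: eth.src set to the NAT MAC, handled locally
    elif [ "$src" = "10.0.0.10" ] && [ "$has_drop" = "yes" ]; then
        echo "drop"          # priority 80: dropped instead of falling through
    else
        echo "centralized"   # priority 50: outport = "cr-lr0-public"
    fi
}

gw_redirect_action 10.0.0.10 no no    # -149 listing: falls through to the redirect
gw_redirect_action 10.0.0.10 no yes   # -173 listing: dropped
```

In this model, traffic from 10.0.0.10 on a chassis where sw0-vir is not resident is redirected to the gateway port in the -149 listing and dropped in the -173 listing.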

Comment 9 Jianlin Shi 2021-08-19 01:25:36 UTC
Also verified on ovn2.13-20.12.0-173.el8:

+ ovn-sbctl list port_binding sw0-vir
_uuid               : 417656bd-8669-411c-8e2f-36a61d431e27                                            
chassis             : []                                                                              
datapath            : 6e8294d3-1693-4267-8d71-851ada3eba52                                            
encap               : []                                                                              
external_ids        : {}
gateway_chassis     : []
ha_chassis_group    : []                                                                              
logical_port        : sw0-vir
mac                 : ["50:54:00:00:00:10 10.0.0.10"]                                                 
nat_addresses       : []                                                                              
options             : {virtual-ip="10.0.0.10", virtual-parents="sw0-p1,sw0-p2"}                       
parent_port         : []
tag                 : []
tunnel_key          : 1
type                : virtual                                                                         
up                  : false                                                                           
virtual_parent      : []
+ ovn-sbctl lflow-list lr0
+ grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect  ), priority=100  , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect  ), priority=80   , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)
  table=17(lr_in_gw_redirect  ), priority=50   , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect  ), priority=0    , match=(1), action=(next;)
[root@dell-per740-12 bz1952961]# rpm -qa | grep ovn2.13
ovn2.13-20.12.0-173.el8fdp.x86_64
ovn2.13-host-20.12.0-173.el8fdp.x86_64
ovn2.13-central-20.12.0-173.el8fdp.x86_64
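
A quick way to compare an installed build against the "Fixed In Version" field (ovn2.13-20.12.0-150.el8fdp) might look like the sketch below. The helper name is hypothetical, and it assumes that release numbers within the ovn2.13 20.12.0 stream are directly comparable. It only handles the main ovn2.13 package name and returns 2 for anything else.

```shell
# Hypothetical helper: succeed (0) if an ovn2.13 20.12.0 build has release >= 150,
# fail (1) if it is older, and return 2 if the string is not an ovn2.13-20.12.0 build.
# Example input: ovn2.13-20.12.0-173.el8fdp.x86_64
is_fixed_build() {
    nvr=$1
    case $nvr in
        ovn2.13-20.12.0-*) ;;
        *) return 2 ;;                  # other package or upstream version: not handled
    esac
    release=${nvr#ovn2.13-20.12.0-}     # e.g. "173.el8fdp.x86_64"
    release=${release%%.*}              # e.g. "173"
    case $release in
        ''|*[!0-9]*) return 2 ;;        # release is not a plain number
    esac
    [ "$release" -ge 150 ]
}

if is_fixed_build ovn2.13-20.12.0-173.el8fdp.x86_64; then
    echo "build is at or past the fixed release"
fi
```

The build from the original report, ovn2.13-20.12.0-104.el8fdp.x86_64, fails this check, while the -173 build verified here passes it.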

Comment 10 Jianlin Shi 2021-08-19 01:28:27 UTC
Also verified on ovn-2021-21.06.0-18.el8:

+ ovn-sbctl list port_binding sw0-vir
_uuid               : e0a99b66-2f9f-4bb5-b3af-9d9e0d8ede3a
chassis             : []
datapath            : 96f6365d-87ed-4167-9f56-7dde99e82d37
encap               : []
external_ids        : {}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : sw0-vir
mac                 : ["50:54:00:00:00:10 10.0.0.10"]
nat_addresses       : []
options             : {virtual-ip="10.0.0.10", virtual-parents="sw0-p1,sw0-p2"}
parent_port         : []
tag                 : []
tunnel_key          : 1
type                : virtual
up                  : false
virtual_parent      : []
+ ovn-sbctl lflow-list lr0
+ grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect  ), priority=100  , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect  ), priority=80   , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)
  table=17(lr_in_gw_redirect  ), priority=50   , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect  ), priority=0    , match=(1), action=(next;)
[root@dell-per740-12 bz1952961]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
ovn-2021-21.06.0-18.el8fdp.x86_64
openvswitch2.15-2.15.0-35.el8fdp.x86_64
ovn-2021-central-21.06.0-18.el8fdp.x86_64
ovn-2021-host-21.06.0-18.el8fdp.x86_64
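
The check repeated in comments 8 to 10 can be scripted along these lines. This is a sketch under the assumption that the lflow-list output keeps the format shown above. The helper name has_drop_flow is hypothetical, and it reads the flow listing on standard input rather than calling ovn-sbctl itself.

```shell
# Hypothetical helper: succeed if the input contains a priority-80 drop flow
# in table lr_in_gw_redirect, as seen in the -173 and ovn-2021 listings.
has_drop_flow() {
    awk '/lr_in_gw_redirect/ && /priority=80 / && /action=\(drop;\)/ { found = 1 }
         END { exit !found }'
}

# Usage on a system running OVN (lr0 is the router name from the test script):
#   ovn-sbctl lflow-list lr0 | has_drop_flow && echo "drop flow present"
```

Reading from standard input keeps the helper usable on saved output as well as on a live southbound database.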

Comment 11 Jakub Libosvar 2021-12-22 20:18:47 UTC
(In reply to Jianlin Shi from comment #8)
> 
> We can verify that the drop rule is added in the latest ovn version. but we
> can't reproduce the initial issue described in the Description.
> jlibosva, could you help to test with ovn2.13-20.12.0-173.el7
> located at
> http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/
> 173.el7fdp/? thanks

I will clone this BZ to OpenStack and we will verify it.

Comment 16 errata-xmlrpc 2022-12-15 00:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:9044

Comment 17 Red Hat Bugzilla 2023-09-18 00:26:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

