Description of problem:
We have an OCP environment with a VIP that moves between nodes when failover happens. This VIP is associated with a FIP, and the environment uses DVR. In a busy environment, it may take some time for ovn-controller to recompute the OpenFlow flows when the virtual parent changes. During that window, about 17 seconds, the traffic becomes centralized and goes through the gateway chassis. When there is an established TCP connection, the connection starts to send out packets with the source MAC of the FIP from the gateway chassis, so the fabric learns that the MAC is on the gateway chassis, and there is a race between the GARPs and the TCP traffic. As a result, the switch connected to the gateway node learns that the FIP is there and not on the compute node hosting the instance.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-104.el8fdp.x86_64

How reproducible:
Always in a busy environment

Steps to Reproduce:
1. Establish a TCP connection to the FIP
2. Run tcpdump on the gateway node, filtering on the source MAC of the FIP
3. Fail over the VIP associated with the FIP

Actual results:
Traffic goes through the gateway node

Expected results:
Traffic is always distributed and moves to the new node once everything is set up

Additional info:
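Step 2 of the reproducer could be run along these lines. This is only a sketch: the helper name fip_filter is hypothetical, and the interface (p1p2) and MAC (10:54:00:00:00:10) in the commented example are taken from the test script later in this bug, not from the affected environment.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter that matches frames whose
# Ethernet source address is the FIP MAC passed as the first argument.
fip_filter() {
    printf 'ether src %s' "$1"
}

# Example invocation on the gateway node (needs root). The interface and
# MAC are assumptions; substitute the values from your own deployment.
# tcpdump -i p1p2 -n -e "$(fip_filter 10:54:00:00:00:10)"
```

If frames with the FIP MAC show up in this capture during the failover, the traffic is being centralized through the gateway node.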
Just to emphasise the outcome: the FIP becomes unreachable for some time, until the switches learn the correct port for the MAC. Is it possible that some flows are removed when OVN claims the virtual port and the virtual parents are updated, so that the traffic becomes centralized by mistake because of the way flows are matched?
http://patchwork.ozlabs.org/project/ovn/patch/44d35e75854001f0eb2ad4127c2a26b1f6a6b8f8.1624362145.git.lorenzo.bianconi@redhat.com/
Tested with the following script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.170.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.170.25
systemctl restart ovn-controller
ovs-vsctl add-br br-public
ovs-vsctl set open . external-ids:ovn-bridge-mappings=public:br-public
ovs-vsctl add-port br-public p1p2
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-vir
ovn-nbctl lsp-set-addresses sw0-vir "50:54:00:00:00:10 10.0.0.10"
ovn-nbctl lsp-set-port-security sw0-vir "50:54:00:00:00:10 10.0.0.10"
ovn-nbctl lsp-set-type sw0-vir virtual
ovn-nbctl set logical_switch_port sw0-vir options:virtual-ip=10.0.0.10
ovn-nbctl set logical_switch_port sw0-vir options:virtual-parents=sw0-p1,sw0-p2
ovn-nbctl lsp-add sw0 sw0-p1
ovn-nbctl lsp-set-addresses sw0-p1 "50:54:00:00:00:03 10.0.0.3"
ovn-nbctl lsp-add sw0 sw0-p2
ovn-nbctl lsp-set-addresses sw0-p2 "50:54:00:00:00:04 10.0.0.4"
ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-sw0 00:00:00:00:ff:01 10.0.0.1/24
ovn-nbctl lsp-add sw0 sw0-lr0
ovn-nbctl lsp-set-type sw0-lr0 router
ovn-nbctl lsp-set-addresses sw0-lr0 00:00:00:00:ff:01
ovn-nbctl lsp-set-options sw0-lr0 router-port=lr0-sw0
ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public
ovn-nbctl --wait=hv lrp-set-gateway-chassis lr0-public hv1 20
ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.168.0.50 10.0.0.10 sw0-vir 10:54:00:00:00:10
ovn-sbctl list port_binding sw0-vir
ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect

on ovn2.13-20.12.0-149.el7:

[root@wsfd-advnetlab16 bz1952961]# ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect ), priority=100 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect ), priority=50 , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect ), priority=0 , match=(1), action=(next;)

on ovn2.13-20.12.0-173.el7:

[root@wsfd-advnetlab16 bz1952961]# ovn-sbctl lflow-list lr0 | grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect ), priority=100 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect ), priority=80 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)   <=== one drop flow is added
  table=17(lr_in_gw_redirect ), priority=50 , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect ), priority=0 , match=(1), action=(next;)

We can verify that the drop rule is added in the latest OVN version, but we can't reproduce the initial issue described in the Description.

jlibosva, could you help to test with ovn2.13-20.12.0-173.el7, located at http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/173.el7fdp/? Thanks.
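The comparison above comes down to whether a drop flow exists in the lr_in_gw_redirect stage. A minimal sketch of scripting that check follows; the function name has_gw_redirect_drop is hypothetical, and the check only looks for the stage name and a drop action on the same line, so it does not validate the priority or the match expression.

```shell
#!/bin/sh
# Hypothetical helper: read "ovn-sbctl lflow-list" output on stdin and
# succeed (exit 0) only if some lr_in_gw_redirect flow has a drop action.
has_gw_redirect_drop() {
    awk '/lr_in_gw_redirect/ && /action=\(drop;\)/ { found = 1 }
         END { exit !found }'
}

# Usage on a live system (router name lr0 as in the test script):
# ovn-sbctl lflow-list lr0 | has_gw_redirect_drop && echo "drop flow present"
```

With the output shown above, this would fail on the -149 build and succeed on the -173 build.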
also verified on ovn2.13-20.12.0-173.el8:

+ ovn-sbctl list port_binding sw0-vir
_uuid               : 417656bd-8669-411c-8e2f-36a61d431e27
chassis             : []
datapath            : 6e8294d3-1693-4267-8d71-851ada3eba52
encap               : []
external_ids        : {}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : sw0-vir
mac                 : ["50:54:00:00:00:10 10.0.0.10"]
nat_addresses       : []
options             : {virtual-ip="10.0.0.10", virtual-parents="sw0-p1,sw0-p2"}
parent_port         : []
tag                 : []
tunnel_key          : 1
type                : virtual
up                  : false
virtual_parent      : []
+ ovn-sbctl lflow-list lr0
+ grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect ), priority=100 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect ), priority=80 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)
  table=17(lr_in_gw_redirect ), priority=50 , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect ), priority=0 , match=(1), action=(next;)

[root@dell-per740-12 bz1952961]# rpm -qa | grep ovn2.13
ovn2.13-20.12.0-173.el8fdp.x86_64
ovn2.13-host-20.12.0-173.el8fdp.x86_64
ovn2.13-central-20.12.0-173.el8fdp.x86_64
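In the port_binding record above, both chassis and virtual_parent are empty, meaning the virtual port has not been claimed by any chassis in this test. A small sketch for pulling a single field out of that listing is below; the helper name binding_field is hypothetical, and it assumes the "key : value" layout shown above with no ": " sequence inside the value.

```shell
#!/bin/sh
# Hypothetical helper: read "ovn-sbctl list port_binding <port>" output on
# stdin and print the value of the column named by the first argument.
binding_field() {
    awk -F' *: ' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }   # tolerate leading indentation
        $1 == key { print $2 }'
}

# Usage on a live system (port name sw0-vir as in the test script):
# ovn-sbctl list port_binding sw0-vir | binding_field virtual_parent
```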
also verified on ovn-2021-21.06.0-18.el8:

+ ovn-sbctl list port_binding sw0-vir
_uuid               : e0a99b66-2f9f-4bb5-b3af-9d9e0d8ede3a
chassis             : []
datapath            : 96f6365d-87ed-4167-9f56-7dde99e82d37
encap               : []
external_ids        : {}
gateway_chassis     : []
ha_chassis_group    : []
logical_port        : sw0-vir
mac                 : ["50:54:00:00:00:10 10.0.0.10"]
nat_addresses       : []
options             : {virtual-ip="10.0.0.10", virtual-parents="sw0-p1,sw0-p2"}
parent_port         : []
tag                 : []
tunnel_key          : 1
type                : virtual
up                  : false
virtual_parent      : []
+ ovn-sbctl lflow-list lr0
+ grep lr_in_gw_redirect
  table=17(lr_in_gw_redirect ), priority=100 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public" && is_chassis_resident("sw0-vir")), action=(eth.src = 10:54:00:00:00:10; reg1 = 172.168.0.50; next;)
  table=17(lr_in_gw_redirect ), priority=80 , match=(ip4.src == 10.0.0.10 && outport == "lr0-public"), action=(drop;)
  table=17(lr_in_gw_redirect ), priority=50 , match=(outport == "lr0-public"), action=(outport = "cr-lr0-public"; next;)
  table=17(lr_in_gw_redirect ), priority=0 , match=(1), action=(next;)

[root@dell-per740-12 bz1952961]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
ovn-2021-21.06.0-18.el8fdp.x86_64
openvswitch2.15-2.15.0-35.el8fdp.x86_64
ovn-2021-central-21.06.0-18.el8fdp.x86_64
ovn-2021-host-21.06.0-18.el8fdp.x86_64
(In reply to Jianlin Shi from comment #8)
> We can verify that the drop rule is added in the latest ovn version. but we
> can't reproduce the initial issue described in the Description.
> jlibosva, could you help to test with ovn2.13-20.12.0-173.el7
> located at
> http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/173.el7fdp/?
> thanks

I will clone this BZ to OpenStack and we will verify it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:9044
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days