When using ml2/ovn on a test topology [0], a router with an external gateway port gets created shortly after the cluster is deployed. Since BFD has not had enough time to stabilize, we observe that the OVN controllers fight over the LRP at a pretty high rate. In some tests, we see the brawl flip the LRP ~100 times during the few seconds it takes for BFD to indirectly settle the dispute.

While that lasts just a few seconds, there is no dampening of ovsdb notifications, so the monitoring in ml2/ovn queues up all of these notifications and processes them one by one, racking up a lot of updates to the LSP and LRP rows that were affected. Dealing with that takes a whole lot longer (~1.5 minutes). So the ripple effect of the notification burst is something we should avoid, if possible.

See below for an example of the controllers fighting over the LRP [1]. Note that as far as the NB db is concerned, the Gateway_Chassis rows were configured with the expected priorities in a single transaction, so this is not happening due to changes in that table [2].

A fallout issue, which is tracked separately (see bz1728282), is that the test removes the router before the notification burst gets fully 'drained' by neutron.
That causes ovsdbapp notifications to refer to row objects that no longer exist, which causes errors that look like:

    Unexpected exception in notify_loop: AttributeError: 'Atom' object has no attribute 'external_ids'

To reproduce this issue, see comment 28 in https://bugzilla.redhat.com/show_bug.cgi?id=1728282#c28

[0]: tempest run --regex neutron_tempest_plugin.api.admin.test_external_network_extension.ExternalNetworksRBACTestJSON.test_delete_policies_while_tenant_attached_to_net
[1]: http://pastebin.test.redhat.com/973584
[2]: http://pastebin.test.redhat.com/973585

[1]: http://pastebin.test.redhat.com/973584
```
[root@controller-2]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding
2021-06-21T21:28:45.825Z|18183|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.825Z|18184|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.831Z|18185|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.831Z|18186|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.836Z|18187|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.836Z|18188|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.840Z|18189|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.840Z|18190|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.845Z|18191|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.845Z|18192|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.849Z|18193|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.849Z|18194|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.854Z|18195|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.854Z|18196|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.857Z|18197|binding|INFO|Releasing lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from this chassis.
────────────────────────────────────────────────────────────────────────────────
[root@controller-1]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding
2021-06-21T21:28:45.917Z|11684|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.917Z|11685|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.922Z|11686|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.922Z|11687|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.926Z|11688|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.926Z|11689|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.931Z|11690|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.931Z|11691|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.936Z|11692|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.936Z|11693|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.940Z|11694|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.940Z|11695|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.945Z|11696|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.945Z|11697|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.950Z|11698|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.950Z|11699|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.954Z|11700|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.954Z|11701|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.960Z|11702|binding|INFO|Claiming lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f for this chassis.
2021-06-21T21:28:45.960Z|11703|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
────────────────────────────────────────────────────────────────────────────────
[root@controller-0]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding
2021-06-21T21:28:45.766Z|15783|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.766Z|15784|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.773Z|15785|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.773Z|15786|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.778Z|15787|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.778Z|15788|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.782Z|15789|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.782Z|15790|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.786Z|15791|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.786Z|15792|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.790Z|15793|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.791Z|15794|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.795Z|15795|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.795Z|15796|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.797Z|15797|binding|INFO|Releasing lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from this chassis.
```

[2]: http://pastebin.test.redhat.com/973585
```
record 570: 2021-06-21 21:28:44.887
  table Logical_Router_Static_Route insert row 8c037eeb:
    ip_prefix="0.0.0.0/0"
    nexthop="10.100.0.1"
    external_ids={"neutron:is_ext_gw"="true", "neutron:subnet_id"="8cafef4c-aed1-4730-92d9-4fa05a149376"}
  table Logical_Router insert row "neutron-8f696771-a82f-463f-8c7b-d1ec24f6f90d" (def11392):
    name=neutron-8f696771-a82f-463f-8c7b-d1ec24f6f90d
    static_routes=[8c037eeb-3231-4530-9ea4-c8654e235353]
    ports=[ee23d25e-18e2-4a66-8a45-629f3293f08d]
    external_ids={"neutron:availability_zone_hints"="", "neutron:gw_port_id"="6a4303dd-ef63-45c7-8591-3ca869ca9c3f", "neutron:revision_number"="3", "neutron:router_name"=tempest-router-1440713011}
    enabled=true
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_2b62cf6f-df80-425a-b27a-73b3079cabce" (5020a5d2):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_2b62cf6f-df80-425a-b27a-73b3079cabce
    priority=3
    chassis_name="2b62cf6f-df80-425a-b27a-73b3079cabce"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_da0969e0-5dba-4fc5-af71-b5351280042e" (2c056d9e):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_da0969e0-5dba-4fc5-af71-b5351280042e
    priority=4
    chassis_name="da0969e0-5dba-4fc5-af71-b5351280042e"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_0fd5b282-a152-47e6-84a9-c3d5645ffe86" (6c621e79):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_0fd5b282-a152-47e6-84a9-c3d5645ffe86
    priority=5
    chassis_name="0fd5b282-a152-47e6-84a9-c3d5645ffe86"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_6fd42157-4e0a-4e5e-8dca-3640bf8d97af" (b78e509f):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_6fd42157-4e0a-4e5e-8dca-3640bf8d97af
    priority=1
    chassis_name="6fd42157-4e0a-4e5e-8dca-3640bf8d97af"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_a16c2360-e86e-45ec-9223-103e9fe813c7" (ded95e9b):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_a16c2360-e86e-45ec-9223-103e9fe813c7
    priority=2
    chassis_name="a16c2360-e86e-45ec-9223-103e9fe813c7"
  table Logical_Router_Port insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f" (ee23d25e):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f
    mac="fa:16:3e:a8:ee:47"
    external_ids={"neutron:network_name"=neutron-f2d70641-bbf3-47e5-8a84-b4c10b6a7755, "neutron:revision_number"="1", "neutron:router_name"="8f696771-a82f-463f-8c7b-d1ec24f6f90d", "neutron:subnet_ids"="8cafef4c-aed1-4730-92d9-4fa05a149376"}
    gateway_chassis=[2c056d9e-d54c-41bf-9163-5e2521985a84, 5020a5d2-80b4-40f8-b1d3-e9b0a6440581, 6c621e79-cef0-4083-af02-d0a5ea634097, b78e509f-a639-4477-9632-eb5ed24158ff, ded95e9b-5d5a-49ae-9c57-a96e5b56e528]
    networks=["10.100.0.10/28"]
  table Logical_Switch_Port row "6a4303dd-ef63-45c7-8591-3ca869ca9c3f" (f74d1302):
    addresses=[router]
    options={nat-addresses=router, router-port=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f}
    type=router
```
We are also seeing this with ovn-2021 in the RHOSP-16.2 release: https://bugzilla.redhat.com/show_bug.cgi?id=2081631
A discussion on IRC suggested that one way of addressing the churn is to introduce a "delay" / backoff mechanism for re-claiming a port after a "recent" claim of the port by the same controller. (The terms "recent" and "delay" would be subject to discussion or maybe even configuration.) This should be doable by making each lport structure carry a timestamp of the latest successful claim.
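To make the idea concrete, here is a minimal sketch of a per-lport claim backoff. It is written in Python purely for illustration; ovn-controller itself is C, and this is not the code from any posted patch. The class name `ClaimBackoff`, the `min_interval` parameter, and its 0.5 s default are all hypothetical stand-ins for the "recent" / "delay" values mentioned above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ClaimBackoff:
    """Hypothetical tracker: remembers when this chassis last claimed each lport."""

    # Assumed "delay" value; the real one would be up for discussion or configuration.
    min_interval: float = 0.5
    # Injectable clock so the behaviour can be exercised without sleeping.
    clock: Callable[[], float] = time.monotonic
    # lport name -> timestamp of the latest successful claim by this chassis.
    _last_claim: Dict[str, float] = field(default_factory=dict, init=False)

    def try_claim(self, lport: str) -> bool:
        """Return True and record the claim, unless the last claim was too recent."""
        now = self.clock()
        last = self._last_claim.get(lport)
        if last is not None and now - last < self.min_interval:
            return False  # back off: we claimed this port very recently
        self._last_claim[lport] = now
        return True


if __name__ == "__main__":
    fake_now = [0.0]
    backoff = ClaimBackoff(min_interval=0.5, clock=lambda: fake_now[0])
    print(backoff.try_claim("cr-lrp-example"))  # first claim goes through
    fake_now[0] = 0.005                         # ~5 ms later, as in the logs above
    print(backoff.try_claim("cr-lrp-example"))  # suppressed by the backoff
```

With the ~5 ms re-claim cadence seen in the ovn-controller logs, a window like this would suppress nearly all of the repeated claims from a given chassis while BFD settles; the trade-off is that a legitimate re-claim could be delayed by up to `min_interval`.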
Bumping the severity/priority due to 2081631 and also because one of the most frequent causes of issues we have in ml2/ovn ends up being related to southbound ovsdb performance. When ovsdb-server has to process 100s of transactions/sec while ports are fought over, that extra load can cause issues. In addition, since neutron subscribes to the SB Port_Binding table, it has to process all of those events as well. So fixing this could be a pretty big deal, performance-wise.
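For a sense of what "dampening" on the consumer side could look like, below is a small sketch that collapses a burst of row updates so that only the newest update per row is processed. This is an illustration only: `coalesce_updates` and the `(row_id, update)` event shape are hypothetical, and it is neither how ovsdbapp or neutron handle events nor part of the fix discussed in this bug.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, List, Tuple


def coalesce_updates(
    events: Iterable[Tuple[Hashable, dict]]
) -> List[Tuple[Hashable, dict]]:
    """Keep only the newest update per row, ordered by each row's last occurrence."""
    latest: "OrderedDict[Hashable, dict]" = OrderedDict()
    for row_id, update in events:
        # Drop any older update for this row, then re-append so order reflects recency.
        latest.pop(row_id, None)
        latest[row_id] = update
    return list(latest.items())


if __name__ == "__main__":
    # 100 flips of one port between two chassis, as in the burst described above.
    burst = [("cr-lrp-example", {"chassis": f"c{i % 2}"}) for i in range(100)]
    print(len(coalesce_updates(burst)))  # one update left to process
```

The caveat is that coalescing discards intermediate states, so it only suits handlers that care about the final value of a row. Stopping the flapping at the source avoids the load on ovsdb-server as well, which this kind of consumer-side filtering does not.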
Posted the fix upstream here: https://patchwork.ozlabs.org/project/ovn/list/?series=308808
@Terry this is fixed in master, and I backported it up to 22.03. Do we really need this in 2.13, or can it be left for FDP improvements?
Mark, please build a new FDP release for 22.03+ / 22.06+. This bug was fixed in:
22.06 https://github.com/ovn-org/ovn/commit/8f1d63bbf6f67ab2bc4eb3d59ba1de43a4f6548f
22.03 https://github.com/ovn-org/ovn/commit/2c98163e024f0543d84df44f9c0840ce0347e2bc
Thank you.