Bug 1974898 - tug-of-war between ovn-controllers for external gateway port causes havoc for ml2-ovn
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn22.03
Version: FDP 20.H
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks: 1728282 1994427 2081631 2189267 2196286
Reported: 2021-06-22 18:06 UTC by ffernand
Modified: 2023-06-15 06:55 UTC (History)

Fixed In Version: ovn22.03-22.03.0-95.el8fdp ovn22.03-22.03.0-95.el9fdp
Doc Text:
Clone Of:
: 2189267 2196286 (view as bug list)
Environment:
Last Closed: 2023-03-13 07:13:56 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1386 0 None None None 2021-08-20 12:52:51 UTC

Description ffernand 2021-06-22 18:06:54 UTC
When using ml2-ovn on a test topology [0], a router with an external gateway port gets created shortly after the cluster is deployed.
Since BFD has not had enough time to stabilize, we observe the OVN controllers fighting over the LRP at a pretty
high rate. In some tests, we see the brawl flip the LRP ~100 times during the few seconds it takes for BFD to indirectly
settle the dispute.

While that lasts just a few seconds, there is no dampening of ovsdb notifications, so the monitoring in ml2/ovn queues
up all these notifications and processes them one by one, racking up a lot of updates to the lsp and lrp rows that
were affected. Dealing with that takes a whole lot longer (~1.5 minutes). So the ripple effect of the notification burst is
something we should avoid, if possible.
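
To illustrate what "dampening" could mean on the consumer side, here is a minimal sketch that coalesces a burst of row updates so that only the latest pending update per row survives. This is an assumption for illustration only: `CoalescingQueue` and its methods are hypothetical names, not part of ovsdbapp or neutron, and it is only safe when the consumer cares about the final row state rather than every intermediate flip.

```python
from collections import OrderedDict


class CoalescingQueue:
    """Hypothetical queue that keeps only the latest update per (table, row)."""

    def __init__(self):
        self._pending = OrderedDict()

    def put(self, table, row_uuid, update):
        key = (table, row_uuid)
        # Drop the superseded update, then re-queue the new one at the tail.
        self._pending.pop(key, None)
        self._pending[key] = update

    def drain(self):
        # Hand over everything that is pending and start from empty again.
        items = list(self._pending.items())
        self._pending.clear()
        return items


# Simulate ~100 flips of the same Port_Binding row between three chassis.
queue = CoalescingQueue()
chassis = ["0fd5b282", "a16c2360", "da0969e0"]
for i in range(100):
    queue.put("Port_Binding", "cr-lrp-6a4303dd", {"chassis": chassis[i % 3]})

drained = queue.drain()
print(len(drained), drained[0][1]["chassis"])
```

With this shape, the 100 queued flips collapse into a single update to process once the burst is over.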

See below for an example of the controllers fighting over the LRP [1]. Note that as far as the NB db is concerned, the
Gateway_Chassis had been configured with the expected priorities in a single transaction, so this is not happening
due to changes in that table [2].

A fallout issue, tracked separately (see bz1728282), is that the test removes the router before the
notification burst gets fully 'drained' by neutron. That causes ovsdbapp notifications to refer to row objects that no longer
exist, which causes errors that look like:

    Unexpected exception in notify_loop: AttributeError: 'Atom' object has no attribute 'external_ids'

To reproduce this issue, see comment 28 in https://bugzilla.redhat.com/show_bug.cgi?id=1728282#c28

[0]: tempest run --regex neutron_tempest_plugin.api.admin.test_external_network_extension.ExternalNetworksRBACTestJSON.test_delete_policies_while_tenant_attached_to_net
[1]: http://pastebin.test.redhat.com/973584
[2]: http://pastebin.test.redhat.com/973585


[1]: http://pastebin.test.redhat.com/973584
```
[root@controller-2]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding

2021-06-21T21:28:45.825Z|18183|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.825Z|18184|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.831Z|18185|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.831Z|18186|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.836Z|18187|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.836Z|18188|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.840Z|18189|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.840Z|18190|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.845Z|18191|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.845Z|18192|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.849Z|18193|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.849Z|18194|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.854Z|18195|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to a16c2360-e86e-45ec-9223-103e9fe813c7.
2021-06-21T21:28:45.854Z|18196|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.857Z|18197|binding|INFO|Releasing lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from this chassis.

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

[root@controller-1]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding

2021-06-21T21:28:45.917Z|11684|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.917Z|11685|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.922Z|11686|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.922Z|11687|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.926Z|11688|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.926Z|11689|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.931Z|11690|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.931Z|11691|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.936Z|11692|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.936Z|11693|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.940Z|11694|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.940Z|11695|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.945Z|11696|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.945Z|11697|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.950Z|11698|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.950Z|11699|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.954Z|11700|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 2b62cf6f-df80-425a-b27a-73b3079cabce to 0fd5b282-a152-47e6-84a9-c3d5645ffe86.
2021-06-21T21:28:45.954Z|11701|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.960Z|11702|binding|INFO|Claiming lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f for this chassis.
2021-06-21T21:28:45.960Z|11703|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

[root@controller-0]# tail -F /var/log/containers/openvswitch/ovn-controller.log | grep binding

2021-06-21T21:28:45.766Z|15783|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.766Z|15784|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.773Z|15785|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.773Z|15786|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.778Z|15787|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.778Z|15788|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.782Z|15789|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.782Z|15790|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.786Z|15791|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.786Z|15792|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.790Z|15793|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.791Z|15794|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.795Z|15795|binding|INFO|Changing chassis for lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from 0fd5b282-a152-47e6-84a9-c3d5645ffe86 to da0969e0-5dba-4fc5-af71-b5351280042e.
2021-06-21T21:28:45.795Z|15796|binding|INFO|cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28
2021-06-21T21:28:45.797Z|15797|binding|INFO|Releasing lport cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f from this chassis.
```
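
To put a number on the flapping, a small script can count the "Changing chassis" lines per lport and target chassis. This is only a sketch: the regular expression is derived from the log lines pasted above and is an assumption about their format, not a documented ovn-controller log schema.

```python
import re
from collections import Counter

# Pattern inferred from the ovn-controller.log excerpts above (assumption).
CHANGE_RE = re.compile(
    r"\|binding\|INFO\|Changing chassis for lport (?P<lport>\S+) "
    r"from (?P<old>\S+) to (?P<new>[0-9a-f-]+)\."
)


def count_chassis_changes(lines):
    """Count 'Changing chassis' events keyed by (lport, new chassis)."""
    counts = Counter()
    for line in lines:
        match = CHANGE_RE.search(line)
        if match:
            counts[(match.group("lport"), match.group("new"))] += 1
    return counts


LPORT = "cr-lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f"
OLD = "0fd5b282-a152-47e6-84a9-c3d5645ffe86"
NEW = "a16c2360-e86e-45ec-9223-103e9fe813c7"
sample = [
    f"2021-06-21T21:28:45.825Z|18183|binding|INFO|Changing chassis for lport {LPORT} from {OLD} to {NEW}.",
    f"2021-06-21T21:28:45.825Z|18184|binding|INFO|{LPORT}: Claiming fa:16:3e:a8:ee:47 10.100.0.10/28",
    f"2021-06-21T21:28:45.831Z|18185|binding|INFO|Changing chassis for lport {LPORT} from {OLD} to {NEW}.",
]

counts = count_chassis_changes(sample)
for (lport, new_chassis), n in counts.items():
    print(f"{lport} -> {new_chassis}: {n} change(s)")
```

Feeding it the full log from each controller gives a per-chassis count of how often the port was grabbed during the burst.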

[2]: http://pastebin.test.redhat.com/973585
```
record 570: 2021-06-21 21:28:44.887
  table Logical_Router_Static_Route insert row 8c037eeb:
    ip_prefix="0.0.0.0/0"
    nexthop="10.100.0.1"
    external_ids={"neutron:is_ext_gw"="true", "neutron:subnet_id"="8cafef4c-aed1-4730-92d9-4fa05a149376"}
  table Logical_Router insert row "neutron-8f696771-a82f-463f-8c7b-d1ec24f6f90d" (def11392):
    name=neutron-8f696771-a82f-463f-8c7b-d1ec24f6f90d
    static_routes=[8c037eeb-3231-4530-9ea4-c8654e235353]
    ports=[ee23d25e-18e2-4a66-8a45-629f3293f08d]
    external_ids={"neutron:availability_zone_hints"="", "neutron:gw_port_id"="6a4303dd-ef63-45c7-8591-3ca869ca9c3f", "neutron:revision_number"="3", "neutron:router_name"=tempest-router-1440713011}
    enabled=true
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_2b62cf6f-df80-425a-b27a-73b3079cabce" (5020a5d2):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_2b62cf6f-df80-425a-b27a-73b3079cabce
    priority=3
    chassis_name="2b62cf6f-df80-425a-b27a-73b3079cabce"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_da0969e0-5dba-4fc5-af71-b5351280042e" (2c056d9e):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_da0969e0-5dba-4fc5-af71-b5351280042e
    priority=4
    chassis_name="da0969e0-5dba-4fc5-af71-b5351280042e"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_0fd5b282-a152-47e6-84a9-c3d5645ffe86" (6c621e79):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_0fd5b282-a152-47e6-84a9-c3d5645ffe86
    priority=5
    chassis_name="0fd5b282-a152-47e6-84a9-c3d5645ffe86"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_6fd42157-4e0a-4e5e-8dca-3640bf8d97af" (b78e509f):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_6fd42157-4e0a-4e5e-8dca-3640bf8d97af
    priority=1
    chassis_name="6fd42157-4e0a-4e5e-8dca-3640bf8d97af"
  table Gateway_Chassis insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_a16c2360-e86e-45ec-9223-103e9fe813c7" (ded95e9b):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f_a16c2360-e86e-45ec-9223-103e9fe813c7
    priority=2
    chassis_name="a16c2360-e86e-45ec-9223-103e9fe813c7"
  table Logical_Router_Port insert row "lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f" (ee23d25e):
    name=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f
    mac="fa:16:3e:a8:ee:47"
    external_ids={"neutron:network_name"=neutron-f2d70641-bbf3-47e5-8a84-b4c10b6a7755, "neutron:revision_number"="1", "neutron:router_name"="8f696771-a82f-463f-8c7b-d1ec24f6f90d", "neutron:subnet_ids"="8cafef4c-aed1-4730-92d9-4fa05a149376"}
    gateway_chassis=[2c056d9e-d54c-41bf-9163-5e2521985a84, 5020a5d2-80b4-40f8-b1d3-e9b0a6440581, 6c621e79-cef0-4083-af02-d0a5ea634097, b78e509f-a639-4477-9632-eb5ed24158ff, ded95e9b-5d5a-49ae-9c57-a96e5b56e528]
    networks=["10.100.0.10/28"]
  table Logical_Switch_Port row "6a4303dd-ef63-45c7-8591-3ca869ca9c3f" (f74d1302):
    addresses=[router]
    options={nat-addresses=router, router-port=lrp-6a4303dd-ef63-45c7-8591-3ca869ca9c3f}
    type=router
```
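
For reference, the priorities in the transaction above give an unambiguous preference order, since a higher Gateway_Chassis priority is preferred. The sketch below is a deliberately simplified model, not ovn-controller's actual logic: `expected_owner` and the `reachable` set are hypothetical stand-ins for each controller's BFD view. It shows how controllers with different partial views can each conclude that they should own the port.

```python
# Priorities copied from the Gateway_Chassis rows in the NB record above.
gateway_chassis = {
    "2b62cf6f-df80-425a-b27a-73b3079cabce": 3,
    "da0969e0-5dba-4fc5-af71-b5351280042e": 4,
    "0fd5b282-a152-47e6-84a9-c3d5645ffe86": 5,
    "6fd42157-4e0a-4e5e-8dca-3640bf8d97af": 1,
    "a16c2360-e86e-45ec-9223-103e9fe813c7": 2,
}


def preferred_order(priorities):
    """Chassis names sorted from most to least preferred."""
    return sorted(priorities, key=priorities.get, reverse=True)


def expected_owner(priorities, reachable):
    """Highest-priority chassis that this observer believes is reachable.

    'reachable' is a simplified stand-in for one controller's BFD view.
    """
    for chassis in preferred_order(priorities):
        if chassis in reachable:
            return chassis
    return None


# With a settled, complete view everyone agrees on the priority-5 chassis.
print(expected_owner(gateway_chassis, set(gateway_chassis)))

# A controller that only trusts itself so far picks itself instead.
print(expected_owner(gateway_chassis, {"a16c2360-e86e-45ec-9223-103e9fe813c7"}))
```

Under this model, disagreement between the controllers comes from their differing views of reachability, not from the NB configuration.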

Comment 4 Yatin Karel 2022-05-25 14:56:57 UTC
We are also seeing it with ovn-2021 in the RHOSP-16.2 release: https://bugzilla.redhat.com/show_bug.cgi?id=2081631

Comment 5 Ihar Hrachyshka 2022-05-25 16:21:40 UTC
A discussion in irc suggested that one way of addressing the churn is to introduce a "delay" / backoff mechanism for re-claiming a port after a "recent" claim of the port by the same controller. (Terms "recent" and "delay" would be subject to discussion or maybe even configuration.) This should be doable by making each lport structure carry a timestamp of the latest successful claim.
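
A rough sketch of the idea, in Python for brevity (ovn-controller itself is written in C, and this is not the upstream patch): each lport remembers when this controller last claimed it, and a re-claim is refused until a delay has passed. `ClaimBackoff`, the 500 ms delay, and the injected clock are all assumptions made for illustration.

```python
import time


class ClaimBackoff:
    """Hypothetical per-lport throttle for re-claims by the same controller."""

    def __init__(self, delay, clock=time.monotonic):
        self.delay = delay          # same units as the clock returns
        self._clock = clock
        self._last_claim = {}       # lport name -> time of last claim

    def may_claim(self, lport):
        last = self._last_claim.get(lport)
        return last is None or self._clock() - last >= self.delay

    def record_claim(self, lport):
        self._last_claim[lport] = self._clock()


# Simulate a claim attempt every 5 ms for one second, using a fake
# millisecond clock so the example is deterministic.
now_ms = [0]
backoff = ClaimBackoff(delay=500, clock=lambda: now_ms[0])

attempts = 0
claims = []
for t in range(0, 1000, 5):
    now_ms[0] = t
    attempts += 1
    if backoff.may_claim("cr-lrp-example"):
        backoff.record_claim("cr-lrp-example")
        claims.append(t)

print(attempts, claims)
```

In this simulation, 200 attempts result in only two claims, which is the kind of reduction in southbound churn the proposal is after. The right delay would still be subject to the discussion mentioned above.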

Comment 6 Terry Wilson 2022-05-25 17:18:04 UTC
Bumping the severity/priority due to 2081631 and also because one of the most frequent causes of issues we have in ml2/ovn ends up being related to southbound ovsdb performance. When ovsdb-server has to process hundreds of transactions/sec while ports are fought over, that extra load can cause issues. In addition, since neutron subscribes to the SB Port_Binding table, it has to process all of those events as well. So fixing this could be a pretty big deal, performance-wise.

Comment 7 Ihar Hrachyshka 2022-07-09 05:34:22 UTC
Posted the fix upstream here: https://patchwork.ozlabs.org/project/ovn/list/?series=308808

Comment 8 Ihar Hrachyshka 2022-08-22 18:52:00 UTC
@Terry this is fixed in master, and I backported it up to 22.03. Do we really need to see this in 2.13, or can it be left for FDP improvements?

Comment 9 Ihar Hrachyshka 2022-08-23 16:24:15 UTC
Mark,

please build a new FDP release for 22.03+ / 22.06+; this bug was fixed in

22.06 https://github.com/ovn-org/ovn/commit/8f1d63bbf6f67ab2bc4eb3d59ba1de43a4f6548f
22.03 https://github.com/ovn-org/ovn/commit/2c98163e024f0543d84df44f9c0840ce0347e2bc

Thank you.

