Bug 2203848 - [OVN] Performance problems are breaking stack workflows and affecting VMs' traffic
Summary: [OVN] Performance problems are breaking stack workflows and affecting VMs' traffic
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: z5
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 2211240
Blocks:
 
Reported: 2023-05-15 12:48 UTC by Alex Stupnikov
Modified: 2023-07-28 12:54 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Issue Tracker OSP-25066 (last updated 2023-05-15 12:49:50 UTC)

Description Alex Stupnikov 2023-05-15 12:48:13 UTC
Description of problem:

The customer reported various issues related to OVN performance in an ML2/OVN with DVR RHOSP 16.2 deployment:

- VM communications are sporadically affected in different ways
- Nova fails to start VMs because of timed-out/failed ovs-vsctl calls

In the OVN controller logs on the compute nodes we can see messages like:

2023-05-04T00:07:56.014Z|689993|main|INFO|OVNSB commit failed, force recompute next time.
2023-05-04T00:08:07.938Z|689994|timeval|WARN|Unreasonably long 11924ms poll interval (11590ms user, 281ms system)
2023-05-04T00:08:07.938Z|689995|timeval|WARN|faults: 331861 minor, 0 major
2023-05-04T00:08:07.938Z|689996|timeval|WARN|context switches: 0 voluntary, 87 involuntary
...
2023-05-04T00:08:08.758Z|690003|poll_loop|INFO|wakeup due to 0-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.760Z|690004|poll_loop|INFO|wakeup due to 1-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.764Z|690005|poll_loop|INFO|wakeup due to 3-ms timeout at controller/mac-learn.c:97 (88% CPU usage)

AND

2023-05-02T00:05:37.808Z|585744|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-UUID1 and \"SOME_IPV6_ADDR\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID UUID2, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID UUID3, was inserted by this transaction.","error":"constraint violation"}


The customer previously complained about the MAC_Binding table size and was one of the users of https://access.redhat.com/solutions/6410121. It helped and significantly improved the performance issues, but didn't resolve them completely. The customer told us that the performance problems go away after a MAC_Binding table cleanup and is unsure why this is still a problem after the solution was applied. I took a look at the OVN DBs and it looks like the always_learn_from_arp_request and dynamic_neigh_routers flags are set properly, so there should be something else going on.
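
For reference, a rough way to double-check those flags (a sketch only: it assumes direct ovn-nbctl access to the NB DB, which on RHOSP 16.2 usually means running inside the OVN DB container on a controller node, and it assumes both flags are per-Logical_Router options):

# list each router's name and options, keeping only the lines of interest
ovn-nbctl --columns=name,options list Logical_Router | grep -E 'name|always_learn_from_arp_request|dynamic_neigh_routers'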

From the provided OVN DBs it looks like the customer has more than 320 thousand entries in the FDB table, which doesn't look like a normal situation.
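
A rough way to gauge the SB table sizes (same access assumptions as above; the grep-based counting is just a convenience and may need adjusting to the exact list output):

# count MAC_Binding and FDB rows by listing only their _uuid column
ovn-sbctl --columns=_uuid list MAC_Binding | grep -c _uuid
ovn-sbctl --columns=_uuid list FDB | grep -c _uuid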


Version-Release number of selected component (if applicable):
RHOSP 16.2.4


How reproducible:
Happens sporadically under heavy load (the customer creates and deletes numerous VMs daily).

Comment 29 Ihar Hrachyshka 2023-05-31 16:10:40 UTC
About the cleanup procedure for MAC_Binding records: I don't have a script ready, but here are some pointers:

1. First, collect the UUIDs of all MAC_Binding records that have their ip starting with `fe80::`. Sadly, the `ovn-sbctl find` command doesn't seem to support a STARTSWITH predicate, so you will have to combine `ovn-sbctl --columns=ip,_uuid list MAC_Binding` with `awk` / `grep` to capture just the UUIDs of the offending MAC bindings.

2. Then run `ovn-sbctl destroy MAC_Binding <uuid>` for each of the offending UUIDs (a combined sketch follows after the notes below).

Notes:
- you can combine `destroy` calls by listing multiple <uuid>s at the end of the command.
- I don't think `ovn-sbctl` supports paging for listing records, so the output of the command may be huge.
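
A rough combined sketch of the two steps above (untested; the CSV parsing, the fe80:: prefix match and the xargs batch size are my assumptions, so please review the collected UUID list before destroying anything):

# 1. collect the UUIDs of MAC_Binding rows whose ip starts with fe80::
ovn-sbctl --format=csv --no-headings --columns=_uuid,ip list MAC_Binding \
    | awk -F, '$2 ~ /^"?fe80::/ {print $1}' > /tmp/stale_mac_bindings

# 2. destroy them, passing multiple UUIDs per ovn-sbctl call
xargs -n 50 ovn-sbctl destroy MAC_Binding < /tmp/stale_mac_bindings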

