Bug 2203848

Summary: [OVN] Performance problems are breaking stack workflows and affecting VM traffic
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: python-networking-ovn
Assignee: Jakub Libosvar <jlibosva>
Status: MODIFIED
QA Contact: Eran Kuris <ekuris>
Severity: high
Priority: medium
Version: 16.2 (Train)
CC: amusil, andeshmu, apevec, bcafarel, dalvarez, dhill, hakhande, ihrachys, jlibosva, ldenny, lhh, majopela, mlavalle, njohnston, scohen, shtiwari, ssigwald
Target Milestone: z5
Keywords: TestOnly, Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: All
OS: All
Type: Bug
Bug Depends On: 2211240    
Bug Blocks:    

Description Alex Stupnikov 2023-05-15 12:48:13 UTC
Description of problem:

The customer reported various issues related to OVN performance in an RHOSP 16.2 deployment using ML2/OVN with DVR:

- VM communication is sporadically affected in different ways
- Nova fails to start VMs because ovs-vsctl calls time out or fail

In OVN controller logs on compute nodes we can see messages like:

2023-05-04T00:07:56.014Z|689993|main|INFO|OVNSB commit failed, force recompute next time.
2023-05-04T00:08:07.938Z|689994|timeval|WARN|Unreasonably long 11924ms poll interval (11590ms user, 281ms system)
2023-05-04T00:08:07.938Z|689995|timeval|WARN|faults: 331861 minor, 0 major
2023-05-04T00:08:07.938Z|689996|timeval|WARN|context switches: 0 voluntary, 87 involuntary
...
2023-05-04T00:08:08.758Z|690003|poll_loop|INFO|wakeup due to 0-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.760Z|690004|poll_loop|INFO|wakeup due to 1-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.764Z|690005|poll_loop|INFO|wakeup due to 3-ms timeout at controller/mac-learn.c:97 (88% CPU usage)

AND

2023-05-02T00:05:37.808Z|585744|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-UUID1 and \"SOME_IPV6_ADDR\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID UUID2, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID UUID3, was inserted by this transaction.","error":"constraint violation"}


The customer previously complained about the size of the MAC_Binding table and was one of the users of https://access.redhat.com/solutions/6410121. That solution significantly reduced the performance issues but didn't resolve them completely. The customer told us that the performance problems go away after a MAC_Binding table cleanup and is unsure why the table is still a problem after the solution was applied. I took a look at the OVN DBs, and the always_learn_from_arp_request and dynamic_neigh_routers flags appear to be set properly, so there must be something else.

From the provided OVN DBs, it looks like the customer has more than 320 thousand entries in the FDB table, which doesn't look like a normal situation.


Version-Release number of selected component (if applicable):
RHOSP 16.2.4


How reproducible:
Happens sporadically under heavy load (the customer creates and deletes numerous VMs daily)

Comment 29 Ihar Hrachyshka 2023-05-31 16:10:40 UTC
About the cleanup procedure for MAC_Binding records: I don't have a script ready, but here are some pointers:

1. First, collect the UUIDs of all MAC_Binding records whose ip starts with `fe80::`. Sadly, the `ovn-sbctl find` command doesn't seem to support a STARTSWITH predicate, so you will have to combine `ovn-sbctl --columns=ip,_uuid list MAC_Binding` with `awk` / `grep` to capture just the UUIDs of the offending MAC bindings.

2. Then run `ovn-sbctl destroy MAC_Binding <uuid>` for each of the offending UUIDs.

Notes:
- You can combine `destroy` calls by listing multiple <uuid>s at the end of the command.
- I don't think `ovn-sbctl` supports paging for listing records, so the output of the command may be huge.
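
The two steps above could be sketched roughly as follows. This is only an illustration in Python rather than `awk` / `grep`, and it has not been run against a live Southbound DB. It assumes that `ovn-sbctl list` prints one `column : value` line per column with a blank line between records, and that string values may be wrapped in double quotes. The function names, the batch size, and the dry-run default are all hypothetical choices, not part of any existing tool.

```python
import subprocess
from typing import Dict, Iterator, List


def parse_records(text: str) -> Iterator[Dict[str, str]]:
    """Split `ovn-sbctl list` style output into one dict per record.

    Assumption: records are separated by blank lines and every line
    looks like `column : value`.
    """
    record: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only: IPv6 values contain more colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip().strip('"')
    if record:
        yield record


def link_local_uuids(text: str) -> List[str]:
    """Return UUIDs of records whose ip starts with fe80:: (step 1)."""
    return [
        rec["_uuid"]
        for rec in parse_records(text)
        if "_uuid" in rec and rec.get("ip", "").lower().startswith("fe80::")
    ]


def destroy_commands(uuids: List[str], batch_size: int = 500) -> List[List[str]]:
    """Build `ovn-sbctl destroy` commands, several UUIDs per call (step 2)."""
    return [
        ["ovn-sbctl", "destroy", "MAC_Binding", *uuids[i:i + batch_size]]
        for i in range(0, len(uuids), batch_size)
    ]


def main(dry_run: bool = True) -> None:
    """List MAC_Binding, then print (or run) the destroy commands."""
    listing = subprocess.run(
        ["ovn-sbctl", "--columns=ip,_uuid", "list", "MAC_Binding"],
        check=True, capture_output=True, text=True,
    ).stdout
    for cmd in destroy_commands(link_local_uuids(listing)):
        if dry_run:
            print(" ".join(cmd))  # review before deleting anything
        else:
            subprocess.run(cmd, check=True)
```

Calling `main()` only prints the commands it would run; `main(dry_run=False)` executes them. Batching keeps each command line bounded, since the listing may contain a very large number of UUIDs.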