Description of problem:

The customer reported several OVN performance issues in an ML2/OVN deployment with DVR on RHOSP 16.2:

- VM communications are sporadically affected in different ways
- Nova fails to start VMs because ovs-vsctl calls time out or fail

In the OVN controller logs on the compute nodes we can see messages like:

2023-05-04T00:07:56.014Z|689993|main|INFO|OVNSB commit failed, force recompute next time.
2023-05-04T00:08:07.938Z|689994|timeval|WARN|Unreasonably long 11924ms poll interval (11590ms user, 281ms system)
2023-05-04T00:08:07.938Z|689995|timeval|WARN|faults: 331861 minor, 0 major
2023-05-04T00:08:07.938Z|689996|timeval|WARN|context switches: 0 voluntary, 87 involuntary
...
2023-05-04T00:08:08.758Z|690003|poll_loop|INFO|wakeup due to 0-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.760Z|690004|poll_loop|INFO|wakeup due to 1-ms timeout at controller/mac-learn.c:97 (88% CPU usage)
2023-05-04T00:08:08.764Z|690005|poll_loop|INFO|wakeup due to 3-ms timeout at controller/mac-learn.c:97 (88% CPU usage)

and

2023-05-02T00:05:37.808Z|585744|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-UUID1 and \"SOME_IPV6_ADDR\") for index on columns \"logical_port\" and \"ip\". First row, with UUID UUID2, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID UUID3, was inserted by this transaction.","error":"constraint violation"}

The customer previously complained about the size of the MAC_Binding table and was one of the users of https://access.redhat.com/solutions/6410121. That solution helped and significantly reduced the performance issues, but did not resolve them completely. The customer told us that the performance problems go away after a MAC_Binding table cleanup and is unsure why they still occur after the solution is applied.
I took a look at the OVN DBs, and the always_learn_from_arp_request and dynamic_neigh_routers flags appear to be set properly, so there must be something else. According to the provided OVN DBs, the customer has more than 320 thousand entries in FDB, which does not look like a normal situation.

Version-Release number of selected component (if applicable):
RHOSP 16.2.4

How reproducible:
Happens sporadically under heavy load (the customer creates and deletes numerous VMs daily)
About the cleanup procedure for MAC_Binding records: I don't have a script ready, but here are some pointers:

1. First, collect the UUIDs of all MAC_Binding records whose ip starts with `fe80::`. Sadly, the `ovn-sbctl find` command doesn't seem to support a STARTSWITH predicate, so you will have to combine `ovn-sbctl --columns=ip,_uuid list MAC_Binding` with `awk` / `grep` to capture just the UUIDs of the offending MAC bindings.
2. Then run `ovn-sbctl destroy MAC_Binding <uuid>` for each of the offending UUIDs.

Notes:
- You can combine `destroy` calls by listing multiple <uuid>s at the end of the command.
- I don't think `ovn-sbctl` supports paging when listing records, so the output of the command may be huge.
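The cleanup pointers above can be sketched in Python instead of `awk` / `grep`. This is an untested-in-production sketch, not a supported tool. It assumes that `ovn-sbctl --columns=ip,_uuid list MAC_Binding` prints `column : value` lines with records separated by blank lines and string values in double quotes. The function names and the batch size of 500 UUIDs per `destroy` call are hypothetical choices made for illustration.

```python
import subprocess
from typing import Dict, Iterator, List


def parse_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per record from 'column : value' list output.

    Assumes records are separated by blank lines.
    """
    record: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so IPv6 addresses stay intact.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip().strip('"')
    if record:
        yield record


def link_local_uuids(text: str, prefix: str = "fe80::") -> List[str]:
    """Return UUIDs of records whose ip starts with the given prefix."""
    return [
        rec["_uuid"]
        for rec in parse_records(text)
        if "_uuid" in rec and rec.get("ip", "").lower().startswith(prefix)
    ]


def destroy_commands(uuids: List[str], batch_size: int = 500) -> List[List[str]]:
    """Group UUIDs into 'ovn-sbctl destroy MAC_Binding <uuid>...' calls."""
    return [
        ["ovn-sbctl", "destroy", "MAC_Binding", *uuids[i:i + batch_size]]
        for i in range(0, len(uuids), batch_size)
    ]


def main(dry_run: bool = True) -> None:
    """List MAC_Binding records and destroy the link-local ones."""
    listing = subprocess.run(
        ["ovn-sbctl", "--columns=ip,_uuid", "list", "MAC_Binding"],
        check=True, capture_output=True, text=True,
    ).stdout
    for cmd in destroy_commands(link_local_uuids(listing)):
        if dry_run:
            # Only report what would be removed.
            print(f"would destroy {len(cmd) - 3} MAC_Binding records")
        else:
            subprocess.run(cmd, check=True)
```

Calling `main()` only reports how many records each batch would remove; `main(dry_run=False)` actually runs the `destroy` commands. Because the whole listing is read into memory at once, this does not avoid the large-output concern from the notes, it only keeps each `destroy` command line to a bounded length.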