Currently ovsdb-server is mostly single-threaded; only some operations (like disk sync) are performed asynchronously. Database compaction (a.k.a. snapshot creation), however, is performed by the main thread, and it is fairly heavy. It's not a very frequent operation in a real-world setup, but it may cause latency spikes from a few to a dozen seconds for control operations on a fairly big OVN cluster. That is inconvenient, especially for performance and scale testing, where lots of operations are executed at a very high rate, forcing the database to compact itself frequently.

To avoid such latency, ovsdb-server in clustered mode transfers the raft leadership before starting compaction, but that only helps for leader-only connections, such as CMS connections to the NbDB or ovn-northd connections to the databases. ovn-controllers don't follow the leader and will have to wait for the SbDB to finish the compaction before they can send a port state update or receive an updated configuration.

The situation can be improved in a few different ways (e.g. by using leader-only relays as a frontend for the SbDB); this BZ is about one very specific way: moving some parts of the compaction process out of the main thread, allowing it to continue serving clients and executing transactions. The main idea is to move the database-to-JSON conversion to a separate thread.

The main problem here is data availability: ovsdb data structures are not generally thread-safe, so they cannot be accessed by other threads while the main thread is changing them by executing transactions. This can be worked around by cloning the required objects in the main thread first and handing them over to the worker thread. The key is that the cloning itself must be fast for this to make any sense.

Two possible implementations:

1. The RAFT module has a log of all the database operations in JSON format, as well as the previous database snapshot. And, conveniently, JSON objects support shallow copies. That means a copy of the raft log can be created fairly quickly by the main thread and handed over to the worker thread, which will create a new snapshot from that data.

   Pros:
   - All the prerequisites are already in place.

   Cons:
   - The worker will have to parse all JSON objects back into database objects and, basically, re-play all the transactions in order to construct a new representation of the database that can be converted back to JSON. This is a significant amount of extra work, so the compaction itself will take much longer and consume more CPU and memory, even though all of that happens in a separate thread.
   - RAFT-specific, i.e. it cannot be applied to the standalone database model.

2. If shallow copies can be created from database rows (BZ 2069089) directly, the main thread could create a shallow copy of the current database state and hand it over to the worker thread for JSON conversion.

   Pros:
   - No unnecessary work, i.e. ovsdb-server will take roughly the same aggregate CPU time for compaction.
   - The implementation is storage-agnostic, so it can, in theory, be applied to standalone databases as well (some changes to the DB log module may be needed).

   Cons:
   - Prerequisite: https://bugzilla.redhat.com/show_bug.cgi?id=2069089

Option 2 makes more sense performance-wise, so it is preferable. The main thread will still need to perform the operations related to the file replacement and the raft log modifications.
Sent for review: https://patchwork.ozlabs.org/project/openvswitch/patch/20220630233407.623049-3-i.maximets@ovn.org/
Patches accepted upstream. Will be part of OVS 3.0 release.