Bug 1941615
| Summary: | [RFE] ovsdb-server: Support 2-Tier deployment (OVSDB Relay Service Model) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Ilya Maximets <i.maximets> |
| Component: | ovsdb | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED UPSTREAM | QA Contact: | Jianlin Shi <jishi> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | FDP 21.C | CC: | atragler, ctrautma, dalvarez, dcbw, dceara, echaudro, jhsiao, jmelvin, qding, ralongi, rsevilla |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-19 13:08:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1941632, 1941646 | | |
| Bug Blocks: | | | |
For implementation sanity, there was a slight change of concept.
Instead of replicating a cluster (i.e. connecting to multiple
remotes and choosing which cluster member to sync from), it's much
easier to replicate one particular server from the ovsdb cluster and
allow clients to choose which replication server to connect to.
It will look something like this:
+---------------------------------------------------------+
| RAFT CLUSTER |
| +---------+ ovsdb-server-1 +------+ |
| | + | |
| + | + |
| +--+ovsdb-server-2 +----|----------+ ovsdb-server-3+--+ |
| | | | |
+-|-----------------------|-----------------------------|-+
| | |
| +-------+----------+ +---------+
+ | + | | +
+-+ ovsdb-relay-1 +-+ 2 ovsdb-relay-3 4 .... N-1 ovsdb-relay-N
| | | | | | | | | | | | |
+ + + +
client-1 client-2 client-3 .... .... .... client-M
To achieve that, replication of the internal _Server database needs to
be implemented. This way clients will know whether a particular
replicated server is a cluster member and whether it's healthy, and can
decide to re-connect to a different replication server. The one less
convenient part is that each client should have at least one
replication server for each "main" server in its list of remotes for
the failover case. But this deployment schema will allow us to avoid
implementing failover at the replication level. (It's possible, but
would require significant re-work if we want to avoid a lot of code
duplication. It's much easier to use the existing client-side failover
mechanism for now.)
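The _Server database mentioned above can already be inspected with stock tooling; a sketch (the socket path is hypothetical):

```shell
# The Database table of the internal _Server database carries
# 'connected', 'leader' and cluster-id columns that a client can
# inspect to decide whether the server it is talking to is a healthy
# cluster member.
ovsdb-client dump unix:sb-relay.ovsdb _Server Database
```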
Also, there will be fewer re-connections to the main cluster,
because replication servers will never reconnect due to raft issues.
And this should be good for cluster stability.
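As a sketch of the client-side failover arrangement described above (hostnames and ports are hypothetical), each ovn-controller would list a replication server plus at least one main server in its remotes:

```shell
# Hypothetical remotes list: the client tries the replication server
# first and falls back to a main cluster member using the ordinary
# client-side failover between remotes.
ovs-vsctl set open_vswitch . \
    external_ids:ovn-remote="tcp:relay-1.example.com:6642,tcp:main-1.example.com:6642"
```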
BZ 1941632 updated accordingly.
OK. We're back to the original design from the description :)
+---------------------------------------------------------+
| RAFT CLUSTER |
| +---------+ ovsdb-server-1 +------+ |
| | | |
| + + |
| ovsdb-server-2 +---------------+ ovsdb-server-3 |
| |
+-----+-------------------+-----------------------------+-+
| | |
+ + +
+-+ ovsdb-relay-1 +-+ ovsdb-relay-2 .... ovsdb-relay-N
| | | | | | | | | | | | |
+ + + +
client-1 client-2 client-3 .... .... .... client-M
I don't really like the concept from comment #1 and the implementation
is not that pretty. So, I returned to the original concept of the
architecture, but with a different concept for the implementation.
The new implementation is not based on the existing active-backup mode
replication. Instead, I'm implementing a separate operation mode named
"relay" for the OVS database. The implementation is based on the ovsdb
client synchronization library, which is the same library that the
ovsdb IDL uses, i.e. the same state machine used by ovsdb clients.
It natively supports higher versions of monitor requests, including
monitor_cond_since and update3 notifications, and it also natively
supports multiple remotes and automatic re-connection.
The new ovsdb relay mode will not have file storage, i.e. it will be
fully in-memory. To activate it, instead of passing a database file
name to ovsdb-server, the user will need to use a string in the
following format:
relay:<database name>:<list of remotes>
For example, to start a relay database for a clustered Southbound database
in the OVN sandbox, ovsdb-server should be started with the following arguments:
make -j8 sandbox SANDBOXFLAGS="--sbdb-model=clustered"
ovsdb-server --detach --no-chdir \
--pidfile=sb-relay.pid --log-file=sb-relay.log \
--remote=db:OVN_Southbound,SB_Global,connections \
--private-key=db:OVN_Southbound,SSL,private_key \
--certificate=db:OVN_Southbound,SSL,certificate \
--ca-cert=db:OVN_Southbound,SSL,ca_cert \
--ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols \
--ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers \
--unixctl=sb-relay --remote=punix:sb-relay.ovsdb \
relay:OVN_Southbound:unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb
The only difference from a normal database start is that instead
of passing a database file name like "sb3.db" or "sb1.db", we're starting
the relay database like this:
relay:OVN_Southbound:unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb
Where "relay" is a keyword, "OVN_Southbound" is the database name,
and "unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb" is the list of
remotes from which this database should be synchronized. For TCP-based
connections it may look like this:
relay:OVN_Southbound:tcp:192.168.0.1:6642,tcp:192.168.0.2:6642,tcp:192.168.0.3:6642
ovsdb-server in relay mode executes all the "read" requests by itself
and forwards all the "write" requests to the database source.
It should not be used for leader-only connections, which means
it should not be used by ovn-northd; northd should connect
to the original database only.
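Using the sandbox relay from the example above, the read/write split can be exercised with ovsdb-client (the transact body below is a hypothetical update, just for illustration):

```shell
# Reads are answered from the relay's in-memory copy of the database:
ovsdb-client dump unix:sb-relay.ovsdb OVN_Southbound SB_Global

# Writes are accepted on the same socket and forwarded by the relay
# to the database source (the main cluster):
ovsdb-client transact unix:sb-relay.ovsdb \
  '["OVN_Southbound",
    {"op": "update",
     "table": "SB_Global",
     "where": [],
     "row": {"external_ids": ["map", [["test", "1"]]]}}]'
```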
Here is the openvswitch rpm scratch build with the above functionality
implemented:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1941615.0.1.el8fdp/
Meanwhile, I'm working on finalizing the code and preparing the new
version of the patch-set for the upstream mailing list.
v2 with a full implementation of OVSDB Relay was posted upstream:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=248474&state=*

Here is a new build with it, as it seems that the previous one expired:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1941615.0.10.el8fdp/

v3 of the OVSDB Relay patch-set is posted upstream, including the test results
for tests that I ran in our scale lab with ovn-heater:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=253496&state=*

Cover letter:
https://patchwork.ozlabs.org/project/openvswitch/cover/20210714135023.373838-1-i.maximets@ovn.org/

Repeating the test results here for better visibility:

Testing
=======

Some scale tests were performed with OVSDB Relays that mimic OVN workloads
with ovn-kubernetes. Tests performed with ovn-heater
(https://github.com/dceara/ovn-heater) on scenario ocp-120-density-heavy:
https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml

In short, the test gradually creates a lot of OVN resources and checks that
the network is configured correctly (by pinging different namespaces).
The test includes 120 chassis (created by ovn-fake-multinode), 31250 LSPs
spread evenly across 120 LSes, 3 LBs with 15625 VIPs each, attached to all
node LSes, etc. Test performed with monitor-all=true.

Note 1:
- Memory consumption is checked at the end of a test in the following way:
  1) check RSS, 2) compact database, 3) check RSS again. It's observed that
  ovn-controllers in this test are fairly slow and backlog builds up on
  monitors, because ovn-controllers are not able to receive updates fast
  enough. This contributes to the RSS of the process, especially in
  combination with a glibc bug (glibc doesn't free fastbins back to the
  system). Memory trimming on compaction is enabled in the test, so after
  compaction we can see a more or less real value of the RSS at the end of
  the test without backlog noise.
  (Compaction on a relay in this case is just a plain malloc_trim().)

Note 2:
- I didn't collect memory consumption (RSS) after compaction for the test
  with 10 relays, because I got the idea only after the test was finished
  and another one had already started, and a run takes a significant amount
  of time. So, values marked with a star (*) are an approximation based on
  results from other tests, hence might not be fully correct.

Note 3:
- 'Max. poll' is the maximum of the 'long poll intervals' logged by
  ovsdb-server during the test. Poll intervals that involved database
  compaction (huge disk writes) are the same in all tests and excluded from
  the results. (Sb DB size in the test is 256MB, fully compacted.)
  'Number of intervals' is just the number of logged unreasonably long poll
  intervals. Also note that ovsdb-server logs only compactions that took
  > 1s, so poll intervals that involved compaction but took under 1s cannot
  be reliably excluded from the test results.

'central' - main Sb DB servers.
'relay'   - relay servers connected to central ones.
'before'/'after' - RSS before and after compaction + malloc_trim().
'time'    - the total time the process spent in Running state.

Baseline (3 main servers, 0 relays):
++++++++++++++++++++++++++++++++++++++++
                 RSS
central    before     after   clients    time   Max. poll   Number of intervals
          7552924   3828848       ~41  109:50        5882                  1249
          7342468   4109576       ~43  108:37        5717                  1169
          5886260   4109496       ~39   96:31        4990                  1233
---------------------------------------------------------------------
              20G       12G       126  314:58        5882                  3651

3x3 (3 main servers, 3 relays):
+++++++++++++++++++++++++++++++
                 RSS
central    before     after   clients    time   Max. poll   Number of intervals
          6228176   3542164      ~1-5   36:53        2174                   358
          5723920   3570616      ~1-5   24:03        2205                   382
          5825420   3490840      ~1-5   35:42        2214                   309
---------------------------------------------------------------------
            17.7G     10.6G         9   96:38        2214                  1049

relay      before     after   clients    time   Max. poll   Number of intervals
          2174328    726576        37   69:44        5216                   627
          2122144    729640        32   63:52        4767                   625
          2824160    751384        51   89:09        5980                   627
---------------------------------------------------------------------
               7G      2.2G       120  222:45        5980                  1879

Total:
=====================================================================
            24.7G     12.8G       129  319:23        5980                  2928

3x10 (3 main servers, 10 relays):
+++++++++++++++++++++++++++++++++
                 RSS
central    before     after   clients    time   Max. poll   Number of intervals
          6190892       ---      ~1-6   42:43        2041                   634
          5687576       ---      ~1-5   27:09        2503                   405
          5958432       ---      ~1-7   40:44        2193                   450
---------------------------------------------------------------------
            17.8G     ~10G*        16  110:36        2503                  1489

relay      before     after   clients    time   Max. poll   Number of intervals
          1331256       ---         9   22:58        1327                   140
          1218288       ---        13   28:28        1840                   621
          1507644       ---        19   41:44        2869                   623
          1257692       ---        12   27:40        1532                   517
          1125368       ---         9   22:23        1148                   105
          1380664       ---        16   35:04        2422                   619
          1087248       ---         6   18:18        1038                     6
          1277484       ---        14   34:02        2392                   616
          1209936       ---        10   25:31        1603                   451
          1293092       ---        12   29:03        2071                   621
---------------------------------------------------------------------
            12.6G     5-7G*       120  285:11        2869                  4319

Total:
=====================================================================
            30.4G   15-17G*       136  395:47        2869                  5808

Conclusions from the test:
==========================

1. Relays relieve a lot of pressure from the main Sb DB servers. In my
   testing, total CPU time on main servers goes down from 314 to 96-110
   minutes, which is 3 times lower. During the test, the number of
   registered 'unreasonably long poll interval's on main servers goes down
   by 3-4 times. At the same time, the maximum duration of these intervals
   goes down by a factor of 2.5. The factor should be higher with an
   increased number of clients.

2. Since the number of clients is significantly lower, memory consumption
   of the main Sb DB servers also goes down by ~12%.

3. For the 3x3 test, total memory consumed by all processes increased only
   by 6%, and total CPU usage increased by 1.2%.
   Poll intervals on relay servers are comparable to poll intervals on main
   servers with no relays, but poll intervals on main servers are
   significantly better (see conclusion #1). In general, it seems that for
   this test, running with 3 relays next to 3 main Sb DB servers
   significantly increases cluster stability and responsiveness without a
   noticeable increase in memory or CPU usage.

4. For the 3x10 test, total memory consumed by all processes increased by
   ~25-40%*, and total CPU usage increased by 26% compared with the
   baseline setup. At the same time, poll intervals on both main and relay
   servers are lower by a factor of 2-4 (depending on the particular
   server). In general, the cluster with 10 relays is much more stable and
   responsive, with reasonably low memory consumption and CPU time
   overhead.

Patches got accepted and the OVSDB Relay Service Model will be available
as part of the upstream OVS 2.16 release.

(In reply to Ilya Maximets from comment #5)
> Patches got accepted and OVSDB Relay Service Model will be available
> as part of the upstream OVS 2.16 release.

That's great to see, but since this BZ is attached to a customer case, I
feel we should re-open it and see that eventually it goes to a release our
customer will consume. Thoughts?

(In reply to Yaniv Kaul from comment #6)
> (In reply to Ilya Maximets from comment #5)
> > Patches got accepted and OVSDB Relay Service Model will be available
> > as part of the upstream OVS 2.16 release.
>
> That's great to see, but since this BZ is attached to a customer case, I
> feel we should re-open it and see that eventually it goes to a release our
> customer will consume.
> Thoughts?

Not to diminish the severity of this specific customer case, but it was
internal and was stabilized through changes to Neutron rather than OVSDB
itself. The OVN team also suggested that OSP should use RAFT, which would
distribute load to three databases rather than just one in the current
active/backup config.
This bugzilla is about a future enhancement that will be present in OVS
2.16 to allow further scale with RAFT database clusters, but it is not
needed for the attached customer case at this time. Also note that OSP
typically does not use even-numbered OVS releases, but waits for the yearly
odd release; so even though this feature is in OVS 2.16, OSP would not pick
it up until Feb/March next year, if they even use it. That's a very long
time for this bug to sit open when it's been fixed upstream. I would
advocate leaving this bug closed for now.

Thanks Dan for the great explanation!
The Southbound DB in an OVN deployment handles connections from all the
ovn-controllers. While increasing the cluster scale, load on the Sb DB
grows significantly. In order to scale to 1K nodes and higher, we need to
shift some of this load somewhere or increase the raw performance of a
single ovsdb-server.

One of the solutions could be changing the way ovsdb-servers are deployed
in a cluster. If we can create one more layer of servers that will take
care of the clients, we can significantly reduce load on the main Sb DB
and increase the maximum number of clients the system can serve. The
deployment could look like this:

+---------------------------------------------------------+
| RAFT CLUSTER |
| +---------+ ovsdb-server-1 +------+ |
| | | |
| + + |
| ovsdb-server-2 +---------------+ ovsdb-server-3 |
| |
+-----+-------------------+-----------------------------+-+
| | |
+ + +
+-+ ovsdb-relay-1 +-+ ovsdb-relay-2 .... ovsdb-relay-N
| | | | | | | | | | | | |
+ + + +
client-1 client-2 client-3 .... .... .... client-M

In this scenario, the main raft cluster acts as the main "storage". The
ovsdb-relay-N servers are the same ovsdb-server processes, but running in
a replication mode, i.e. they maintain a copy of the data stored in the
main raft cluster. All the "monitor" connections are served by these relay
servers directly. Write transactions are forwarded by relay servers to the
main raft cluster.

A setup like this will allow distributing most of the load to as many
relay servers as needed, and the main Sb DB will only need to send updates
to N replication servers instead of M clients.

ovsdb-server right now supports a replication mode (active-backup, where
the backup never becomes active). Missing parts are:
1. Allow replication of a clustered database (it doesn't support multiple
   remotes for replication and, probably, something else).
2. Allow transaction forwarding to the replication source. Right now,
   ovsdb-server in replication mode works as a read-only database, i.e. it
   supports monitoring, but doesn't forward transactions.
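For contrast, the existing active-backup replication mentioned above is enabled via the pre-existing --sync-from option; a sketch (the file names and address below are hypothetical). The backup serves monitor connections but stays read-only, which is exactly the second missing part:

```shell
# Pre-existing active-backup replication: the backup ovsdb-server keeps
# a read-only copy synced from a single active server.  It serves
# monitor connections, but does not forward write transactions and
# cannot sync from multiple remotes.
ovsdb-server --detach --no-chdir \
    --pidfile=sb-backup.pid --log-file=sb-backup.log \
    --remote=punix:sb-backup.ovsdb \
    --sync-from=tcp:192.168.0.1:6642 \
    sb-backup.db
```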