Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under the FDP project in Jira. Thanks.

Bug 1941615

Summary: [RFE] ovsdb-server: Support 2-Tier deployment (OVSDB Relay Service Model)
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Ilya Maximets <i.maximets>
Component: ovsdb
Assignee: Ilya Maximets <i.maximets>
Status: CLOSED UPSTREAM
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: high
Version: FDP 21.C
CC: atragler, ctrautma, dalvarez, dcbw, dceara, echaudro, jhsiao, jmelvin, qding, ralongi, rsevilla
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-19 13:08:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1941632, 1941646
Bug Blocks:

Description Ilya Maximets 2021-03-22 13:38:01 UTC
The Southbound DB in an OVN deployment handles connections from all the
ovn-controllers.  As the cluster scale increases, the load on the
SB DB grows significantly.

In order to scale to 1K nodes and beyond, we need to shift some of this
load elsewhere or increase the raw performance of a single ovsdb-server.

One possible solution is to change the way ovsdb-servers are deployed in
a cluster.  If we create one more layer of servers that takes care of
the clients, we can significantly reduce the load on the main SB DB and
increase the maximum number of clients the system can serve.

The deployment could look like this:

        +---------------------------------------------------------+
        | RAFT CLUSTER                                            |
        |           +---------+ ovsdb-server-1 +------+           |
        |           |                                 |           |
        |           +                                 +           |
        |     ovsdb-server-2 +---------------+ ovsdb-server-3     |
        |                                                         |
        +-----+-------------------+-----------------------------+-+
          |                       |                             |
          +                       +                             +
  +-+ ovsdb-relay-1 +-+      ovsdb-relay-2     ....        ovsdb-relay-N
  |        |          |      |  |  |  |  |                 |  |  |  |  |
  +        +          +                                                +
client-1 client-2 client-3      ....           ....          ....  client-M

In this scenario the main raft cluster acts as the main "storage".
The ovsdb-relay-N servers are the same ovsdb-server processes, but
running in a replication mode, i.e. they maintain a copy of the data
stored in the main raft cluster.  All the "monitor" connections are
served by these relay servers directly.  Write transactions are
forwarded by the relay servers to the main raft cluster.

A setup like this allows distributing most of the load across as many
relay servers as needed.  And the main SB DB will only need to send
updates to N replication servers instead of M clients.

ovsdb-server currently supports a replication mode (active-backup, where
the backup never becomes active).  The missing parts are:

  1. Allow replication of a clustered database (it doesn't support
     multiple remotes for replication and, probably, something else).

  2. Allow transaction forwarding to the replication source.  Right now
     ovsdb-server in replication mode works as a read-only database, i.e.
     it supports monitoring, but doesn't forward transactions.

Comment 1 Ilya Maximets 2021-04-20 09:47:42 UTC
For implementation sanity purposes there was a slight change of
concept.  Instead of replicating a cluster (i.e. connecting to multiple
remotes and choosing which cluster member to sync from), it's much easier
to replicate one particular server from the ovsdb cluster and let
clients choose which replication server to connect to.

It will look something like this:


        +---------------------------------------------------------+
        | RAFT CLUSTER                                            |
        |           +---------+ ovsdb-server-1 +------+           |
        |           |             +                   |           |
        |           +             |                   +           |
        | +--+ovsdb-server-2 +----|----------+ ovsdb-server-3+--+ |
        | |                       |                             | |
        +-|-----------------------|-----------------------------|-+
          |                       |                             |
          |               +-------+----------+        +---------+
          +               |       +          |        |         +
  +-+ ovsdb-relay-1 +-+   2  ovsdb-relay-3   4 ....  N-1   ovsdb-relay-N
  |        |          |      |  |  |  |  |                 |  |  |  |  |
  +        +          +                                                +
client-1 client-2 client-3      ....           ....          ....  client-M

To achieve that, replication of the internal _Server database needs to be
implemented.  This way clients will know whether the particular replicated
server is a cluster member and whether it's healthy, and can decide to
re-connect to a different replication server.  The one less convenient
part is that each client should have at least one replication server
for each "main" server in its list of remotes for the failover case.
But this deployment schema will allow us to avoid implementing failover
on the replication level.  (It's possible, but would require some
significant re-work if we want to avoid a lot of code duplication.  It's
much easier to use the existing client-side failover mechanism for now.)
Also, there will be fewer re-connections to the main cluster, because
replication servers will never reconnect due to raft issues.  And this
should be good for cluster stability.

BZ 1941632 updated accordingly.

Comment 2 Ilya Maximets 2021-06-02 16:33:40 UTC
OK.  We're back to the original design from the description :)

        +---------------------------------------------------------+
        | RAFT CLUSTER                                            |
        |           +---------+ ovsdb-server-1 +------+           |
        |           |                                 |           |
        |           +                                 +           |
        |     ovsdb-server-2 +---------------+ ovsdb-server-3     |
        |                                                         |
        +-----+-------------------+-----------------------------+-+
          |                       |                             |
          +                       +                             +
  +-+ ovsdb-relay-1 +-+      ovsdb-relay-2     ....        ovsdb-relay-N
  |        |          |      |  |  |  |  |                 |  |  |  |  |
  +        +          +                                                +
client-1 client-2 client-3      ....           ....          ....  client-M

I don't really like the concept from comment #1, and the implementation
is not that pretty.  So, I returned to the original concept of the
architecture, but with a different implementation approach.

The new implementation is not based on the existing active-backup mode
replication.  Instead, I'm implementing a separate operation mode named
"relay" for the OVS database.  The implementation is based on the ovsdb
client synchronization library, which is the same library that the ovsdb
IDL uses, i.e. the same state machine used by ovsdb clients.  It natively
supports higher versions of monitor requests, including monitor_cond_since
and update3 notifications, and it also natively supports multiple remotes
and automatic re-connection.  The new ovsdb relay mode will not have file
storage, i.e. it will be fully in memory.  To activate it, instead of
passing a database file name to ovsdb-server, the user will need to use
a string in the following format:

  relay:<database name>:<list of remotes>

For example, to start a relay database for a clustered Southbound database
in the OVN sandbox, ovsdb-server should be started with the following arguments:

  make -j8 sandbox SANDBOXFLAGS="--sbdb-model=clustered"

  ovsdb-server --detach --no-chdir                            \
     --pidfile=sb-relay.pid --log-file=sb-relay.log           \
     --remote=db:OVN_Southbound,SB_Global,connections         \
     --private-key=db:OVN_Southbound,SSL,private_key          \
     --certificate=db:OVN_Southbound,SSL,certificate          \
     --ca-cert=db:OVN_Southbound,SSL,ca_cert                  \
     --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols      \
     --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers          \
     --unixctl=sb-relay --remote=punix:sb-relay.ovsdb         \
     relay:OVN_Southbound:unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb

The only difference from a normal database start is that instead
of passing a database file name like "sb3.db" or "sb1.db", we start
the relay database like this:

   relay:OVN_Southbound:unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb

Where "relay" is a keyword, "OVN_Southbound" is the database name,
and "unix:sb1.ovsdb,unix:sb2.ovsdb,unix:sb3.ovsdb" is the list of
remotes from which this database should be synchronized.  For TCP-based
connections it may look like this:

   relay:OVN_Southbound:tcp:192.168.0.1:6642,tcp:192.168.0.2:6642,tcp:192.168.0.3:6642
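
As a hedged illustration of how this spec string decomposes (a Python sketch, not the actual OVS parser; it assumes the database name itself contains no colon): only the first two colons act as field separators, because the remotes themselves contain colons.

```python
# Sketch: decompose a relay spec string into its three fields.
# The remotes contain colons ("unix:...", "tcp:host:port"), so only
# the first two ":" separators delimit fields; commas split remotes.
spec = "relay:OVN_Southbound:tcp:192.168.0.1:6642,tcp:192.168.0.2:6642,tcp:192.168.0.3:6642"

keyword, dbname, remotes = spec.split(":", 2)
remote_list = remotes.split(",")

print(keyword)           # relay
print(dbname)            # OVN_Southbound
print(len(remote_list))  # 3
```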

An ovsdb-server in relay mode executes all the "read" requests by itself
and forwards all the "write" requests to the database source.
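
This read/write split can be sketched with a toy Python model (purely conceptual; the SourceDB/Relay classes are invented for illustration and are not OVS code):

```python
class SourceDB:
    """Stands in for the main raft cluster (the database source)."""
    def __init__(self):
        self.data = {}
        self.relays = []

    def transact(self, key, value):
        # Commit the write, then push the update out to every attached
        # relay, the way monitor updates flow from source to relays.
        self.data[key] = value
        for relay in self.relays:
            relay.on_update(key, value)

class Relay:
    """Stands in for an ovsdb-server in relay mode: in-memory copy only."""
    def __init__(self, source):
        self.source = source
        self.replica = dict(source.data)  # no file storage, pure memory
        source.relays.append(self)

    def on_update(self, key, value):
        self.replica[key] = value         # update received from the source

    def read(self, key):
        return self.replica.get(key)      # served locally, no load on source

    def write(self, key, value):
        self.source.transact(key, value)  # forwarded to the source

source = SourceDB()
relay = Relay(source)
relay.write("lsp-1", "up")   # "write" forwarded; update pushed back down
print(relay.read("lsp-1"))   # "read" answered from the local replica: up
```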

It should not be used for leader-only connections, which means it
should not be used by ovn-northd.  northd should connect to the
original database only.

Here is an openvswitch RPM scratch build with the above functionality
implemented:
   http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1941615.0.1.el8fdp/

Meanwhile, I'm working on finalizing the code and preparing a new
version of the patch-set for the upstream mailing list.

Comment 3 Ilya Maximets 2021-06-23 14:23:41 UTC
v2 with a full implementation of OVSDB Relay was posted upstream:
  https://patchwork.ozlabs.org/project/openvswitch/list/?series=248474&state=*

Here is a new build with it as it seems that the previous one expired:
  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1941615.0.10.el8fdp/

Comment 4 Ilya Maximets 2021-07-14 20:26:22 UTC
v3 of the OVSDB Relay patch-set is posted upstream, including the
results of tests that I ran in our scale lab with ovn-heater:
  https://patchwork.ozlabs.org/project/openvswitch/list/?series=253496&state=*

Cover letter:
  https://patchwork.ozlabs.org/project/openvswitch/cover/20210714135023.373838-1-i.maximets@ovn.org/

Repeating the test results here for better visibility:

 Testing
 =======

Some scale tests were performed with OVSDB Relays, mimicking OVN
workloads with ovn-kubernetes.
Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
on the ocp-120-density-heavy scenario:
 https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
In short, the test gradually creates a lot of OVN resources and
checks that the network is configured correctly (by pinging different
namespaces).  The test includes 120 chassis (created by
ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, and 3 LBs
with 15625 VIPs each, attached to all node LSes.  Tests were performed
with monitor-all=true.

Note 1:
 - Memory consumption is checked at the end of a test in the following
   way: 1) check RSS, 2) compact the database, 3) check RSS again.
   It's observed that the ovn-controllers in this test are fairly slow,
   and a backlog builds up on monitors because the ovn-controllers are
   not able to receive updates fast enough.  This contributes to the
   RSS of the process, especially in combination with a glibc bug (glibc
   doesn't free fastbins back to the system).  Memory trimming on
   compaction is enabled in the test, so after compaction we can
   see a more or less real value of the RSS at the end of the test
   without the backlog noise.  (Compaction on a relay in this case is
   just a plain malloc_trim().)
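
For reference, malloc_trim() is the glibc call that returns free heap memory to the OS.  A minimal illustration of the effect (Python via ctypes, glibc/Linux only; ovsdb-server of course calls it from C):

```python
import ctypes

# glibc's malloc_trim(pad): releases free memory from the top of the heap
# (and, in modern glibc, free chunks in other arenas) back to the OS.
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim.argtypes = [ctypes.c_size_t]
libc.malloc_trim.restype = ctypes.c_int

blob = bytearray(64 * 1024 * 1024)  # create some heap usage...
del blob                            # ...and free it

rc = libc.malloc_trim(0)  # returns 1 if memory was released, 0 otherwise
print(rc in (0, 1))       # True
```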

Note 2:
 - I didn't collect memory consumption (RSS) after compaction for the
   test with 10 relays, because I got the idea only after the test
   finished and another one had already started.  And a run takes a
   significant amount of time.  So, values marked with a star (*)
   are an approximation based on results from other tests and hence
   might not be fully correct.

Note 3:
 - 'Max. poll' is the maximum of the 'long poll intervals' logged by
   ovsdb-server during the test.  Poll intervals that involved database
   compaction (huge disk writes) are the same in all tests and are
   excluded from the results.  (The Sb DB size in the test is 256MB,
   fully compacted.)  'Number of intervals' is just the number of
   logged unreasonably long poll intervals.
   Also note that ovsdb-server logs only compactions that took > 1s,
   so poll intervals that involved compaction but took under 1s cannot
   be reliably excluded from the test results.
   'central' - main Sb DB servers.
   'relay'   - relay servers connected to the central ones.
   'before'/'after' - RSS before and after compaction + malloc_trim().
   'time' - total time the process spent in the Running state.


Baseline (3 main servers, 0 relays):
++++++++++++++++++++++++++++++++++++++++

               RSS
central  before    after    clients  time     Max. poll   Number of intervals
         7552924   3828848   ~41     109:50   5882        1249
         7342468   4109576   ~43     108:37   5717        1169
         5886260   4109496   ~39      96:31   4990        1233
         ---------------------------------------------------------------------
             20G       12G   126     314:58   5882        3651

3x3 (3 main servers, 3 relays):
+++++++++++++++++++++++++++++++

                RSS
central  before    after    clients  time     Max. poll   Number of intervals
         6228176   3542164   ~1-5    36:53    2174        358
         5723920   3570616   ~1-5    24:03    2205        382
         5825420   3490840   ~1-5    35:42    2214        309
         ---------------------------------------------------------------------
           17.7G     10.6G      9    96:38    2214        1049

relay    before    after    clients  time     Max. poll   Number of intervals
         2174328    726576    37     69:44    5216        627
         2122144    729640    32     63:52    4767        625
         2824160    751384    51     89:09    5980        627
         ---------------------------------------------------------------------
              7G      2.2G    120   222:45    5980        1879

Total:   =====================================================================
           24.7G     12.8G    129    319:23   5980        2928

3x10 (3 main servers, 10 relays):
+++++++++++++++++++++++++++++++++

               RSS
central  before    after    clients  time    Max. poll   Number of intervals
         6190892    ---      ~1-6    42:43   2041         634
         5687576    ---      ~1-5    27:09   2503         405
         5958432    ---      ~1-7    40:44   2193         450
         ---------------------------------------------------------------------
           17.8G   ~10G*       16   110:36   2503         1489

relay    before    after    clients  time    Max. poll   Number of intervals
         1331256    ---       9      22:58   1327         140
         1218288    ---      13      28:28   1840         621
         1507644    ---      19      41:44   2869         623
         1257692    ---      12      27:40   1532         517
         1125368    ---       9      22:23   1148         105
         1380664    ---      16      35:04   2422         619
         1087248    ---       6      18:18   1038           6
         1277484    ---      14      34:02   2392         616
         1209936    ---      10      25:31   1603         451
         1293092    ---      12      29:03   2071         621
         ---------------------------------------------------------------------
           12.6G    5-7G*    120    285:11   2869         4319

Total:   =====================================================================
           30.4G    15-17G*  136    395:47   2869         5808


 Conclusions from the test:
 ==========================

1. Relays relieve a lot of pressure from the main Sb DB servers.
   In my testing, total CPU time on the main servers goes down from 314
   to 96-110 minutes, which is 3 times lower.
   During the test, the number of registered 'unreasonably long poll
   interval's on the main servers goes down by 3-4 times.  At the same
   time, the maximum duration of these intervals goes down by a factor
   of 2.5.  The factor should be higher with a larger number of clients.

2. Since the number of clients is significantly lower, memory consumption
   of the main Sb DB servers also goes down by ~12%.

3. For the 3x3 test, total memory consumed by all processes increased
   only by 6%, and total CPU usage increased by 1.2%.  Poll intervals
   on the relay servers are comparable to poll intervals on the main
   servers with no relays, but poll intervals on the main servers are
   significantly better (see conclusion #1).  In general, it seems that
   for this test, running 3 relays next to the 3 main Sb DB servers
   significantly increases cluster stability and responsiveness without
   a noticeable increase in memory or CPU usage.

4. For the 3x10 test, total memory consumed by all processes increased
   by ~25-40%*, and total CPU usage increased by 26% compared with the
   baseline setup.  At the same time, poll intervals on both main
   and relay servers are lower by a factor of 2-4 (depending on the
   particular server).  In general, the cluster with 10 relays is much
   more stable and responsive, with reasonably low memory consumption
   and CPU time overhead.

Comment 5 Ilya Maximets 2021-07-19 13:08:45 UTC
Patches got accepted and OVSDB Relay Service Model will be available
as part of the upstream OVS 2.16 release.

Comment 6 Yaniv Kaul 2021-07-28 14:27:43 UTC
(In reply to Ilya Maximets from comment #5)
> Patches got accepted and OVSDB Relay Service Model will be available
> as part of the upstream OVS 2.16 release.

That's great to see, but since this BZ is attached to a customer case, I feel we should re-open it and see that it eventually makes it into a release our customer will consume.
Thoughts?

Comment 7 Dan Williams 2021-07-28 19:47:44 UTC
(In reply to Yaniv Kaul from comment #6)
> (In reply to Ilya Maximets from comment #5)
> > Patches got accepted and OVSDB Relay Service Model will be available
> > as part of the upstream OVS 2.16 release.
> 
> That's great to see, but since this BZ is attached to a customer case, I
> feel we should re-open it and see that eventually it goes to a release our
> customer will consume.
> Thoughts?

Not to diminish the severity of this specific customer case, but it was internal and was stabilized through changes to Neutron rather than OVSDB itself. The OVN team also suggested that OSP should use RAFT, which would distribute load across three databases rather than just one in the current active/backup config. This bugzilla is about a future enhancement that will be present in OVS 2.16 to allow further scale with RAFT database clusters, but it is not needed for the attached customer case at this time. Also note that OSP typically does not use even-numbered OVS releases, but waits for the yearly odd release; so even though this feature is in OVS 2.16, OSP would not pick it up until Feb/March next year, if they even use it. That's a very long time for this bug to sit open when it's been fixed upstream.

I would advocate leaving this bug closed for now.

Comment 8 Yaniv Kaul 2021-07-29 09:40:32 UTC
Thanks Dan for the great explanation!