Bug 1906940 - [OVN SCALE] Use column diffs for ovsdb and raft log entries
Summary: [OVN SCALE] Use column diffs for ovsdb and raft log entries
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.15
Version: FDP 20.I
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: FDP 21.B
Assignee: Ilya Maximets
QA Contact: Zhiqiang Fang
URL:
Whiteboard:
Depends On:
Blocks: 1908916
 
Reported: 2020-12-11 21:12 UTC by Ilya Maximets
Modified: 2021-03-15 14:34 UTC
CC: 11 users

Fixed In Version: openvswitch2.15-2.15.0-1.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 14:33:59 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2021:0838 (last updated 2021-03-15 14:34:08 UTC)

Internal Links: 1917979

Description Ilya Maximets 2020-12-11 21:12:44 UTC
This BZ is to track efforts for the following ovsdb change:
https://patchwork.ozlabs.org/project/openvswitch/patch/20201211205447.3874314-1-i.maximets@ovn.org/

This patch could significantly reduce the memory consumption of ovsdb-server
processes across the raft cluster and the size of database files.  It might
also improve performance.

Here is the problem description and some test results from the commit message:

Currently, ovsdb-server stores the complete value of a column in the
database file and in the raft log whenever that column changes.  This
means that a transaction that adds, for example, one new acl to a port
group creates a log entry with the UUIDs of all existing acls plus the
new one.  The same applies to ports in logical switches and routers
and to other set-valued columns in the Northbound DB.

There can be thousands of acls in one port group or thousands of ports
in a single logical switch, and the typical use case is to add one new
entry when starting a new service/VM/container or adding one new node
to a Kubernetes or OpenStack cluster.  This generates a huge amount of
traffic within the ovsdb raft cluster, increases overall memory
consumption and hurts performance, since all these UUIDs are parsed
and formatted to/from JSON several times and stored on disk.  The more
values a set holds, the more space a single log entry occupies and the
longer it takes ovsdb-server cluster members to process it.
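
To make this concrete, here is a rough sketch of one such record in a
standalone database file (raft log entries carry a similar data
payload).  The layout and UUIDs below are illustrative only, not
copied from a real file:

   OVSDB JSON 512 <sha-1 checksum of the record>
   {"_date": 1607720000000,
    "Port_Group": {
      "<row-uuid>": {
        "acls": ["set", [["uuid", "<acl-1>"], ["uuid", "<acl-2>"],
                         ... all 4000 existing acl UUIDs ...,
                         ["uuid", "<the-new-acl>"]]]}}}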

Simple test:

1. Start OVN sandbox with clustered DBs:
   # make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

2. Run a script that creates one port group and adds 4000 acls to it:
   # cat ../memory-test.sh
   pg_name=my_port_group
   export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
   ovn-nbctl pg-add $pg_name
   for i in $(seq 1 4000); do
     echo "Iteration: $i"
     ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
   done
   ovn-nbctl acl-del $pg_name
   ovn-nbctl pg-del $pg_name
   ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
   ovn-appctl -t ovn-nbctl exit
   ---

3. Check the current memory consumption of ovsdb-server processes and
   the space occupied by the database files:
   # ls sandbox/[ns]b*.db -alh
   # ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid'
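
As a convenience (not part of the original test), the total RSS of the
Nb ovsdb-servers can be summed with a small pipeline:

   # RSS values reported by ps are in KB; print the total in MB.
   ps -eo rss,cmd | egrep '=nb[123]\.pid' | \
       awk '{ sum += $1 } END { printf "Total Nb RSS: %.1f MB\n", sum / 1024 }'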

Test results with current ovsdb log format:

   On-disk Nb DB size     :  ~369 MB
   RSS of Nb ovsdb-servers:  ~2.7 GB
   Time to finish the test:  ~2m

In order to mitigate the memory consumption issues and reduce the
computational load on ovsdb-servers, let's store the diff between the
old and new values instead.  This makes the size of each log entry
that adds a single acl to a port group (or a port to a logical switch,
or anything else like that) very small and independent of the number
of already existing acls (ports, etc.).

A new marker, '_is_diff', is added to a file transaction to specify
that the transaction contains diffs instead of replacements for the
existing data.
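
With the change applied, the same "add one acl" transaction would be
recorded roughly as below.  Again an illustrative sketch; for sets the
diff is applied as a symmetric difference, so a one-element diff
simply adds that element if it is not already present:

   {"_is_diff": true,
    "_date": 1607720000000,
    "Port_Group": {
      "<row-uuid>": {
        "acls": ["set", [["uuid", "<the-new-acl>"]]]}}}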

One side effect is that this change actually increases the size of a
file transaction that removes more than half of the entries from a
set, because the diff is larger than the resulting new value: removing
3000 of 4000 entries yields a 3000-element diff, while the full new
value would hold only 1000.  However, such operations are rare.

Test results with change applied:

   On-disk Nb DB size     :  ~2.7 MB  ---> reduced by 99%
   RSS of Nb ovsdb-servers:  ~580 MB  ---> reduced by 78%
   Time to finish the test:  ~1m27s   ---> reduced by 27%


Attention:

After this change new ovsdb-server is still able to read old databases,
but old ovsdb-server will not be able to read new ones.
Since new servers could join ovsdb cluster dynamically it's hard to
implement any runtime mechanism to handle cases where different
versions of ovsdb-server joins the cluster.  However we still need to
handle cluster upgrades.  For this case added special command line
argument to disable new functionality.  Documentation updated with the
recommended way to upgrade the ovsdb cluster.
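
The text above doesn't name the argument; assuming it is the
--disable-file-column-diff option from the upstream patch, the upgrade
procedure sketches out as:

   # 1. While old and new servers coexist, run upgraded members with
   #    column diffs disabled so old members can still read the log:
   ovsdb-server --disable-file-column-diff <other options...>
   # 2. Once every cluster member runs the new version, restart them
   #    without the flag to begin writing column diffs.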

Comment 1 Ilya Maximets 2020-12-15 18:08:01 UTC
I prepared a scratch build of the openvswitch package with this patch applied.
Available here:
 https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33760390
or
 http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/78.el8fdp/x86_64/


@Dan, can we ask someone from the perf scale team to test?

The main thing to check is the memory consumption graph for the
northbound and southbound databases during the scale test, and also
the size of the on-disk database files.  Overall performance is
interesting as usual.
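
One minimal way to capture that graph is to sample RSS periodically
during the run (a hypothetical helper; adjust the process match to the
deployment):

   # Append a timestamped total ovsdb-server RSS (in KB) every 10 seconds.
   while sleep 10; do
       echo "$(date +%s) $(ps -eo rss,cmd | grep '[o]vsdb-server' \
           | awk '{ s += $1 } END { print s }')"
   done >> ovsdb-rss.log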

Comment 2 Raul Sevilla 2020-12-17 09:52:37 UTC
After running some perf-scale tests using the RPMs provided by Ilya, I can confirm a big improvement in memory usage for the NBDB and SBDB components.

These are the results I got after triggering 2k iterations of the cluster-density workload on a 125-node cluster:

- The three OVN SBDB containers ended up with an RSS memory usage of ~1.56 GiB, with some periodic memory spikes reaching ~2.4 GiB; my suspicion is that these spikes are directly related to the ovsdb compaction process.
- The NBDB containers, on the other hand, ended up with a usage of ~200 MiB, reaching a maximum of 254 MiB.

Compared with previous results, these numbers represent a reduction of up to ~40% in memory usage for the SBDB and more than 90% for the NBDB.
It's also important to note a ~40% reduction in the I/O activity (disk write throughput) generated by these containers, which matters because these components are colocated with etcd by default.


All tests were run after enabling memory trimming and compacting the databases manually with:

```
for p in $(oc get -n openshift-ovn-kubernetes pod -o name -l app=ovnkube-master); do
  oc exec -c nbdb $p -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact
  oc exec -c nbdb $p -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/memory-trim-on-compaction on
  oc exec -c sbdb $p -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
  oc exec -c sbdb $p -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/memory-trim-on-compaction on
done
```
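
To double-check the on-disk database sizes after compaction, something
like the following can be used as well (the in-container database
paths are my assumption and may differ per deployment):

```
for p in $(oc get -n openshift-ovn-kubernetes pod -o name -l app=ovnkube-master); do
  oc exec -c nbdb $p -- du -h /etc/ovn/ovnnb_db.db
  oc exec -c sbdb $p -- du -h /etc/ovn/ovnsb_db.db
done
```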

PS: The workload finished correctly and I didn't observe any regression.

Comment 4 Ilya Maximets 2021-01-19 17:36:00 UTC
The patch was accepted upstream and will be part of the openvswitch 2.15 release.
http://patchwork.ozlabs.org/project/openvswitch/patch/20210114011121.4140984-1-i.maximets@ovn.org/

Comment 5 Ilya Maximets 2021-01-26 19:24:42 UTC
Changing the component to openvswitch since the implementation belongs there.

Comment 14 errata-xmlrpc 2021-03-15 14:33:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new package: openvswitch2.15), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:0838

