Bug 1888829 - NBDB memory increases to around 18 GB, northd CPU pegged at 100%
Summary: NBDB memory increases to around 18 GB, northd CPU pegged at 100%
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.13
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Open vSwitch development team
QA Contact: ovs-qe
URL:
Whiteboard: aos-scalability-46
Depends On:
Blocks: 1937118 1908774
Reported: 2020-10-15 20:42 UTC by Tim Rozet
Modified: 2021-03-29 01:47 UTC
CC: 16 users

Fixed In Version: openvswitch2.13-2.13.0-61.el7fdp openvswitch2.13-2.13.0-70.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-04 22:29:08 UTC
Target Upstream Version:


Attachments
dbs, logs (4.40 MB, application/gzip)
2020-10-15 20:43 UTC, Tim Rozet
previous nbdb log (993.15 KB, text/plain)
2020-10-19 19:20 UTC, Tim Rozet

Description Tim Rozet 2020-10-15 20:42:13 UTC
Description of problem:
While investigating bug 1887585 I can see that NBDB memory is extremely high on every node (between 17-20 GB), while northd is at 100% CPU on every node. The real on-disk size of the nbdb is only several MB.

After triggering core dumps by killing northd and nbdb, the new instances use less than 1 MB of memory and 0% CPU.

Will attach the logs and dbs. I also have coredumps but they are too big to attach to bz.

Version-Release number of selected component (if applicable):

[root@ip-10-0-159-165 ~]# rpm -qa |grep ovn
ovn2.13-20.06.2-11.el8fdp.x86_64
ovn2.13-host-20.06.2-11.el8fdp.x86_64
ovn2.13-central-20.06.2-11.el8fdp.x86_64
ovn2.13-vtep-20.06.2-11.el8fdp.x86_64
[root@ip-10-0-159-165 ~]# rpm -qa |grep openvswitch
openvswitch2.13-2.13.0-57.el8fdp.x86_64
openvswitch2.13-devel-2.13.0-57.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch

Comment 1 Tim Rozet 2020-10-15 20:43:11 UTC
Created attachment 1721906 [details]
dbs, logs

Comment 2 Ilya Maximets 2020-10-16 00:51:22 UTC
Looking at the logs, the issue is similar to what we fixed in
https://bugzilla.redhat.com/show_bug.cgi?id=1834838.  The fix is working,
and ovsdb-server actually drops a lot of duplicated requests to install
exactly the same snapshots, but one of the NB DB servers is not doing well
and the backlog is still growing.  I'm not sure what these raft messages
are exactly, but at some point the sending backlog on one of the raft
connections exceeded 61 GB:

2020-10-15T11:40:12Z|05907|jsonrpc|INFO|Dropped 874 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2020-10-15T11:40:12Z|05908|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 15370, backlog: 61731060773.

According to the above statistics, that's ~3.8 MB per message, which looks
suspiciously close to the size of the NB DB in this test, so these might
still be snapshots.  Also, at the point where ovsdb-server starts to
frequently report installation of the exact same snapshots, the backlog
quickly starts to go down:

2020-10-15T11:53:49Z|06276|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T11:53:51Z|06278|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 7679, backlog: 22194409937.
...
2020-10-15T12:00:49Z|06416|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T12:01:01Z|06420|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 2637, backlog: 507864302.

So I guess it's almost the same issue as in 1834838, but for some reason
we cannot detect that the snapshots are equal, or the snapshots are
actually slightly different.

We need to investigate whether the snapshots could be detected as different
(or whether they are actually different).  Another temporary solution, and
an emergency measure for similar issues in the future, might be to introduce
a threshold and drop the connection if the backlog exceeds it, to avoid
memory exhaustion.
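The proposed safeguard amounts to a simple two-threshold check. As an illustration only (the real check lives in ovsdb-server's C code; 'should_drop' is a hypothetical name, and the 500-message / 4 GB values are the defaults that comment #16 later settles on):

```shell
# Illustrative sketch only, not the actual ovsdb-server implementation.
should_drop() {
    # $1: queued messages on the raft connection, $2: queued bytes.
    # Defaults from comment #16: 500 messages or 4 GB of backlog.
    [ "$1" -gt 500 ] || [ "$2" -gt 4294967296 ]
}

# The backlog from the log above (15370 msgs, ~61 GB) would trip it:
if should_drop 15370 61731060773; then
    echo "drop and re-establish connection"
fi
```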

Unfortunately, the NB DB server that caused the excessive sending backlog on
the 2 other servers was restarted, and there are no interesting logs for this
machine in the attached archive.
Tim, is it possible to get these logs somehow?  It would be good to look at
what happened to that server that made it stop processing requests in time.

Comment 3 Tim Rozet 2020-10-19 19:19:39 UTC
Was able to get the old container log. Will attach.

Comment 4 Tim Rozet 2020-10-19 19:20:16 UTC
Created attachment 1722679 [details]
previous nbdb log

Comment 5 Ilya Maximets 2020-10-21 10:18:22 UTC
I prepared a couple of patches for the raft implementation in ovsdb-server
that should prevent the backlog on raft connections from growing.  Scratch
builds for el8 and el7 are available here:
  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/67.el8fdp/
  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/57.el7fdp/
  
Tim, could you please test with them?

I didn't send these patches upstream yet because I'm not 100% sure that all
of them are necessary, but they include some more logging, so we will be
able to see a clearer picture of what is going on.

In short, what I did:
1. Limited the number of 'install_snapshot_request's to one at a time.
2. Fixed an overflow of the 'raft-backlog' counter in the memory report.
   (this one I already sent upstream)
3. Introduced hard limits on the jsonrpc backlog for raft connections.
   For now it's limited to 100 messages and 2 GB of backlog, but I plan
   to make it configurable if it proves effective.
   Once the backlog limit is reached, the connection will be dropped and
   re-established, clearing all the remaining backlog.

All of the above changes target the excessive backlog issue; however,
looking closely at the logs, I see that there is also a set of various
issues that are likely connected with each other.  So fixing the backlogs
alone might not be enough.

A few observations:
1. Sometimes a single polling interval takes > 15 seconds.  This means
   there are requests that take that much time.  But the election timer for
   the NB DB is set to 10 seconds.  This leads to leader changes and
   attempts by the server that processed such a request to catch up.  It
   might make sense to increase the election timer to at least 16 seconds,
   as is done for the Southbound DB.
2. Looking at the SB DB logs from the must-gather collected for 1887585, it
   looks like the SB DB itself is very big.  It contains ~11M cells.  So
   here, I suppose, we have a lot of reject ACLs.  It might make sense to
   try the solution from https://bugzilla.redhat.com/show_bug.cgi?id=1859924 .
3. In the same must-gather archive there is a suspicious repeating log from
   ovn-nbctl daemons:
     |stream_ssl|ERR|SSL_use_PrivateKey_file: error:02001002:system library:fopen:No such file or directory
     |stream_ssl|ERR|SSL_use_certificate_file: error:20074002:BIO routines:file_ctrl:system lib
     |stream_ssl|ERR|Certificate must be configured to use SSL
     |stream_ssl|ERR|Private key must be configured to use SSL
   The ovn-nbctl daemon successfully started on only one of the 3 master
   nodes.  I'm not sure if this is the issue, but it looks suspicious.
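Observation 1 can be addressed with raft's built-in appctl; a minimal sketch, assuming a typical OVN central-node socket path (the command must be run against the cluster leader, and raft only allows roughly doubling the timer per call):

```shell
# Raise the NB DB election timer from 10 s to 16 s (value in ms).
# The socket path below is an assumption; adjust to your deployment.
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl \
    cluster/change-election-timer OVN_Northbound 16000
```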

Comment 6 Tim Rozet 2020-10-21 15:09:24 UTC
Thanks Ilya. I also suspect the service reject ACLs are what triggered this problem:
https://bugzilla.redhat.com/show_bug.cgi?id=1887585#c9

The fixes for reject ACL in NB are not downstream yet. I think we should try the original scale test that triggered this issue with your new RPM. That way we don't have too many variables and can see if your fix in this RPM has an impact. I'll work with Raul to run the test and report back.

Comment 10 Ilya Maximets 2020-10-22 12:22:33 UTC
I looked through the new logs and I see that the scale of this test is not
enough to trigger the backlog issues, so we're not able to see if any of my
changes have any effect.

Is it possible to run with the original scale?

As for the memory consumption that doesn't go away after removing pods and
services, I'm trying to craft a minimal test with just an OVN sandbox to be
able to profile/debug locally.

Comment 11 Ilya Maximets 2020-10-22 15:19:48 UTC
I prepared the following test with ovn-sandbox:

 make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
 # cat ../memory-test.sh 
 pg_name=my_port_group_$1
 ovs-appctl -t $(pwd)/sandbox/nb1 vlog/set raft:dbg
 ovs-appctl -t $(pwd)/sandbox/nb2 vlog/set raft:dbg
 ovs-appctl -t $(pwd)/sandbox/nb3 vlog/set raft:dbg
 export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
 ovn-nbctl pg-add $pg_name
 ovn-nbctl list port-group
 for i in $(seq 1 4000); do
     date
     echo "Iteration: $i"
     ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
     ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
 done
 ovn-nbctl acl-del $pg_name
 ovn-nbctl pg-del $pg_name
 ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
 ovn-appctl -t ovn-nbctl exit

It just creates a single port group and adds 4000 ACLs, then removes all
the ACLs and the port group.

I can confirm that after a single run memory grows to 6-7 GB and nbdb
doesn't free this memory after removing everything from the database.
However, if I run the same test a second time, memory consumption stays the
same, i.e. nbdb reuses the memory already allocated during the first run.
OTOH, it seems that wasn't the case for the test performed in
https://bugzilla.redhat.com/show_bug.cgi?id=1888829#c7
i.e. nbdb doesn't seem to reuse memory in that test.  Looking further into this.

Comment 12 Ilya Maximets 2020-10-22 16:34:20 UTC
(In reply to Ilya Maximets from comment #11)
> I can confirm that after a single run memory grows up to 6-7GB and nbdb
> doesn't
> free this memory after removing everything from the database.

One correction: this test was performed under a memory sanitizer, so the
real memory consumption without it would be 2-3 times lower.

Comment 13 Ilya Maximets 2020-10-22 21:19:35 UTC
OK.  I found out what is going on with regard to memory not being freed/reused.

The RAFT implementation inside ovsdb-server stores a log of operations.  This
is required to keep track of which operations are in flight and which could
be committed.  This is a basic element of RAFT.

During compaction, all committed log entries are merged into a base snapshot.
However, compaction happens automatically only once every 10 minutes.  During
this 10-minute period all the transactions accumulate in the log.  That could
be tens of thousands of transactions, each of which could be fairly large.
E.g. adding a single port to a port group that already contains 1000 ports
will take ~100 KB in JSON format.  10K such transactions will take ~1 GB in
JSON format, and more in the parsed representation inside ovsdb.
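The arithmetic above can be checked directly with shell integer math (the ~100 KB-per-transaction figure is the estimate from the text, not a measured value):

```shell
# ~100 KB of JSON per port-group update, 10K transactions accumulated
# between automatic compactions (one every 10 minutes):
per_txn_kb=100
txns=10000
total_mb=$((per_txn_kb * txns / 1024))
echo "${total_mb} MB"    # prints "976 MB", i.e. roughly 1 GB of raft log
```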

During density tests, I suppose, a lot of DB transactions are performed
within 10 minutes.  This grows the raft log significantly.  Note that a
record removal is a DB transaction too and ends up in the raft log, i.e.
removing values from the database doesn't decrease memory consumption.

What could decrease memory consumption is compaction.  But why doesn't it
help?  The reason is the implementation of memory management in C.
ovsdb-server actually calls free() on every raft log entry during compaction,
but libc doesn't return this memory to the system.  That is just how it
works.  libc is able to return memory to the OS, but in the common case the
application doesn't have control over this behavior.

In the test from comment #7, two test scenarios were performed one after
another, presumably without compaction in between.  That is why memory was
not reused; it was still inside the raft log.  If compaction had happened
between the tests (automatically or manually), the same memory would have
been reused without growing total memory consumption, because this memory
is free and available for allocation, just not given back to the OS.


What we can do if we really want to limit memory consumption in this case:

1. Limit the maximum number of entries in the raft log and trigger
   compaction once the threshold is reached.  This will ensure that total
   memory consumption stays within some sane boundaries; however, frequent
   compactions under heavy load might slow down processing and introduce
   performance spikes that will probably lead to leadership changes or
   other disturbances inside the cluster.
   
2. We could keep allowing raft logs to grow, but introduce a special appctl
   command to call the malloc_trim() function.  This function tries to
   return all the unused memory from the heap back to the OS.  It's
   glibc-specific.  This command (e.g. 'ovsdb-server/return-unused-memory')
   could be called after 'ovsdb-server/compact', or periodically by the CMS
   if it thinks that ovsdb-server eats too much memory.  I tested this
   function, and it is actually able to return a lot of memory back to the
   system after compaction.
   
Which of these would be better and more usable?
Tim, what do you think?


P.S. This will not resolve the issue with huge backlogs on raft connections.
     So I'd still like to have testing results for the builds from comment #5
     at the original scale level, e.g. similar to what is described here:
     https://bugzilla.redhat.com/show_bug.cgi?id=1887585#c0

Comment 14 Tim Rozet 2020-10-22 22:05:09 UTC
Excellent analysis Ilya! I have the same concern as you that #1 would get us in trouble with perf issues. I think it's reasonable for the CMS to monitor the memory and, if it sees it over a threshold, manually trigger a compaction and then return-unused-memory. How would we know the last time it was compacted, or when it finished? Maybe we can work together to iron out the details of how this would work.

For the backlog on the raft connections...Raul could you run with the higher scale number that Joe used so we can trigger the backlogs?

We can also try pulling in all the fixes (lower the number of ports per port group, reject ACLs on port groups) and try to ensure we don't hit any scale/perf issues.

Comment 15 Ilya Maximets 2020-10-23 14:22:17 UTC
I have a couple of options for how to implement memory trimming:

1. Introduce a separate appctl like 'ovsdb-server/return-unused-memory'.
   This needs to be called manually/by the CMS, but it will be possible to
   use it separately from compaction.

2. Add a flag to the current 'compact' appctl,
   e.g. 'ovsdb-server/compact --trim-memory'.
   This will trigger malloc_trim() if compaction is triggered manually/by
   the CMS.  It will not be possible to trim memory without compaction, or
   if manual compaction is not possible at the moment.

3. Add a configuration appctl like
   'ovsdb-server/memory-trim-on-compaction (on|off)'.
   Once enabled, each compaction (manual or automatic) will call
   malloc_trim(), trying to reclaim all the unused memory back to the system.

Tim, which of the above looks easier to use or more suitable?

Meanwhile, I'll start implementing option 3, but there is no big
implementation difference between the options, so it could easily be changed.


For the question "how would we know the last time it was compacted or when
it was finished": I'm adding a 'raft-log-len' counter to the memory report
('memory/show') that could be used to see how many log entries could be
trimmed off by taking a snapshot/compacting.  That might be a data point
for the CMS to decide whether a manual compaction is needed.
We could also add something like 'Last snapshot time' to the output of the
'cluster/status' command.

How useful would that be?

Note that there might be no need for manual compactions if we choose option 3
above, i.e. memory trimming on automatic compactions.
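A CMS-side check over the proposed counter could be a small parser of 'memory/show' output. A sketch under assumptions: 'needs_compaction' is a hypothetical helper, the sample line format is modeled on memory/show output, and 'raft-log-len' only exists once the proposed counter lands:

```shell
# Hypothetical helper: decide whether to trigger a manual compaction
# based on the proposed 'raft-log-len' counter in 'memory/show' output.
needs_compaction() {
    # $1: one memory/show line, $2: raft-log-len threshold
    len=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^raft-log-len://p')
    [ -n "$len" ] && [ "$len" -gt "$2" ]
}

# Sample line (format is an assumption for illustration):
sample='cells:7926 monitors:3 raft-connections:4 raft-log-len:12345'
if needs_compaction "$sample" 10000; then
    echo "compact and trim"
fi
```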

Comment 16 Ilya Maximets 2020-10-26 02:36:15 UTC
I prepared all the patches.  I also fixed a few issues with my previous
backlog patches from comment #5, so please use these new builds for testing:
  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/62.el7fdp/
  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/72.el8fdp/

To enable memory trimming, use the following appctl command on each of the
running ovsdb-server processes before the test:
  ovs-appctl -t <target process> ovsdb-server/memory-trim-on-compaction on
  
After that, ovsdb-server will always try to return unused memory after DB
compaction.  This usually increases the time needed to finish compaction by
1-30% depending on the current state, but I'm not sure if that's a problem.

By default, raft connections will be dropped if the backlog reaches 500
messages or 4 GB in size.  The values can be tweaked with an appctl command:
  ovs-appctl -t <target process> cluster/set-backlog-threshold DB n_msgs n_bytes
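The two knobs above can be combined into a per-node setup step; a sketch assuming typical OVN central-node socket paths (the 1000-message / 8 GB values are only an illustration of tweaking the defaults, not values from this BZ):

```shell
# Enable trim-on-compaction on the local NB and SB servers.
# Socket paths are assumptions; adjust to your deployment.
for ctl in /var/run/ovn/ovnnb_db.ctl /var/run/ovn/ovnsb_db.ctl; do
    ovs-appctl -t "$ctl" ovsdb-server/memory-trim-on-compaction on
done

# Optionally loosen the raft backlog limits from the 500 msgs / 4 GB
# defaults, e.g. to 1000 messages / 8 GB on the NB DB:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl \
    cluster/set-backlog-threshold OVN_Northbound 1000 8589934592
```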

I also sent all patches upstream for review:
  * Mitigate RAFT memory consumption issues.
    https://patchwork.ozlabs.org/project/openvswitch/list/?series=209911

Comment 17 Tim Rozet 2020-10-26 13:45:06 UTC
Thanks Ilya. So can these commands be issued at any time while the cluster is running? We have a new db manager process that monitors the OVN dbs in ovn-k8s right now. We could add code there to ensure this gets configured.

Comment 18 Dan Williams 2020-10-26 13:57:05 UTC
(In reply to Tim Rozet from comment #17)
> Thanks Ilya. So can these commands be issued at anytime while the cluster is
> running? We have a new db manager process that monitors the ovn dbs in
> ovn-k8s right now. We could add code there to ensure this gets configured.

Yes, we should have a process that monitors memory usage via memory/show (or something like that) and compacts the DB when usage reaches a given level that we determine.

Comment 19 Ilya Maximets 2020-11-10 14:28:30 UTC
All patches are accepted in upstream master and upstream branch-2.13:
v1 https://patchwork.ozlabs.org/project/openvswitch/list/?series=209911&state=*
v2 https://patchwork.ozlabs.org/project/openvswitch/list/?series=211893&state=*

@Tim, yes, all commands can be issued at any time, i.e. you can enable/disable
memory trimming on compaction and configure thresholds on raft jsonrpc connections.

Comment 20 Sai Sindhur Malleni 2020-11-12 15:07:54 UTC
Tim,

Do you know when these patches will be consumable in a downstream build? Thanks!

Comment 21 Dan Williams 2020-11-12 21:16:52 UTC
There are two parts to fixing this in OpenShift:

1) get updated OVS with the new config options and capability into OCP. Soak-testing here: https://github.com/openshift/ovn-kubernetes/pull/340

2) set defaults in the CNO, and create a memory-watcher-then-compact loop in the dbchecker container (not started)

Comment 23 Dan Williams 2021-03-04 22:29:08 UTC
Both patches landed downstream as of openvswitch2.13-2.13.0-61.el7fdp and openvswitch2.13-2.13.0-70.el8fdp.
They were shipped as part of FDP 20.I.

