Description of problem:
While investigating 1887585 I can see that NBDB memory is extremely high on every node (between 17-20GB), while northd is at 100% cpu on every node. The real disk size of the nbdb is only several MB.
After triggering core dumps by killing northd and nbdb, the new instances have less than 1MB of memory and 0% cpu.
Will attach the logs and dbs. I also have coredumps but they are too big to attach to bz.
Version-Release number of selected component (if applicable):
[root@ip-10-0-159-165 ~]# rpm -qa |grep ovn
[root@ip-10-0-159-165 ~]# rpm -qa |grep openvswitch
Created attachment 1721906 [details]
Looking at the logs the issue is similar to what we fixed in
https://bugzilla.redhat.com/show_bug.cgi?id=1834838. The fix is working
and ovsdb-server actually drops a lot of duplicated requests to install
exactly same snaphots, but one of NB DB servers is not doing well and backlog
is still growing. Not sure what are these raft messages exactly, but
at some point sending backlog on one of the raft connections exceeded 61 GB:
2020-10-15T11:40:12Z|05907|jsonrpc|INFO|Dropped 874 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2020-10-15T11:40:12Z|05908|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 15370, backlog: 61731060773.
According to above statistics, it's ~3.8 MB per message that actually looks
suspiciously close to the size of NB DB in this test, so these are still
might be snapshots. Also, at the point where ovsdb-server starts to frequently
report installation of exact same snapshots, backlog quickly starts to go
2020-10-15T11:53:49Z|06276|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T11:53:51Z|06278|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 7679, backlog: 22194409937.
2020-10-15T12:00:49Z|06416|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T12:01:01Z|06420|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 2637, backlog: 507864302.
So, I guess, that it's almost the same issue as in 1834838, but for some
reason we can not detect that snapshots are equal, or snapshots are
actually a little bit different.
Need to investigate if snapshots could be detected as different (or if they
are actually different). Another temporary solution and an emergency measure
for similar issues in the future might be to introduce a threshold and drop
connection if backlog exceeds this threshold to avoid memory exhaustion.
Unfortunately, NB DB server that caused excessive sending backlog on 2 other
servers was restarted and there are no interesting logs for this machine in
the attached archive.
Tim, is it possible to get these logs somehow? It might be good to look
at what happened to the server that it stopped processing requests in time.
Was able to get the old container log. Will attach.
Created attachment 1722679 [details]
previous nbdb log
I prepared a couple of patches for the raft implementation in ovsdb-server
that should prevent growing of backlog on raft connections. Scratch builds
for el8 and el7 are available here:
Tim, could you, please, test with them?
I didn't send these patches upsteam yet because I'm not 100% sure that all of
them are necessary, but they includes some more logging and we will be able
to see a clearer picture of what is going on.
In short, what I did:
1. Limited number of 'install_snapshot_request's to one at a time.
2. Fixed overflow of 'raft-backlog' counter in memory report.
(this one I sent upstream already)
3. Introduced hard limits on the jsonrpc backlog for raft connections.
For now it's limited to 100 messages and 2GB of backlog, but I plan
to make it configurable, if the effectiveness will be proved.
Once backlog limit reached, connection will be dropped and re-established
clearing all the remaining backlog.
All above changes are targeted for excessive backlog issue, however, looking
closely to logs, I see that there is also a complex of various issues that
are, likely, connected with each other. So, it might be not enough to fix
1. Sometimes it takes > 15 seconds for a single polling internal. This means
that there are requests that takes so much time. But election timer for
NB DB is set to 10 seconds. This leads to leader changes and attempts of
a server that processed this kind of request to catch up. It might make
sense to increase election timer at least to 16 seconds as it done for
2. Looking at the SB DB logs from must gather collected for 1887585, it looks
like SB DB itself is very big. It contains ~11M of cells. So, here we,
I suppose, have a lot of reject ACLs. Might make sense to try solution
from https://bugzilla.redhat.com/show_bug.cgi?id=1859924 .
3. In the same must gather archive there is a suspicious repeating log from
|stream_ssl|ERR|SSL_use_PrivateKey_file: error:02001002:system library:fopen:No such file or directory
|stream_ssl|ERR|SSL_use_certificate_file: error:20074002:BIO routines:file_ctrl:system lib
|stream_ssl|ERR|Certificate must be configured to use SSL
|stream_ssl|ERR|Private key must be configured to use SSL
ovn-nbctl daemon successfully started only on one of 3 master nodes.
I'm not sure if this is the issue, but looks suspicious.
Thanks Ilya. I also suspect the service reject ACLs are what triggered this problem:
The fixes for reject ACL in NB are not downstream yet. I think we should try the original scale test that triggered this issue with your new RPM. That way we don't have too many variables and can see if your fix in this RPM has an impact. I'll work with Raul to run the test and report back.
I looked through the new logs and I see that the scale of this test is not
enough to trigger backlog issues, so we're not able to see If any of my
changes have any effect.
Is it possible to run with the original scale?
For the memory consumption that doesn't go away after removing pods and
services, I'm trying to craft a minimal test with jut OVN sandbox to be
able to profile/debug locally.
I prepared following test with ovn-sandbox:
make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
# cat ../memory-test.sh
ovs-appctl -t $(pwd)/sandbox/nb1 vlog/set raft:dbg
ovs-appctl -t $(pwd)/sandbox/nb2 vlog/set raft:dbg
ovs-appctl -t $(pwd)/sandbox/nb3 vlog/set raft:dbg
export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
ovn-nbctl pg-add $pg_name
ovn-nbctl list port-group
for i in $(seq 1 4000); do
echo "Iteration: $i"
ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
ovn-nbctl acl-del $pg_name
ovn-nbctl pg-del $pg_name
ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
ovn-appctl -t ovn-nbctl exit
It just creates a single port group and adds 4000 acls, after that removes all
the acls and the port group.
I can confirm that after a single run memory grows up to 6-7GB and nbdb doesn't
free this memory after removing everything from the database.
However, if I'm running the same test second time, memory consumption stays the
same, i.e. nbdb reuses already allocated for the first test memory. OTOH, it
seems that it wasn't the case for the test performed in
i.e. nbdb doesn't seem to reuse memory in that test. Looking further into this.
(In reply to Ilya Maximets from comment #11)
> I can confirm that after a single run memory grows up to 6-7GB and nbdb
> free this memory after removing everything from the database.
One correction: This test was performed under memory sanitizer, so the
real memory consumption without it would be 2-3 times lower.
OK. I found what is going on with regards to memory not freed/reused.
RAFT implementation inside ovsdb-server stores the log of operations. This is
required to keep track of which operations are in-flight and which could be
committed. This is a basic element of RAFT.
During compaction all committed log entries are merged into a base snapshot.
However, compaction happens automatically only once in 10 minutes. During this
10 minute period all the transactions accumulated in the log. It could be
tens of thousands of transactions each of which could be fairly large. e.g.
Addition of a single port to a port group that already contains 1000 ports will
take ~100KB in json format. 10K of such transactions will take ~1GB in json
format and more in parsed representation inside ovsdb.
During density tests, I suppose, a lot of db transactions performed in 10
minutes. This grows raft log significantly. Note here that record removal is
a db transaction too that ends up in a raft log, i.e. removing values from the
database doesn't decrease memory consumption.
What could decrease memory consumption is compaction. But why it doesn't
help? The reason is implementation of memory management in C. ovsdb-server
actually calls free() on every raft log entry during compaction, however,
libc doesn't return this memory to the system. That is just how it works.
libc is able to return memory back to OS, but the application in common case
doesn't have control over this behavior.
In the test from comment #7, two test scenarios was performed one by one,
presumably, without compaction between. That is why memory was not reused,
it is still inside raft log. In case compaction happened between tests
(automatically or manually), same memory will be reused without growing
total memory consumption, because this memory is free and available for
allocation, but not given back to OS.
What we can do if we really want to limit memory consumption in this case:
1. Limit maximum number of entries in raft log and trigger compaction once
threshold reached. This will ensure that total memory consumptions is
in some sane boundaries, however, frequent compactions under heavy load
might slow down processing and introduce performance spikes that will
probably lead to leadership changes or some other disturbances inside
2. We could keep allowing raft logs to grow, but introduce a special appctl
command to call 'malloc_trim()' function. This function tries to return
all the unused memory from the heap back to OS. It's glibc specific.
This command (e.g. 'ovsdb-server/return-unused-memory') could be called
after 'ovsdb-server/compact' or periodically by the CMS if it thinks
that ovsdb-server eats too much memory. I tested this function and it
actually able to return a lot of memory back to system after compaction.
What would be better, more usable?
Tim, what do you think?
P.S. This will not resolve issue with huge backlogs on raft connections.
So, I'd still like to have testing results of builds from comment #5
with the original scale level, e.g. similar to what is described there:
Excellent analysis Ilya! I have the same concern as you that #1 would get us in trouble with perf issues. I think it's reasonable for the CMS to monitor the memory and if it sees it over a threshold then manually trigger a compact and then return-unused-memory. I guess how would we know the last time it was compacted or when it was finished? Maybe we can work together to iron out the details of how this would work.
For the backlog on the raft connections...Raul could you run with the higher scale number that Joe used so we can trigger the backlogs?
We can also try pulling in all the fixes (lower num ports in port group, reject ACL on port group) and try to ensure we dont hit any scale/perf issues.
I have a couple of options how to implement memory trimming:
1. Introduce separate appctl like 'ovsdb-server/return-unused-memory'.
This needs to be called manually/by CMS, but it will be possible to use
separately from compaction.
2. Add a flag to current 'compact' appctl,
e.g. 'ovsdb-server/compact --trim-memory'
This will trigger malloc_trim() if compaction triggered manually/by CMS.
It will not be possible to trim memory without compaction or if manual
compaction is not possible at the moment.
3. Add a configuration appctl like
Once enabled, each compaction (manual or automatic) will call malloc_trim()
trying to reclaim all the unused memory back to system.
Tim, which of above looks easier to use or more suitable?
Meanwhile, I'll start implementing option 3, but there is no big implementation
difference between options, so it could be easily changed.
For the question "how would we know the last time it was compacted or when it
was finished", I'm adding 'raft-log-len' counter to the memory report
('memory/show'), that could be used to see how many log entries could be
trimmed off by taking a snapshot/compacting. That might be a point of data
for CMS to decide if manual compaction needed.
We could also add something like 'Last snapshot time' to the output of
How useful would that be?
Note that there might be no need in manual compactions if we choose option 3
above, i.e. memory trimming on automatic compactions.
I prepared all the patches. I also fixed few issues with my previous backlog
patches from comment #5, so, please, use these new builds for testing:
To enable memory trimming, use following appctl command on each of running
ovsdb-server processes before the test:
ovs-appctl -t <target process> ovsdb-server/memory-trim-on-compaction on
After that ovsdb-server will always try to return unused memory after DB
compaction. This usually increases time needed to finish compaction by
1-30% depending on the current state. But I'm not sure if it's a problem.
By default, raft connections will be dropped if backlog reaches 500 messages
or 4GB in size. Values could be tweaked with appctl command:
ovs-appctl -t <target process> cluster/set-backlog-threshold DB n_msgs n_bytes
I also sent all patches upstream for review:
* Mitigate RAFT memory consumption issues.
Thanks Ilya. So can these commands be issued at anytime while the cluster is running? We have a new db manager process that monitors the ovn dbs in ovn-k8s right now. We could add code there to ensure this gets configured.
(In reply to Tim Rozet from comment #17)
> Thanks Ilya. So can these commands be issued at anytime while the cluster is
> running? We have a new db manager process that monitors the ovn dbs in
> ovn-k8s right now. We could add code there to ensure this gets configured.
Yes, we should have a process that monitors memory usage via memory/show (or something like that) and compacts the DB when usage reaches a given level that we determine.
All patches are accepted in upstream master and upstream branch-2.13:
@Tim, yes, all commands could be issued at any time, i.e. you could enable/disable
memory trimming on compaction and configure thresholds on raft jsonrpc connections.
Do you know when these patches will be consumable in a downstream build? Thanks!
There are two parts to fixing this in OpenShift:
1) get updated OVS with the new config options and capability into OCP. Soak-testing here: https://github.com/openshift/ovn-kubernetes/pull/340
2) set defaults in the CNO, and create a memory-watcher-then-compact loop in the dbchecker container (not started)
Both patches landed downstream as of openvswitch2.13-2.13.0-61.el7fdp and openvswitch2.13-2.13.0-70.el8fdp.
They were shipped as part of FDP 20.I