Description of problem:
While investigating 1887585 I can see that NBDB memory usage is extremely high on every node (between 17-20GB), while northd is at 100% CPU on every node. The real on-disk size of the nbdb is only several MB. After triggering core dumps by killing northd and nbdb, the new instances have less than 1MB of memory and 0% CPU. Will attach the logs and dbs. I also have coredumps but they are too big to attach to bz.

Version-Release number of selected component (if applicable):
[root@ip-10-0-159-165 ~]# rpm -qa |grep ovn
ovn2.13-20.06.2-11.el8fdp.x86_64
ovn2.13-host-20.06.2-11.el8fdp.x86_64
ovn2.13-central-20.06.2-11.el8fdp.x86_64
ovn2.13-vtep-20.06.2-11.el8fdp.x86_64
[root@ip-10-0-159-165 ~]# rpm -qa |grep openvswitch
openvswitch2.13-2.13.0-57.el8fdp.x86_64
openvswitch2.13-devel-2.13.0-57.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
Created attachment 1721906 [details] dbs, logs
Looking at the logs, the issue is similar to what we fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1834838. The fix is working and ovsdb-server actually drops a lot of duplicated requests to install exactly the same snapshots, but one of the NB DB servers is not doing well and the backlog is still growing. I'm not sure what exactly these raft messages are, but at some point the sending backlog on one of the raft connections exceeded 61 GB:

2020-10-15T11:40:12Z|05907|jsonrpc|INFO|Dropped 874 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2020-10-15T11:40:12Z|05908|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 15370, backlog: 61731060773.

According to the statistics above, that's ~3.8 MB per message, which looks suspiciously close to the size of the NB DB in this test, so these might still be snapshots. Also, at the point where ovsdb-server starts to frequently report installation of the exact same snapshots, the backlog quickly starts to go down:

2020-10-15T11:53:49Z|06276|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T11:53:51Z|06278|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 7679, backlog: 22194409937.
...
2020-10-15T12:00:49Z|06416|raft|WARN|not sending exact same install_snapshot_request to server 007c again
2020-10-15T12:01:01Z|06420|jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:10.0.205.228:44384, num of msgs: 2637, backlog: 507864302.

So, I guess, it's almost the same issue as in 1834838, but for some reason we cannot detect that the snapshots are equal, or the snapshots are actually a little bit different. We need to investigate whether snapshots could be detected as different (or whether they actually are different). Another temporary solution, and an emergency measure for similar issues in the future, might be to introduce a threshold and drop the connection if the backlog exceeds it, to avoid memory exhaustion.
Unfortunately, the NB DB server that caused the excessive sending backlog on the 2 other servers was restarted, and there are no interesting logs for this machine in the attached archive. Tim, is it possible to get these logs somehow? It would be good to look at what happened to that server that made it stop processing requests in time.
Was able to get the old container log. Will attach.
Created attachment 1722679 [details] previous nbdb log
I prepared a couple of patches for the raft implementation in ovsdb-server that should prevent growth of the backlog on raft connections. Scratch builds for el8 and el7 are available here:

http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/67.el8fdp/
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/57.el7fdp/

Tim, could you, please, test with them? I didn't send these patches upstream yet because I'm not 100% sure that all of them are necessary, but they include some more logging, so we will be able to see a clearer picture of what is going on.

In short, what I did:
1. Limited the number of 'install_snapshot_request's to one at a time.
2. Fixed overflow of the 'raft-backlog' counter in the memory report. (this one I sent upstream already)
3. Introduced hard limits on the jsonrpc backlog for raft connections. For now it's limited to 100 messages and 2GB of backlog, but I plan to make it configurable if its effectiveness is proven. Once the backlog limit is reached, the connection will be dropped and re-established, clearing all the remaining backlog.

All the above changes target the excessive backlog issue; however, looking closely at the logs, I see that there is also a set of various issues that are likely connected with each other. So, fixing backlogs alone might not be enough. A few observations:

1. Sometimes it takes > 15 seconds for a single poll interval. This means that there are requests that take that much time. But the election timer for the NB DB is set to 10 seconds. This leads to leader changes and attempts by the server that processed this kind of request to catch up. It might make sense to increase the election timer to at least 16 seconds, as is done for the Southbound DB.

2. Looking at the SB DB logs from the must-gather collected for 1887585, it looks like the SB DB itself is very big. It contains ~11M cells. So here, I suppose, we have a lot of reject ACLs.
It might make sense to try the solution from https://bugzilla.redhat.com/show_bug.cgi?id=1859924 .

3. In the same must-gather archive there is a suspicious repeating log from ovn-nbctl daemons:

|stream_ssl|ERR|SSL_use_PrivateKey_file: error:02001002:system library:fopen:No such file or directory
|stream_ssl|ERR|SSL_use_certificate_file: error:20074002:BIO routines:file_ctrl:system lib
|stream_ssl|ERR|Certificate must be configured to use SSL
|stream_ssl|ERR|Private key must be configured to use SSL

The ovn-nbctl daemon successfully started on only one of the 3 master nodes. I'm not sure if this is the issue, but it looks suspicious.
Thanks Ilya. I also suspect the service reject ACLs are what triggered this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1887585#c9 The fixes for reject ACL in NB are not downstream yet. I think we should try the original scale test that triggered this issue with your new RPM. That way we don't have too many variables and can see if your fix in this RPM has an impact. I'll work with Raul to run the test and report back.
I looked through the new logs and I see that the scale of this test is not enough to trigger the backlog issues, so we're not able to see if any of my changes have any effect. Is it possible to run with the original scale? For the memory consumption that doesn't go away after removing pods and services, I'm trying to craft a minimal test with just an OVN sandbox to be able to profile/debug locally.
I prepared the following test with ovn-sandbox:

make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

# cat ../memory-test.sh
pg_name=my_port_group_$1

ovs-appctl -t $(pwd)/sandbox/nb1 vlog/set raft:dbg
ovs-appctl -t $(pwd)/sandbox/nb2 vlog/set raft:dbg
ovs-appctl -t $(pwd)/sandbox/nb3 vlog/set raft:dbg

export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)

ovn-nbctl pg-add $pg_name
ovn-nbctl list port-group

for i in $(seq 1 4000); do
    date
    echo "Iteration: $i"
    ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
    ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
done

ovn-nbctl acl-del $pg_name
ovn-nbctl pg-del $pg_name

ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
ovn-appctl -t ovn-nbctl exit

It just creates a single port group and adds 4000 ACLs, after which it removes all the ACLs and the port group. I can confirm that after a single run memory grows up to 6-7GB and nbdb doesn't free this memory after removing everything from the database. However, if I run the same test a second time, memory consumption stays the same, i.e. nbdb reuses the memory already allocated during the first run. OTOH, it seems that this wasn't the case for the test performed in https://bugzilla.redhat.com/show_bug.cgi?id=1888829#c7, i.e. nbdb doesn't seem to reuse memory in that test. Looking further into this.
(In reply to Ilya Maximets from comment #11) > I can confirm that after a single run memory grows up to 6-7GB and nbdb > doesn't > free this memory after removing everything from the database. One correction: This test was performed under memory sanitizer, so the real memory consumption without it would be 2-3 times lower.
OK. I found out what is going on with regard to memory not being freed/reused.

The RAFT implementation inside ovsdb-server stores a log of operations. This is required to keep track of which operations are in flight and which could be committed. This is a basic element of RAFT. During compaction, all committed log entries are merged into a base snapshot. However, compaction happens automatically only once in 10 minutes. During this 10 minute period, all the transactions accumulate in the log. That could be tens of thousands of transactions, each of which could be fairly large, e.g. addition of a single port to a port group that already contains 1000 ports will take ~100KB in json format. 10K of such transactions will take ~1GB in json format, and more in the parsed representation inside ovsdb. During density tests, I suppose, a lot of db transactions are performed in 10 minutes. This grows the raft log significantly. Note here that record removal is a db transaction too that ends up in the raft log, i.e. removing values from the database doesn't decrease memory consumption.

What could decrease memory consumption is compaction. But why doesn't it help? The reason is the implementation of memory management in C. ovsdb-server actually calls free() on every raft log entry during compaction; however, libc doesn't return this memory to the system. That is just how it works. libc is able to return memory back to the OS, but the application in the common case doesn't have control over this behavior.

In the test from comment #7, two test scenarios were performed one after the other, presumably without compaction in between. That is why memory was not reused; it was still inside the raft log. In case compaction happened between the tests (automatically or manually), the same memory would be reused without growing total memory consumption, because this memory is free and available for allocation, just not given back to the OS.

What we can do if we really want to limit memory consumption in this case:

1. Limit the maximum number of entries in the raft log and trigger compaction once the threshold is reached. This will ensure that total memory consumption stays within some sane boundaries; however, frequent compactions under heavy load might slow down processing and introduce performance spikes that will probably lead to leadership changes or some other disturbances inside the cluster.

2. We could keep allowing raft logs to grow, but introduce a special appctl command to call the 'malloc_trim()' function. This function tries to return all the unused memory from the heap back to the OS. It's glibc-specific. This command (e.g. 'ovsdb-server/return-unused-memory') could be called after 'ovsdb-server/compact' or periodically by the CMS if it thinks that ovsdb-server eats too much memory. I tested this function and it is actually able to return a lot of memory back to the system after compaction.

Which would be better, more usable? Tim, what do you think?

P.S. This will not resolve the issue with huge backlogs on raft connections. So, I'd still like to have testing results for the builds from comment #5 at the original scale level, e.g. similar to what is described here: https://bugzilla.redhat.com/show_bug.cgi?id=1887585#c0
Excellent analysis Ilya! I have the same concern as you that #1 would get us in trouble with perf issues. I think it's reasonable for the CMS to monitor the memory and, if it sees it over a threshold, manually trigger a compaction and then return-unused-memory. I guess, how would we know the last time it was compacted or when it finished? Maybe we can work together to iron out the details of how this would work. For the backlog on the raft connections... Raul, could you run with the higher scale number that Joe used so we can trigger the backlogs? We can also try pulling in all the fixes (lower number of ports in port group, reject ACL on port group) and try to ensure we don't hit any scale/perf issues.
I have a couple of options for how to implement memory trimming:

1. Introduce a separate appctl like 'ovsdb-server/return-unused-memory'. This needs to be called manually/by the CMS, but it will be possible to use it separately from compaction.

2. Add a flag to the current 'compact' appctl, e.g. 'ovsdb-server/compact --trim-memory'. This will trigger malloc_trim() when compaction is triggered manually/by the CMS. It will not be possible to trim memory without compaction, or if manual compaction is not possible at the moment.

3. Add a configuration appctl like 'ovsdb-server/memory-trim-on-compaction (on|off)'. Once enabled, each compaction (manual or automatic) will call malloc_trim(), trying to reclaim all the unused memory back to the system.

Tim, which of the above looks easier to use or more suitable? Meanwhile, I'll start implementing option 3, but there is no big implementation difference between the options, so it could be easily changed.

For the question "how would we know the last time it was compacted or when it was finished", I'm adding a 'raft-log-len' counter to the memory report ('memory/show') that could be used to see how many log entries could be trimmed off by taking a snapshot/compacting. That might be a data point for the CMS to decide whether manual compaction is needed. We could also add something like 'Last snapshot time' to the output of the 'cluster/status' command. How useful would that be? Note that there might be no need for manual compactions if we choose option 3 above, i.e. memory trimming on automatic compactions.
I prepared all the patches. I also fixed a few issues with my previous backlog patches from comment #5, so please use these new builds for testing:

http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/62.el7fdp/
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/72.el8fdp/

To enable memory trimming, use the following appctl command on each of the running ovsdb-server processes before the test:

ovs-appctl -t <target process> ovsdb-server/memory-trim-on-compaction on

After that, ovsdb-server will always try to return unused memory after DB compaction. This usually increases the time needed to finish compaction by 1-30% depending on the current state, but I'm not sure if that's a problem.

By default, raft connections will be dropped if the backlog reaches 500 messages or 4GB in size. The values can be tweaked with an appctl command:

ovs-appctl -t <target process> cluster/set-backlog-threshold DB n_msgs n_bytes

I also sent all the patches upstream for review:
* Mitigate RAFT memory consumption issues.
  https://patchwork.ozlabs.org/project/openvswitch/list/?series=209911
Thanks Ilya. So can these commands be issued at anytime while the cluster is running? We have a new db manager process that monitors the ovn dbs in ovn-k8s right now. We could add code there to ensure this gets configured.
(In reply to Tim Rozet from comment #17) > Thanks Ilya. So can these commands be issued at anytime while the cluster is > running? We have a new db manager process that monitors the ovn dbs in > ovn-k8s right now. We could add code there to ensure this gets configured. Yes, we should have a process that monitors memory usage via memory/show (or something like that) and compacts the DB when usage reaches a given level that we determine.
All patches have been accepted in upstream master and upstream branch-2.13:

v1 https://patchwork.ozlabs.org/project/openvswitch/list/?series=209911&state=*
v2 https://patchwork.ozlabs.org/project/openvswitch/list/?series=211893&state=*

@Tim, yes, all commands can be issued at any time, i.e. you can enable/disable memory trimming on compaction and configure thresholds on raft jsonrpc connections.
Tim, Do you know when these patches will be consumable in a downstream build? Thanks!
There are two parts to fixing this in OpenShift: 1) get updated OVS with the new config options and capability into OCP. Soak-testing here: https://github.com/openshift/ovn-kubernetes/pull/340 2) set defaults in the CNO, and create a memory-watcher-then-compact loop in the dbchecker container (not started)
Both patches landed downstream as of openvswitch2.13-2.13.0-61.el7fdp and openvswitch2.13-2.13.0-70.el8fdp. They were shipped as part of FDP 20.I