Description of problem:

We have tried running the cluster-density (mastervertical) test with 1000 projects at 100-node scale on OCP 4.5 rc2, rc6 and rc7 with OVN as the overlay network. The test consistently fails to create pods after a certain time with error #1. The openshift-apiserver pods were down at this point (openshift-apiserver events #2, apiserver log #3, kubelet log #4, journal log #5). We have also tried the above test with the IPAM fix #6 and it still fails.

#1
Warning  FailedCreatePodSandBox  127m  kubelet, ip-10-0-214-114.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_mastervertical190_5b214ef7-20f7-4406-a98e-1777d250872a_0(65d29eeca021420497e7f6285d8d5d9bb9d3b7acae7d882957db550171b0daca): Multus: [mastervertical190/deploymentconfig0-1-deploy]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[mastervertical190/deploymentconfig0-1-deploy] failed to get pod annotation: timed out waiting for the condition

#2 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/openshift-apiserver.events
#3 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/apiserver.log
#4 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/kubelet.log
#5 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/journal.log
#6 https://github.com/ovn-org/ovn-kubernetes/pull/1365

Version-Release number of selected component (if applicable):
OCP 4.5 rc2, rc6 and rc7

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster using 4.5.0-rc.7
2. Deploy a 100-node cluster
3. Run cluster-density (mastervertical) with at least 1000 namespaces

Actual results:
Pod creation errors start after a certain number of namespaces

Expected results:
All pods and namespaces are created successfully

Additional info:
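For anyone reproducing this, a quick way to surface the sandbox-creation failures is to filter the events in one of the affected namespaces. A minimal sketch; the namespace name is taken from error #1 above and will differ per run:

```
# List FailedCreatePodSandBox events in an affected namespace, newest last.
# The namespace name (mastervertical190) is just an example from error #1.
oc get events -n mastervertical190 \
  --field-selector reason=FailedCreatePodSandBox \
  --sort-by=.lastTimestamp
```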
Need the ovnkube-master logs and node logs here. A must-gather for the cluster would be even better.
(In reply to Dan Williams from comment #1)
> Need the ovnkube-master logs and node logs here. A must-gather for the
> cluster would be even better.

The must-gather pod did not start. The other logs can be found here:
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/bz_logs/
The main issue we observe is that during a period of heavy creation of objects that involve OVN operations (pods and services), memory consumption by the OVN components grows dramatically, and this ends with the kernel's OOM killer terminating other control plane processes.

$ oc adm top pods -n openshift-ovn-kubernetes -l app=ovnkube-master --containers
POD                    NAME             CPU(cores)   MEMORY(bytes)
ovnkube-master-579sg   northd           6m           551Mi
ovnkube-master-579sg   nbdb             8m           13356Mi
ovnkube-master-579sg   ovnkube-master   4m           134Mi
ovnkube-master-579sg   sbdb             8m           2310Mi
ovnkube-master-j9vbz   nbdb             9m           8789Mi
ovnkube-master-j9vbz   northd           983m         728Mi
ovnkube-master-j9vbz   ovnkube-master   6m           130Mi
ovnkube-master-j9vbz   sbdb             20m          3301Mi
ovnkube-master-mnbkk   northd           6m           551Mi
ovnkube-master-mnbkk   ovnkube-master   64m          248Mi
ovnkube-master-mnbkk   sbdb             4m           2277Mi
ovnkube-master-mnbkk   nbdb             24m          7463Mi

During this object creation storm, we also observe RAFT cluster connection errors like the ones below:

$ oc logs -f ovnkube-master-579sg -c nbdb --tail 100 | grep error
2020-07-10T15:17:48Z|00423|jsonrpc|WARN|unix#134: send error: Broken pipe
2020-07-10T15:17:48Z|00426|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00427|jsonrpc|WARN|ssl:10.0.134.4:57640: send error: Broken pipe
2020-07-10T15:17:48Z|00428|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00429|jsonrpc|WARN|ssl:10.0.192.134:57768: send error: Broken pipe
2020-07-10T15:17:48Z|00430|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00431|jsonrpc|WARN|ssl:10.0.134.4:57774: send error: Broken pipe
2020-07-10T15:17:48Z|00432|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00433|jsonrpc|WARN|ssl:10.0.168.157:59964: send error: Broken pipe
2020-07-10T15:17:48Z|00434|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00435|stream_ssl|WARN|SSL_write: system error (Broken pipe)

$ oc logs -f ovnkube-master-579sg -c sbdb --tail 100 | grep error
2020-07-10T15:10:49Z|00044|jsonrpc|WARN|ssl:10.0.192.134:37434: receive error: Protocol error
2020-07-10T15:10:49Z|00046|reconnect|WARN|ssl:10.0.192.134:37434: connection dropped (Protocol error)
2020-07-10T15:10:49Z|00048|jsonrpc|WARN|ssl:10.0.168.157:33454: receive error: Protocol error
2020-07-10T15:10:49Z|00050|reconnect|WARN|ssl:10.0.168.157:33454: connection dropped (Protocol error)
2020-07-10T15:22:37Z|00070|jsonrpc|WARN|unix#146: send error: Broken pipe
2020-07-10T15:22:37Z|00072|jsonrpc|WARN|ssl:10.0.192.134:45446: receive error: Protocol error
2020-07-10T15:22:37Z|00074|reconnect|WARN|ssl:10.0.192.134:45446: connection dropped (Protocol error)
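A quick way to confirm the OOM killer is hitting control plane processes is to check the kernel log on a master node. A sketch; the node name below is illustrative, substitute any master showing the symptom:

```
# Look for OOM killer activity in the kernel log of a master node.
# Node name is an example; pick any affected master.
oc debug node/ip-10-0-134-4.us-west-2.compute.internal -- \
  chroot /host journalctl -k --no-pager | grep -iE 'out of memory|oom-kill'
```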
Created attachment 1700619 [details] OVN logs

The attached OVN pod logs were taken after deploying, in a loop, 2000 Deployments with 1 replica each plus 2000 Services.
13G seems excessive for nbdb with a configuration of this size. So our hypothesis is that the nbdb pod grows in size and the openshift-apiserver pods end up being killed to free memory? That seems odd. Anil, can you please make sure the OVN and OVS RPM versions we are running on this setup have the fixes from Ilya and have the json-rpc inactivity timer disabled?
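A minimal sketch for checking this from one of the ovnkube-master pods (the pod name is taken from the `oc adm top pods` output above; the grep patterns are assumptions about the changelog wording):

```
# Verify the OVN/OVS package versions running in the nbdb container.
oc rsh -n openshift-ovn-kubernetes -c nbdb ovnkube-master-579sg \
  rpm -qa | grep -E 'ovn2.13|openvswitch2.13'

# Check whether the RAFT jsonrpc inactivity probe fix is in the changelog.
oc rsh -n openshift-ovn-kubernetes -c nbdb ovnkube-master-579sg \
  rpm -q --changelog openvswitch2.13 | grep -i 'inactivity probe'
```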
Found on a 4.5.0-rc7 aws/ovn cluster:

openvswitch2.13-2.13.0-29.el7fdp.x86_64
ovn2.13-2.13.0-31.el7fdp.x86_64

which corresponds to the versions tagged into rhaos-4.5-rhel-7-candidate via our SCR. That OVS version *does* contain the various RAFT fixes:

* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-29
- raft: Avoid sending equal snapshots. (#1834838) [65861440ed1baaa8dbe1148a5df230f3d8595e20]
* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-28
- ovsdb-server: Fix schema leak while reading db. (#1834838) [13f16c833f38da9c7bdec3615eed90eef5449239]
* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-27
- ovsdb: Add raft memory usage to memory report. (#1834838) [a9d4cc1ca8060856fd296fa80d1d5edef131e60a]
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290) [b12acf45a6872dda85642cbc73dd86eb529be17e]
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729) [bb552cffb89104c2bb19b8aff749b8b825a6db13]
The nb-db server is definitely under memory stress here, but unfortunately the log file for the nb-db container that grows to 13G is corrupted, so we can't learn much from it. Looking at the other nb-db pods shows that:

(1) nb-db was at the 5-second election-timer value (reset from 16 seconds to 5 seconds during the restart):

```
2020-07-10T14:58:29Z|00025|raft|INFO|Election timer changed from 16000 to 5000
```

(2) This caused 4 partitions during the 6-8 minutes of the workload test.

(3) nb-db memory consumption increased from ~800 MB to ~8 GB (on follower nodes; the master node log is corrupted) during the 7-minute test:

```
2020-07-10T15:06:32Z|00026|memory|INFO|peak resident set size grew 56% in last 500.4 seconds, from 537608 kB to 838876 kB
2020-07-10T15:06:32Z|00027|memory|INFO|cells:27051 monitors:0 raft-connections:4
2020-07-10T15:07:22Z|00028|memory|INFO|peak resident set size grew 54% in last 50.0 seconds, from 838876 kB to 1289076 kB
2020-07-10T15:07:22Z|00029|memory|INFO|cells:30586 monitors:0 raft-connections:4
2020-07-10T15:08:22Z|00030|memory|INFO|peak resident set size grew 56% in last 60.0 seconds, from 1289076 kB to 2013316 kB
2020-07-10T15:08:22Z|00031|memory|INFO|cells:34495 monitors:0 raft-connections:4
2020-07-10T15:09:32Z|00032|memory|INFO|peak resident set size grew 51% in last 70.0 seconds, from 2013316 kB to 3042916 kB
2020-07-10T15:09:32Z|00033|memory|INFO|cells:39036 monitors:0 raft-connections:4
2020-07-10T15:11:02Z|00034|memory|INFO|peak resident set size grew 54% in last 90.0 seconds, from 3042916 kB to 4680016 kB
2020-07-10T15:11:02Z|00035|memory|INFO|cells:43987 monitors:0 raft-connections:4
2020-07-10T15:12:52Z|00036|memory|INFO|peak resident set size grew 54% in last 110.0 seconds, from 4680016 kB to 7197376 kB
2020-07-10T15:12:52Z|00037|memory|INFO|cells:49842 monitors:0 raft-connections:4
```

It seems like the nb-db hashmap is bloating aggressively and expanding:

```
2020-07-10T15:14:40Z|00047|coverage|INFO|hmap_expand 1139.6/sec 470.000/sec 122.1694/sec total: 439810
2020-07-10T15:14:40Z|00056|coverage|INFO|util_xalloc 210649.0/sec 728463.650/sec 85936.6717/sec total: 309372018
```

We need to profile nb-db's memory usage to see what's causing this bloating. I am working on recreating the issue; once we have more details on the memory bloating, we need to involve the OVN team to dig into it further.
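For reference, the RAFT role, election timer and the ovsdb-server memory report can be pulled directly from the nbdb container, roughly as in the sketch below (the control socket path is an assumption based on the ovsdb-server command line seen later in this bug):

```
# Cluster status (role, term, election timer, connections) for the NB DB.
# Socket path is an assumption; adjust to match the running ovsdb-server.
oc exec -n openshift-ovn-kubernetes ovnkube-master-579sg -c nbdb -- \
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

# Memory report of the ovsdb-server process (cells, monitors, raft data).
oc exec -n openshift-ovn-kubernetes ovnkube-master-579sg -c nbdb -- \
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show
```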
This bug is somewhat similar to the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1838334

That bug reports a similar issue, but during horizontal scaling of the number of nodes. At scale, it ends up killing the sb-db pod and the system never recovers. Given that nb-db and sb-db share the same code base underneath, the root cause might be the same.
This scale test surfaced multiple issues related to memory bloating. I recreated the issue twice on my scale setup, and it's not just the pods running on the master nodes that are bloating; the OVN pods running on the worker nodes are bloating as well (consuming around 20G). I opened the following two bugs that need to be addressed for this scale test to pass:

https://bugzilla.redhat.com/show_bug.cgi?id=1859883
https://bugzilla.redhat.com/show_bug.cgi?id=1859924
Let's use this bug to track only the nbdb memory bloating issue.
Recreated the issue. The moment we hit the scale test failure, the memory consumption of the nbdb containers was:

ovnkube-master-49429   nbdb   2m   12087Mi
ovnkube-master-dd7gs   nbdb   2m   16545Mi
ovnkube-master-jhjwf   nbdb   4m   12472Mi

and the NB DB file sizes were:

-rw-r--r--. 1 root root 51M Aug 7 00:50 ./ovnkube-master-49429/ovnnb_db.db
-rw-r--r--. 1 root root 13M Aug 7 00:50 ./ovnkube-master-dd7gs/ovnnb_db.db
-rw-r--r--. 1 root root 52M Aug 7 00:53 ./ovnkube-master-jhjwf/ovnnb_db.db

The tests were using the following OVN version:

sh-4.2# rpm -qa | grep ovn
ovn2.13-2.13.0-31.el7fdp.x86_64
ovn2.13-central-2.13.0-31.el7fdp.x86_64
ovn2.13-host-2.13.0-31.el7fdp.x86_64
ovn2.13-vtep-2.13.0-31.el7fdp.x86_64

Log files and db files are present in the log attachment. I also collected the cluster status and db file sizes periodically (every 5 minutes) to track the partitions and file size growth.
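A minimal sketch of that collection loop, for reference (the socket and DB file paths are assumptions and may differ by release, e.g. /etc/ovn vs /etc/openvswitch):

```
# Every 5 minutes, record RAFT role/term and on-disk NB DB size per master.
while true; do
  date
  for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
               -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${pod}"
    oc exec -n openshift-ovn-kubernetes "${pod}" -c nbdb -- \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound \
      | grep -E 'Role|Term'
    oc exec -n openshift-ovn-kubernetes "${pod}" -c nbdb -- \
      ls -lh /etc/ovn/ovnnb_db.db   # DB path is an assumption
  done
  sleep 300
done
```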
Uploaded the logs here, as Bugzilla doesn't allow files larger than 20M:
https://drive.google.com/file/d/18dIf6qNP3IQvQOlVAOVjH6ppZuF-jKoV/view?usp=sharing
Thanks, Anil. I'm looking at the logs now. Could you please also confirm that it uses openvswitch2.13-2.13.0-29.el7fdp.x86_64?
Hi Ilya, yes, it's using the same version:

# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-29.el7fdp.x86_64
openvswitch2.13-devel-2.13.0-29.el7fdp.x86_64
Short summary of the email thread discussion:

The main part of the Northbound DB size is the same set of ACLs repeated for each logical switch. There are 100 logical switches in this test, each of them has 1980 ACLs, and these ACLs are identical across logical switches. Just storing them in a fully compacted DB takes ~9.5 MB. One transaction to add/remove one ACL on all logical switches involves approximately the same amount of data being constructed, sent, parsed, replicated within the RAFT cluster, and stored on disk. Adding/removing all of these ACLs amounts to ~9.4 GB of traffic. That actually grows the size of the NB DB to several GB during the test and might be one of the things that contributes to the process memory consumption, since all of this data is stored somewhere in the process memory at some point.

A solution for this might be to use port groups for this kind of ACL in ovn-k8s instead of adding ACLs to each logical switch. This way we specify the list of ACLs only once in the DB, saving a lot of DB space and, probably, a bit of processing time.

These are reject ACLs that look like this:

ovs-nbctl: run --id=@acl -- create acl direction=from-lport priority=1000 \"match=\\\"ip4.dst==172.30.246.221 && tcp && tcp.dst==5443\\\"\" action=reject \"name=98416181-8369-413c-aefb-6bb940850568-172.30.246.221\\\\:5443\" -- add logical_switch cf601302-50f3-4b93-ae3c-541817e3c20f acls @acl -- add logical_switch f7895104-517f-4f79-b8c7-7cb2f771d612 acls @acl -- add logical_switch 41055f06-5098-42fc-bb05-9c08c4d1c248 acls @acl ...

Anil is going to cook up a patch to use a Port Group for these in ovn-k8s and test it. BTW, it's not clear whether this is the root cause of the high memory consumption, but we'll see once the patch and testing are done. This is a good change in any case.
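For illustration, the port-group shape of the same reject ACL could look roughly like the ovn-nbctl sketch below. The group name is made up, and the real ovn-k8s change drives this through its ovsdb client rather than nbctl:

```
# Create a single cluster-wide port group and attach the reject ACL once,
# instead of adding the same ACL to all 100 logical switches.
# "clusterPortGroup" and the match are illustrative only.
ovn-nbctl pg-add clusterPortGroup
ovn-nbctl --type=port-group acl-add clusterPortGroup from-lport 1000 \
  'ip4.dst==172.30.246.221 && tcp && tcp.dst==5443' reject

# Logical switch ports are then added to the group as needed; the ACL is
# evaluated on the switches that contain those ports.
ovn-nbctl pg-set-ports clusterPortGroup <lsp-name-or-uuid> ...
```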
Hi, Anil. Did you have a chance to test memory consumption with Port Groups used for reject ACLs? I see the work-in-progress pull request created by Dan: https://github.com/openshift/ovn-kubernetes/pull/233 But there was no activity for a long time.
Once ovn-k8s moves to a port group to apply the LB deny ACLs, cleaning up the old ACLs during the upgrade can be a bit tricky. We do use a name for these ACLs (lb-<source-ip>-<source-port>), which can be used to clean up the related old ACLs from the Logical_Switch acls column, but this can be fairly compute-intensive if the upgrade is triggered on a large-scale cluster with thousands of services. The current plan is to create a port group per node and move all the Logical_Switch ACLs to the port group (including the LB, egress firewall, and network policy ACLs). Then, during the upgrade, cleaning up the old ACLs just requires clearing the acls column in the Logical_Switch table, as in the sketch below. That would be a cleaner approach for the upgrade and gives a better structure going forward for dealing with similar kinds of ACLs.
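Assuming all ACL types really have been moved off the Logical_Switch table, the upgrade cleanup could be as simple as this hypothetical sketch (the blanket clear is an assumption; the real migration lives in the ovn-k8s code):

```
# Hypothetical upgrade cleanup: drop the per-switch ACL references once the
# ACLs have been re-created on port groups. Only safe after ALL ACL types
# (LB reject, egress firewall, network policy) have been migrated.
for ls in $(ovn-nbctl --bare --columns=name list Logical_Switch); do
  ovn-nbctl clear Logical_Switch "${ls}" acls
done
```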
During an IRC discussion with Anil, we concluded that a port group per node will not solve the issue, because we would still have all the same repeated ACLs, just in a different DB table. So it should be a single cluster-wide port group for the reject ACLs. Reassigning this to Anil as he is working on the fix. Also changing the component to ovn-kubernetes as the fix should be implemented there.
The following PR contains the patch for this issue. I am currently testing it at a higher scale and profiling the nb/sb DB stats.

https://github.com/ovn-org/ovn-kubernetes/pull/1711
Hi Ilya,

I ran the scale test with the above PR (applying ACLs on a Port Group). This helps reduce the nbdb memory consumption significantly. I ran the same test, and this time the test passed. nbdb memory consumption was as follows for all the nodes:

```
ovnkube-master-c2tkj:
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 4989868  4.6g  8168 S   0.0  3.7 20:52.34 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+

ovnkube-master-tb9rt (master):
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 5471368  5.0g  8036 S   0.0  4.0 24:02.01 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+

ovnkube-master-vf4r7:
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 4745824  4.3g  8152 S   0.0  3.5  7:07.13 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+
```

So the overall memory consumption with the above patch, at a scale of 100 nodes, 1000 projects, 3000+ services and 16300 pods, is around 5G for nbdb, compared to 12-16G before.

At this point I am not sure whether 5G is the expected memory consumption for the nbdb process at this scale.

I was not able to collect memory consumption stats periodically, so I don't know the rate of growth or whether the memory consumption settled down or was constantly growing with the number of projects/services. I will profile that in the next scale test.

I collected logs (db files and other logs) from this scale test and uploaded them to the following link. Please have a look, and if there is something that can further reduce this memory consumption, please let us know.

https://drive.google.com/file/d/1hSq5f93kNp1PwY81fifWrCuaKnhaymjh/view?usp=sharing
(In reply to Anil Vishnoi from comment #27)
> Hi Ilya,
>
> I ran the scale test with the above PR (applying ACLs on a Port Group). This
> helps reduce the nbdb memory consumption significantly. I ran the same test,
> and this time the test passed. nbdb memory consumption was as follows for
> all the nodes:
> ...
>
> So the overall memory consumption with the above patch, at a scale of 100
> nodes, 1000 projects, 3000+ services and 16300 pods, is around 5G for nbdb,
> compared to 12-16G before.
>
> At this point I am not sure whether 5G is the expected memory consumption
> for the nbdb process at this scale.
>
> I was not able to collect memory consumption stats periodically, so I don't
> know the rate of growth or whether the memory consumption settled down or
> was constantly growing with the number of projects/services. I will profile
> that in the next scale test.
>
> I collected logs (db files and other logs) from this scale test and uploaded
> them to the following link. Please have a look, and if there is something
> that can further reduce this memory consumption, please let us know.

Hi. Thanks for this data. I looked at the Northbound database files and this is already a good improvement.

After your change, the main contributor to memory consumption is the set of transactions that add ports to this one port group.

Looking at the database at the end of the test, there are ~16K ports in this group. One transaction to add or remove a single port from this port group generates about 700 KB of data. Adding all 16K ports one by one therefore generates ~5-6 GB of traffic. Taking into account that this traffic is replicated within the RAFT cluster and parsed in a few places while executing/storing transactions and compacting the database, I could imagine 6 GB of process memory being used at some point. This doesn't mean it could not be improved by reclaiming some memory back to the system, but the value still seems reasonable.

If you want to further reduce memory consumption, there are possible solutions:

1. Batching of these transactions. If it's possible to add all ports of one service in one transaction, this will reduce the amount of traffic significantly, i.e. instead of 16K transactions (one per port) we will have only 3K transactions (one per service), e.g. a ~5x improvement (speculative number).

2. Dumitru pointed out that there is no need to add all ports to this port group, because OVN currently creates ACLs per logical switch and not per port. So you only need one port per logical switch added to this port group. Dumitru said that there is a management port on each logical switch, so it should be possible to add only that one port to the port group, and remove it from the port group when there are no more ports that need reject ACLs on that logical switch. Dumitru might correct me if I didn't get this correctly.
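For what option (1) would look like mechanically: ovn-nbctl (and the ovsdb clients ovn-k8s uses) can fold several mutations into a single transaction. A sketch; the port group name and port IDs are placeholders:

```
# One ovsdb transaction instead of three: the "--"-separated commands are
# committed together, so the RAFT log records a single entry.
ovn-nbctl \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-1> \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-2> \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-3>
```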
(In reply to Ilya Maximets from comment #31)
> (In reply to Anil Vishnoi from comment #27)
> > ...
>
> Hi. Thanks for this data. I looked at the Northbound database files and this
> is already a good improvement.
>
> After your change, the main contributor to memory consumption is the set of
> transactions that add ports to this one port group.
> ...
>
> If you want to further reduce memory consumption, there are possible
> solutions:
>
> 1. Batching of these transactions. If it's possible to add all ports of one
>    service in one transaction, this will reduce the amount of traffic
>    significantly, i.e. instead of 16K transactions (one per port) we will
>    have only 3K transactions (one per service), e.g. a ~5x improvement
>    (speculative number).
>
> 2. Dumitru pointed out that there is no need to add all ports to this port
>    group, because OVN currently creates ACLs per logical switch and not per
>    port. So you only need one port per logical switch added to this port
>    group. Dumitru said that there is a management port on each logical
>    switch, so it should be possible to add only that one port to the port
>    group, and remove it from the port group when there are no more ports
>    that need reject ACLs on that logical switch. Dumitru might correct me if
>    I didn't get this correctly.

I believe (1) is not possible, because logical ports are created as and when containers are created, so batching them would be tricky and would probably break some of the Kubernetes conformance tests as well. If (2) works, then that is the simpler solution and we don't really need (1). I will try (2); I think it will reduce the memory usage significantly.
Ilya, after implementing (2) above, the memory usage went down from 5G to 500MB (https://github.com/ovn-org/ovn-kubernetes/pull/1711), which is a significant improvement. With the reduced memory consumption I have not seen any partitions in the nb-db cluster either, so it increased the overall stability of the RAFT cluster as well.
@mifiedle Can you help verify this? Let me know; I can also hop onto the setup if required. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633