Description of problem:

We have tried running the cluster-density (mastervertical) test with 1000 projects at 100-node scale on OCP 4.5 rc2, rc6 and rc7 with OVN as the overlay network. The test consistently fails to create pods after a certain time with error #1. The openshift-apiserver pods were down at this point (openshift-apiserver events #2, apiserver log #3, kubelet log #4, journal log #5). We have also tried the above test with the IPAM fix #6 and it still fails.

#1
Warning  FailedCreatePodSandBox  127m  kubelet, ip-10-0-214-114.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_mastervertical190_5b214ef7-20f7-4406-a98e-1777d250872a_0(65d29eeca021420497e7f6285d8d5d9bb9d3b7acae7d882957db550171b0daca): Multus: [mastervertical190/deploymentconfig0-1-deploy]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[mastervertical190/deploymentconfig0-1-deploy] failed to get pod annotation: timed out waiting for the condition

#2 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/openshift-apiserver.events
#3 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/apiserver.log
#4 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/kubelet.log
#5 http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/fubar_cluster_logs/journal.log
#6 https://github.com/ovn-org/ovn-kubernetes/pull/1365

Version-Release number of selected component (if applicable):
OCP 4.5 rc2, rc6 and rc7

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster using 4.5.0-rc.7
2. Deploy a 100-node cluster
3. Run cluster-density (mastervertical) with at least 1000 namespaces

Actual results:
Pod creation errors start after a certain number of namespaces

Expected results:
All pods and namespaces are created successfully

Additional info:
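For anyone reproducing this, a quick way to surface the sandbox-creation failures is to filter the events in one of the affected namespaces. A minimal sketch; the namespace name is taken from error #1 above and will differ per run:

```
# List FailedCreatePodSandBox events in an affected namespace, newest last.
# The namespace name (mastervertical190) is just an example from error #1.
oc get events -n mastervertical190 \
  --field-selector reason=FailedCreatePodSandBox \
  --sort-by=.lastTimestamp
```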
Need the ovnkube-master logs and node logs here. A must-gather for the cluster would be even better.
(In reply to Dan Williams from comment #1)
> Need the ovnkube-master logs and node logs here. A must-gather for the
> cluster would be even better.

The must-gather pod did not start. The other logs can be found here:
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.5-ovn/bz_logs/
The main issue we observe is that during a period of heavy creation of objects that involve OVN operations (pods and services), memory consumption by the OVN components grows dramatically, and this ends with the kernel's OOM killer terminating other control plane processes.

$ oc adm top pods -n openshift-ovn-kubernetes -l app=ovnkube-master --containers
POD                    NAME             CPU(cores)   MEMORY(bytes)
ovnkube-master-579sg   northd           6m           551Mi
ovnkube-master-579sg   nbdb             8m           13356Mi
ovnkube-master-579sg   ovnkube-master   4m           134Mi
ovnkube-master-579sg   sbdb             8m           2310Mi
ovnkube-master-j9vbz   nbdb             9m           8789Mi
ovnkube-master-j9vbz   northd           983m         728Mi
ovnkube-master-j9vbz   ovnkube-master   6m           130Mi
ovnkube-master-j9vbz   sbdb             20m          3301Mi
ovnkube-master-mnbkk   northd           6m           551Mi
ovnkube-master-mnbkk   ovnkube-master   64m          248Mi
ovnkube-master-mnbkk   sbdb             4m           2277Mi
ovnkube-master-mnbkk   nbdb             24m          7463Mi

During this object creation storm, we also observe RAFT cluster connection errors like the ones below:

$ oc logs -f ovnkube-master-579sg -c nbdb --tail 100 | grep error
2020-07-10T15:17:48Z|00423|jsonrpc|WARN|unix#134: send error: Broken pipe
2020-07-10T15:17:48Z|00426|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00427|jsonrpc|WARN|ssl:10.0.134.4:57640: send error: Broken pipe
2020-07-10T15:17:48Z|00428|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00429|jsonrpc|WARN|ssl:10.0.192.134:57768: send error: Broken pipe
2020-07-10T15:17:48Z|00430|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00431|jsonrpc|WARN|ssl:10.0.134.4:57774: send error: Broken pipe
2020-07-10T15:17:48Z|00432|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00433|jsonrpc|WARN|ssl:10.0.168.157:59964: send error: Broken pipe
2020-07-10T15:17:48Z|00434|stream_ssl|WARN|SSL_write: system error (Broken pipe)
2020-07-10T15:17:48Z|00435|stream_ssl|WARN|SSL_write: system error (Broken pipe)

$ oc logs -f ovnkube-master-579sg -c sbdb --tail 100 | grep error
2020-07-10T15:10:49Z|00044|jsonrpc|WARN|ssl:10.0.192.134:37434: receive error: Protocol error
2020-07-10T15:10:49Z|00046|reconnect|WARN|ssl:10.0.192.134:37434: connection dropped (Protocol error)
2020-07-10T15:10:49Z|00048|jsonrpc|WARN|ssl:10.0.168.157:33454: receive error: Protocol error
2020-07-10T15:10:49Z|00050|reconnect|WARN|ssl:10.0.168.157:33454: connection dropped (Protocol error)
2020-07-10T15:22:37Z|00070|jsonrpc|WARN|unix#146: send error: Broken pipe
2020-07-10T15:22:37Z|00072|jsonrpc|WARN|ssl:10.0.192.134:45446: receive error: Protocol error
2020-07-10T15:22:37Z|00074|reconnect|WARN|ssl:10.0.192.134:45446: connection dropped (Protocol error)
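A quick way to confirm the OOM killer is hitting control plane processes is to check the kernel log on a master node. A sketch; the node name below is illustrative, substitute any master showing the symptom:

```
# Look for OOM killer activity in the kernel log of a master node.
# Node name is an example; pick any affected master.
oc debug node/ip-10-0-134-4.us-west-2.compute.internal -- \
  chroot /host journalctl -k --no-pager | grep -iE 'out of memory|oom-kill'
```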
Created attachment 1700619 [details] OVN logs

The attached OVN pod logs were taken after deploying, in a loop, 2000 Deployments with 1 replica each plus 2000 Services.
13G seems excessive for nbdb with a configuration of this size. So our hypothesis is that the nbdb pod grows in size and the openshift-apiserver pods end up being killed to free memory? That seems odd. Anil, can you please make sure the OVN and OVS RPM versions we are running on this setup have the fixes from Ilya and have the json-rpc inactivity timer disabled?
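A minimal sketch for checking this from one of the ovnkube-master pods (the pod name is taken from the `oc adm top pods` output above; the grep patterns are assumptions about the changelog wording):

```
# Verify the OVN/OVS package versions running in the nbdb container.
oc rsh -n openshift-ovn-kubernetes -c nbdb ovnkube-master-579sg \
  rpm -qa | grep -E 'ovn2.13|openvswitch2.13'

# Check whether the RAFT jsonrpc inactivity probe fix is in the changelog.
oc rsh -n openshift-ovn-kubernetes -c nbdb ovnkube-master-579sg \
  rpm -q --changelog openvswitch2.13 | grep -i 'inactivity probe'
```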
Found on a 4.5.0-rc7 aws/ovn cluster:

openvswitch2.13-2.13.0-29.el7fdp.x86_64
ovn2.13-2.13.0-31.el7fdp.x86_64

which corresponds to the versions tagged into rhaos-4.5-rhel-7-candidate via our SCR. That OVS version *does* contain the various RAFT fixes:

* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-29
- raft: Avoid sending equal snapshots. (#1834838) [65861440ed1baaa8dbe1148a5df230f3d8595e20]
* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-28
- ovsdb-server: Fix schema leak while reading db. (#1834838) [13f16c833f38da9c7bdec3615eed90eef5449239]
* Fri Jun 05 2020 Ilya Maximets <i.maximets> - 2.13.0-27
- ovsdb: Add raft memory usage to memory report. (#1834838) [a9d4cc1ca8060856fd296fa80d1d5edef131e60a]
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290) [b12acf45a6872dda85642cbc73dd86eb529be17e]
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729) [bb552cffb89104c2bb19b8aff749b8b825a6db13]
The nb-db server is definitely under memory stress here, but unfortunately the log file for the nb-db container that grows to 13G is corrupted, so we can't learn much from it. Looking at the other nb-db pods shows that:

(1) nb-db was at the 5-second election-timer value (reset from 16 seconds to 5 seconds during the restart):

```
2020-07-10T14:58:29Z|00025|raft|INFO|Election timer changed from 16000 to 5000
```

(2) This caused 4 partitions during the 6-8 minutes of the workload test.

(3) nb-db memory consumption increased from ~800 MB to ~8 GB (on follower nodes; the master node log is corrupted) during the 7-minute test:

```
2020-07-10T15:06:32Z|00026|memory|INFO|peak resident set size grew 56% in last 500.4 seconds, from 537608 kB to 838876 kB
2020-07-10T15:06:32Z|00027|memory|INFO|cells:27051 monitors:0 raft-connections:4
2020-07-10T15:07:22Z|00028|memory|INFO|peak resident set size grew 54% in last 50.0 seconds, from 838876 kB to 1289076 kB
2020-07-10T15:07:22Z|00029|memory|INFO|cells:30586 monitors:0 raft-connections:4
2020-07-10T15:08:22Z|00030|memory|INFO|peak resident set size grew 56% in last 60.0 seconds, from 1289076 kB to 2013316 kB
2020-07-10T15:08:22Z|00031|memory|INFO|cells:34495 monitors:0 raft-connections:4
2020-07-10T15:09:32Z|00032|memory|INFO|peak resident set size grew 51% in last 70.0 seconds, from 2013316 kB to 3042916 kB
2020-07-10T15:09:32Z|00033|memory|INFO|cells:39036 monitors:0 raft-connections:4
2020-07-10T15:11:02Z|00034|memory|INFO|peak resident set size grew 54% in last 90.0 seconds, from 3042916 kB to 4680016 kB
2020-07-10T15:11:02Z|00035|memory|INFO|cells:43987 monitors:0 raft-connections:4
2020-07-10T15:12:52Z|00036|memory|INFO|peak resident set size grew 54% in last 110.0 seconds, from 4680016 kB to 7197376 kB
2020-07-10T15:12:52Z|00037|memory|INFO|cells:49842 monitors:0 raft-connections:4
```

It seems like the nb-db hashmap is bloating aggressively and expanding:

```
2020-07-10T15:14:40Z|00047|coverage|INFO|hmap_expand 1139.6/sec 470.000/sec 122.1694/sec total: 439810
2020-07-10T15:14:40Z|00056|coverage|INFO|util_xalloc 210649.0/sec 728463.650/sec 85936.6717/sec total: 309372018
```

We need to profile nb-db's memory usage to see what's causing this bloating. I am working on recreating the issue; once we have more details on the memory bloating, we need to involve the OVN team to dig into it further.
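For reference, the RAFT role, election timer and the ovsdb-server memory report can be pulled directly from the nbdb container, roughly as in the sketch below (the control socket path is an assumption based on the ovsdb-server command line seen later in this bug):

```
# Cluster status (role, term, election timer, connections) for the NB DB.
# Socket path is an assumption; adjust to match the running ovsdb-server.
oc exec -n openshift-ovn-kubernetes ovnkube-master-579sg -c nbdb -- \
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

# Memory report of the ovsdb-server process (cells, monitors, raft data).
oc exec -n openshift-ovn-kubernetes ovnkube-master-579sg -c nbdb -- \
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show
```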
This bug is somewhat similar to the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1838334

That bug reports a similar issue, but during horizontal scaling of the number of nodes. At scale, it ends up killing the sb-db pod and the system never recovers. Given that nb-db and sb-db share the same code base underneath, the root cause might be the same.
This scale test surfaced multiple issues related to memory bloating. I recreated the issue twice on my scale setup, and it's not just the pods running on the master nodes that are bloating; the OVN pods running on the worker nodes are bloating as well (consuming around 20G). I opened the following two bugs that need to be addressed for this scale test to pass:

https://bugzilla.redhat.com/show_bug.cgi?id=1859883
https://bugzilla.redhat.com/show_bug.cgi?id=1859924
Let's use this bug to track only the nbdb memory bloating issue.
Recreated the issue. The moment we hit the scale test failure, the memory consumption of the nbdb containers was:

ovnkube-master-49429   nbdb   2m   12087Mi
ovnkube-master-dd7gs   nbdb   2m   16545Mi
ovnkube-master-jhjwf   nbdb   4m   12472Mi

and the NB DB file sizes were:

-rw-r--r--. 1 root root 51M Aug 7 00:50 ./ovnkube-master-49429/ovnnb_db.db
-rw-r--r--. 1 root root 13M Aug 7 00:50 ./ovnkube-master-dd7gs/ovnnb_db.db
-rw-r--r--. 1 root root 52M Aug 7 00:53 ./ovnkube-master-jhjwf/ovnnb_db.db

The tests were using the following OVN version:

sh-4.2# rpm -qa | grep ovn
ovn2.13-2.13.0-31.el7fdp.x86_64
ovn2.13-central-2.13.0-31.el7fdp.x86_64
ovn2.13-host-2.13.0-31.el7fdp.x86_64
ovn2.13-vtep-2.13.0-31.el7fdp.x86_64

Log files and db files are present in the log attachment. I also collected the cluster status and db file sizes periodically (every 5 minutes) to track the partitions and file size growth.
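A minimal sketch of that collection loop, for reference (the socket and DB file paths are assumptions and may differ by release, e.g. /etc/ovn vs /etc/openvswitch):

```
# Every 5 minutes, record RAFT role/term and on-disk NB DB size per master.
while true; do
  date
  for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
               -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${pod}"
    oc exec -n openshift-ovn-kubernetes "${pod}" -c nbdb -- \
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound \
      | grep -E 'Role|Term'
    oc exec -n openshift-ovn-kubernetes "${pod}" -c nbdb -- \
      ls -lh /etc/ovn/ovnnb_db.db   # DB path is an assumption
  done
  sleep 300
done
```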
Uploaded the logs here, as Bugzilla doesn't allow files larger than 20M:
https://drive.google.com/file/d/18dIf6qNP3IQvQOlVAOVjH6ppZuF-jKoV/view?usp=sharing
Thanks, Anil. I'm looking at the logs now. Could you please also confirm that it uses openvswitch2.13-2.13.0-29.el7fdp.x86_64?
Hi Ilya, yes, it's using the same version:

# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-29.el7fdp.x86_64
openvswitch2.13-devel-2.13.0-29.el7fdp.x86_64
Short summary of the email thread discussion:

The main part of the Northbound DB size is the same set of ACLs repeated for each logical switch. There are 100 logical switches in this test, each of them has 1980 ACLs, and these ACLs are identical across logical switches. Just storing them in a fully compacted DB takes ~9.5 MB. One transaction to add/remove one ACL on all logical switches involves approximately the same amount of data being constructed, sent, parsed, replicated within the RAFT cluster, and stored on disk. Adding/removing all of these ACLs amounts to ~9.4 GB of traffic. That actually grows the size of the NB DB to several GB during the test and might be one of the things that contributes to the process memory consumption, since all of this data is stored somewhere in the process memory at some point.

A solution for this might be to use port groups for this kind of ACL in ovn-k8s instead of adding ACLs to each logical switch. This way we specify the list of ACLs only once in the DB, saving a lot of DB space and, probably, a bit of processing time.

These are reject ACLs that look like this:

ovs-nbctl: run --id=@acl -- create acl direction=from-lport priority=1000 \"match=\\\"ip4.dst==172.30.246.221 && tcp && tcp.dst==5443\\\"\" action=reject \"name=98416181-8369-413c-aefb-6bb940850568-172.30.246.221\\\\:5443\" -- add logical_switch cf601302-50f3-4b93-ae3c-541817e3c20f acls @acl -- add logical_switch f7895104-517f-4f79-b8c7-7cb2f771d612 acls @acl -- add logical_switch 41055f06-5098-42fc-bb05-9c08c4d1c248 acls @acl ...

Anil is going to cook up a patch to use a Port Group for these in ovn-k8s and test it. BTW, it's not clear whether this is the root cause of the high memory consumption, but we'll see once the patch and testing are done. This is a good change in any case.
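For illustration, the port-group shape of the same reject ACL could look roughly like the ovn-nbctl sketch below. The group name is made up, and the real ovn-k8s change drives this through its ovsdb client rather than nbctl:

```
# Create a single cluster-wide port group and attach the reject ACL once,
# instead of adding the same ACL to all 100 logical switches.
# "clusterPortGroup" and the match are illustrative only.
ovn-nbctl pg-add clusterPortGroup
ovn-nbctl --type=port-group acl-add clusterPortGroup from-lport 1000 \
  'ip4.dst==172.30.246.221 && tcp && tcp.dst==5443' reject

# Logical switch ports are then added to the group as needed; the ACL is
# evaluated on the switches that contain those ports.
ovn-nbctl pg-set-ports clusterPortGroup <lsp-name-or-uuid> ...
```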
Hi, Anil. Did you have a chance to test memory consumption with Port Groups used for reject ACLs? I see the work-in-progress pull request created by Dan: https://github.com/openshift/ovn-kubernetes/pull/233 But there was no activity for a long time.
Once ovn-k8s moves to a port group to apply the LB deny ACLs, cleaning up the old ACLs during the upgrade can be a bit tricky. We do use a name for these ACLs (lb-<source-ip>-<source-port>), which can be used to clean up the related old ACLs from the Logical_Switch acls column, but this can be fairly compute-intensive if the upgrade is triggered on a large-scale cluster with thousands of services. The current plan is to create a port group per node and move all the Logical_Switch ACLs to the port group (including the LB, egress firewall, and network policy ACLs). Then, during the upgrade, cleaning up the old ACLs just requires clearing the acls column in the Logical_Switch table, as in the sketch below. That would be a cleaner approach for the upgrade and gives a better structure going forward for dealing with similar kinds of ACLs.
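Assuming all ACL types really have been moved off the Logical_Switch table, the upgrade cleanup could be as simple as this hypothetical sketch (the blanket clear is an assumption; the real migration lives in the ovn-k8s code):

```
# Hypothetical upgrade cleanup: drop the per-switch ACL references once the
# ACLs have been re-created on port groups. Only safe after ALL ACL types
# (LB reject, egress firewall, network policy) have been migrated.
for ls in $(ovn-nbctl --bare --columns=name list Logical_Switch); do
  ovn-nbctl clear Logical_Switch "${ls}" acls
done
```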
During an IRC discussion with Anil, we concluded that a port group per node will not solve the issue, because we would still have all the same repeated ACLs, just in a different DB table. So it should be a single cluster-wide port group for the reject ACLs. Reassigning this to Anil as he is working on the fix. Also changing the component to ovn-kubernetes as the fix should be implemented there.
The following PR contains the patch for this issue. I am currently testing it at a higher scale and profiling the nb/sb DB stats.

https://github.com/ovn-org/ovn-kubernetes/pull/1711
Hi Ilya,

I ran the scale test with the above PR (applying ACLs on a Port Group). This helps reduce the nbdb memory consumption significantly. I ran the same test, and this time the test passed. nbdb memory consumption was as follows for all the nodes:

```
ovnkube-master-c2tkj:
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 4989868  4.6g  8168 S   0.0  3.7 20:52.34 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+

ovnkube-master-tb9rt (master):
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 5471368  5.0g  8036 S   0.0  4.0 24:02.01 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+

ovnkube-master-vf4r7:
PID USER  PR  NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
  1 root  20   0 4745824  4.3g  8152 S   0.0  3.5  7:07.13 ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --un+
```

So the overall memory consumption with the above patch, at a scale of 100 nodes, 1000 projects, 3000+ services and 16300 pods, is around 5G for nbdb, compared to 12-16G before.

At this point I am not sure whether 5G is the expected memory consumption for the nbdb process at this scale.

I was not able to collect memory consumption stats periodically, so I don't know the rate of growth or whether the memory consumption settled down or was constantly growing with the number of projects/services. I will profile that in the next scale test.

I collected logs (db files and other logs) from this scale test and uploaded them to the following link. Please have a look, and if there is something that can further reduce this memory consumption, please let us know.

https://drive.google.com/file/d/1hSq5f93kNp1PwY81fifWrCuaKnhaymjh/view?usp=sharing
(In reply to Anil Vishnoi from comment #27)
> Hi Ilya,
>
> I ran the scale test with the above PR (applying ACLs on a Port Group). This
> helps reduce the nbdb memory consumption significantly. I ran the same test,
> and this time the test passed. nbdb memory consumption was as follows for
> all the nodes:
> ...
>
> So the overall memory consumption with the above patch, at a scale of 100
> nodes, 1000 projects, 3000+ services and 16300 pods, is around 5G for nbdb,
> compared to 12-16G before.
>
> At this point I am not sure whether 5G is the expected memory consumption
> for the nbdb process at this scale.
>
> I was not able to collect memory consumption stats periodically, so I don't
> know the rate of growth or whether the memory consumption settled down or
> was constantly growing with the number of projects/services. I will profile
> that in the next scale test.
>
> I collected logs (db files and other logs) from this scale test and uploaded
> them to the following link. Please have a look, and if there is something
> that can further reduce this memory consumption, please let us know.

Hi. Thanks for this data. I looked at the Northbound database files and this is already a good improvement.

After your change, the main contributor to memory consumption is the set of transactions that add ports to this one port group.

Looking at the database at the end of the test, there are ~16K ports in this group. One transaction to add or remove a single port from this port group generates about 700 KB of data. Adding all 16K ports one by one therefore generates ~5-6 GB of traffic. Taking into account that this traffic is replicated within the RAFT cluster and parsed in a few places while executing/storing transactions and compacting the database, I could imagine 6 GB of process memory being used at some point. This doesn't mean it could not be improved by reclaiming some memory back to the system, but the value still seems reasonable.

If you want to further reduce memory consumption, there are possible solutions:

1. Batching of these transactions. If it's possible to add all ports of one service in one transaction, this will reduce the amount of traffic significantly, i.e. instead of 16K transactions (one per port) we will have only 3K transactions (one per service), e.g. a ~5x improvement (speculative number).

2. Dumitru pointed out that there is no need to add all ports to this port group, because OVN currently creates ACLs per logical switch and not per port. So you only need one port per logical switch added to this port group. Dumitru said that there is a management port on each logical switch, so it should be possible to add only that one port to the port group, and remove it from the port group when there are no more ports that need reject ACLs on that logical switch. Dumitru might correct me if I didn't get this correctly.
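For what option (1) would look like mechanically: ovn-nbctl (and the ovsdb clients ovn-k8s uses) can fold several mutations into a single transaction. A sketch; the port group name and port IDs are placeholders:

```
# One ovsdb transaction instead of three: the "--"-separated commands are
# committed together, so the RAFT log records a single entry.
ovn-nbctl \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-1> \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-2> \
  -- add Port_Group clusterPortGroup ports <lsp-uuid-3>
```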
(In reply to Ilya Maximets from comment #31)
> (In reply to Anil Vishnoi from comment #27)
> > ...
>
> Hi. Thanks for this data. I looked at the Northbound database files and this
> is already a good improvement.
>
> After your change, the main contributor to memory consumption is the set of
> transactions that add ports to this one port group.
> ...
>
> If you want to further reduce memory consumption, there are possible
> solutions:
>
> 1. Batching of these transactions. If it's possible to add all ports of one
>    service in one transaction, this will reduce the amount of traffic
>    significantly, i.e. instead of 16K transactions (one per port) we will
>    have only 3K transactions (one per service), e.g. a ~5x improvement
>    (speculative number).
>
> 2. Dumitru pointed out that there is no need to add all ports to this port
>    group, because OVN currently creates ACLs per logical switch and not per
>    port. So you only need one port per logical switch added to this port
>    group. Dumitru said that there is a management port on each logical
>    switch, so it should be possible to add only that one port to the port
>    group, and remove it from the port group when there are no more ports
>    that need reject ACLs on that logical switch. Dumitru might correct me if
>    I didn't get this correctly.

I believe (1) is not possible, because logical ports are created as and when containers are created, so batching them would be tricky and would probably break some of the Kubernetes conformance tests as well. If (2) works, then that is the simpler solution and we don't really need (1). I will try (2); I think it will reduce the memory usage significantly.
Ilya, after implementing (2) above, the memory usage went down from 5G to 500MB (https://github.com/ovn-org/ovn-kubernetes/pull/1711), which is a significant improvement. With the reduced memory consumption I have not seen any partitions in the nb-db cluster either, so it increased the overall stability of the RAFT cluster as well.
@mifiedle Can you help verify this? Let me know; I can also hop onto the setup if required. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633