Bug 1905680 - ovnkube-node/ovn-controller does not scale - requires 20x the memory for same node workload at 10 and 100 node scale
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks: 1906033
 
Reported: 2020-12-08 19:59 UTC by Mike Fiedler
Modified: 2021-01-27 10:43 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1906033
Environment:
Last Closed: 2021-01-27 10:43:37 UTC
Target Upstream Version:
Embargoed:



Description Mike Fiedler 2020-12-08 19:59:55 UTC
Description of problem:

1. On a 10 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~115MB of RSS memory

2. On a 10 node cluster with 100 user projects containing 200 pods and services (20 per node for each), ovn-controller uses ~150MB of RSS

3. On a 100 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~760MB of RSS memory, roughly 7x step #1

4. On a 100 node cluster with 1000 user projects containing 2000 pods and services (20 per node for each, as in #2), ovn-controller uses 3.3 GB of RSS, roughly 20x step #2
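For reference, per-container figures like these can be pulled with a command along these lines (a sketch; the openshift-ovn-kubernetes namespace is an assumption based on the default OVN-Kubernetes deployment, and the app=ovnkube-node label is the one used later in this bug):

# RSS of the ovn-controller container in every ovnkube-node pod
oc adm top pods -n openshift-ovn-kubernetes -l app=ovnkube-node --containers | grep ovn-controller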

The implication is that larger instance sizes are required to run the exact same node workload on OVN clusters with more nodes.   

In the above test on AWS, m5.large instances worked fine in a 10 node cluster for 20 pods + 20 services per node, but went NotReady and OOMed due to ovn-controller memory growth for the same workload in a 100 node cluster. Instance sizes had to be doubled to run the workload successfully.


Version-Release number of selected component (if applicable):  4.7.0-0.nightly-2020-12-04-013308


How reproducible: Always


Steps to Reproduce:
1. Create a 10 node cluster in AWS with 3 m5.4xlarge masters and 10 m5.large nodes
2. Create 100 projects each with 2 deployments, each with 1 pod + 1 svc.   20 pods + 20 services/node.  Note the ovn-controller memory size on a few nodes.
3. Scale the cluster up to 100 m5.large nodes
4. Create 1000 projects each with 2 deployments, each with 1 pod + 1 svc. 20 pods + 20 services/node (a scripted sketch of this object creation follows below).
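A scripted sketch of the object creation in steps 2 and 4 (the project names, deployment names and image are illustrative, not the exact tooling used in the test; adjust the project count to the cluster size):

# 1000 projects x 2 deployments, each deployment = 1 pod + 1 service
for i in $(seq 1 1000); do
  oc new-project "scale-test-${i}"
  for d in 1 2; do
    oc create deployment "app-${d}" --image=k8s.gcr.io/pause:3.2 -n "scale-test-${i}"
    oc expose deployment "app-${d}" --port=8080 -n "scale-test-${i}"
  done
done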

Actual results:  Nodes will start going NotReady and become unresponsive.   Nodes that are still responsive will show ovn-controller memory usage in excess of 3.2GB


Expected results:

ovn-controller memory usage on a node grows in proportion to the workload on the node, not the number of nodes in the cluster.   A node that can handle 20 pods and services at 10 node scale can handle the same workload at 100 node scale without ovn-controller requiring 20x the memory.

Additional info:

Let me know if must-gather would help or what other logs you might need.

Comment 1 Mike Fiedler 2020-12-08 20:48:05 UTC
Deleting the 1000 projects, 2000 pods/services caused ovn-controller usage to go up to 4.5Gi - 5Gi RSS
Re-creating the projects caused it to go up again to ~6.2GB

Comment 2 Dan Williams 2020-12-09 14:46:51 UTC
Mike, can you attach the southbound DB and the OVS flow dumps (ovs-ofctl dump-flows br-int) when the ovn-controller memory usage grows > 700MB RSS?

Comment 3 Numan Siddique 2020-12-09 14:51:46 UTC
When you notice the huge memory usage in ovn-controller, can you please run the command below and see if it reduces the memory?

ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false

Please give it a minute or two: when you run the above command, ovn-controller disables the logical flow cache and recomputes all the logical flows.
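One way to check the effect on a node (a sketch; assumes shell access to the host or to the ovn-controller container, with pgrep available there):

# after setting the option above, wait a minute or two, then check ovn-controller RSS
grep VmRSS /proc/$(pgrep -xo ovn-controller)/status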

Thanks

Comment 5 Mike Fiedler 2020-12-09 15:06:11 UTC
Will repro today and gather info requested in comment 2 and comment 3

Comment 6 Numan Siddique 2020-12-09 17:37:51 UTC
If disabling the cache addresses this issue, I think we can close this BZ. I raised another BZ to add an option to configure the cache limit - https://bugzilla.redhat.com/show_bug.cgi?id=1906033

Comment 8 Mike Fiedler 2020-12-09 20:08:37 UTC
Tried running ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false again after deleting projects/pods/svcs, and ovn-controller RSS remained unchanged.

I am running the above command on a node I am watching - let me know if I should be doing it somewhere else.

Comment 10 Raul Sevilla 2020-12-16 11:38:52 UTC
Hi Numan,

While doing some tests in a 100 node cluster with the RPMs you mentioned, I also took a look at this BZ.

As you suggested, I configured "ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false" in only one ovn-controller pod and restarted it, then generated some objects (500 services + 500 namespaces + 2500 pods):

root@ip-172-31-71-55: /tmp # oc get node | grep -c worker
100
root@ip-172-31-71-55: /tmp # oc get ns | wc -l
551
root@ip-172-31-71-55: /tmp # oc get pod -A |  wc -l
3849
root@ip-172-31-71-55: /tmp # oc get svc -A |  wc -l
566

# This is the pod with lflow caching disabled
root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc rsh ovnkube-node-wvmg6
Defaulting container name to ovn-controller.
sh-4.4# ovs-vsctl get open . external_ids:ovn-enable-lflow-cache
"false"

# The 10 ovnkube-node pods with the lowest memory usage (the cache-disabled pod is last)
root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods -l app=ovnkube-node  | sort -k3 -r | tail -10
ovnkube-node-vzfx7   4m           755Mi           
ovnkube-node-t7bns   3m           755Mi           
ovnkube-node-q2fh9   4m           754Mi           
ovnkube-node-m88km   4m           754Mi           
ovnkube-node-qh9wl   4m           753Mi           
ovnkube-node-plzjj   3m           753Mi           
ovnkube-node-knj6x   3m           753Mi           
ovnkube-node-86s8j   4m           753Mi           
ovnkube-node-xqs5s   2m           751Mi           
ovnkube-node-wvmg6   2m           395Mi  

Container breakdown:

root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods ovnkube-node-wvmg6  --containers  
POD                  NAME              CPU(cores)   MEMORY(bytes)   
ovnkube-node-wvmg6   kube-rbac-proxy   0m           16Mi            
ovnkube-node-wvmg6   ovn-controller    0m           331Mi           
ovnkube-node-wvmg6   ovnkube-node      4m           47Mi     


And the pod with the next-lowest usage (lflow cache still enabled):
root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods ovnkube-node-xqs5s  --containers  
POD                  NAME              CPU(cores)   MEMORY(bytes)   
ovnkube-node-xqs5s   kube-rbac-proxy   0m           15Mi            
ovnkube-node-xqs5s   ovn-controller    0m           683Mi           
ovnkube-node-xqs5s   ovnkube-node      4m           51Mi      


I can confirm a memory usage reduction after disabling lflow caching. However, we still have to quantify side effects such as higher CPU usage and higher latency from ovn-controller.
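To quantify the CPU side of the trade-off, a sampling loop along these lines could be used (a sketch; it reuses the same label selector as above and samples once a minute):

# sample CPU/memory of every ovn-controller container over time
while true; do
  date -u
  oc adm top pods -l app=ovnkube-node --containers | grep ovn-controller
  sleep 60
done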

Comment 11 Numan Siddique 2021-01-27 10:42:42 UTC
Closing this BZ, as https://bugzilla.redhat.com/show_bug.cgi?id=1906033 - which adds support for limiting the lflow cache - will handle the memory usage issue.

