Bug 1906033 - [RFE] [OVN SCALE] limit ovn-controller lflow cache to reduce memory usage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: FDP 20.E
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dumitru Ceara
QA Contact: Ehsan Elahi
URL:
Whiteboard:
Depends On: 1905680
Blocks:
Reported: 2020-12-09 14:55 UTC by Numan Siddique
Modified: 2023-03-13 07:11 UTC (History)

Fixed In Version: ovn-2021-21.03.0-25.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1905680
Environment:
Last Closed: 2023-03-13 07:11:19 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-974 0 None None None 2021-10-11 02:54:29 UTC

Description Numan Siddique 2020-12-09 14:55:41 UTC
+++ This bug was initially created as a clone of Bug #1905680 +++

Description of problem:

1. On a 10 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~115MB of RSS memory

2. On a 10 node cluster with 100 user projects containing 200 pods and services (20 per node for each), ovn-controller uses ~150MB of RSS

3. On a 100 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~760MB of RSS memory, roughly 7x step #1

4. On a 100 node cluster with 1000 user projects containing 2000 pods and services (20 per node for each, as in #2), ovn-controller uses 3.3 GB of RSS, roughly 20x step #2

The implication is that larger instance sizes are required to run the exact same per-node workload on OVN clusters with more nodes.

In the above test on AWS, m5.large instances worked fine in a 10 node cluster with 20 pods + 20 services per node, but went NotReady and ran out of memory (OOM) due to ovn-controller memory growth for the same workload in a 100 node cluster. Instance sizes had to be doubled to run the workload successfully.
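The growth factors quoted in steps 3 and 4 can be checked with quick arithmetic. The sketch below uses only the RSS figures from this report and treats 3.3 GB as 3300 MB, which is an assumption about the units used.

```python
# RSS figures in MB as quoted in the description (3.3 GB assumed to be 3300 MB).
rss_mb = {
    ("10 nodes", "idle"): 115,
    ("10 nodes", "loaded"): 150,
    ("100 nodes", "idle"): 760,
    ("100 nodes", "loaded"): 3300,
}


def growth(scenario):
    """Ratio of ovn-controller RSS at 100 nodes to RSS at 10 nodes."""
    return rss_mb[("100 nodes", scenario)] / rss_mb[("10 nodes", scenario)]


idle_ratio = growth("idle")      # 760 / 115
loaded_ratio = growth("loaded")  # 3300 / 150
print(f"idle: {idle_ratio:.1f}x, loaded: {loaded_ratio:.1f}x")
```

The per-node workload is identical in both loaded cases, so any ratio well above 1x reflects cluster size rather than local load.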


Version-Release number of selected component (if applicable):  4.7.0-0.nightly-2020-12-04-013308


How reproducible: Always


Steps to Reproduce:
1. Create a 10 node cluster in AWS with 3 m5.4xlarge masters and 10 m5.large nodes
2. Create 100 projects each with 2 deployments, each with 1 pod + 1 svc.   20 pods + 20 services/node.  Note the ovn-controller memory size on a few nodes.
3. Scale the cluster up to 100 m5.large nodes
4. Create 1000 projects each with 2 deployments, each with 1 pod + 1 svc.   20 pods + 20 services/node.
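One possible way to generate the workload in steps 2 and 4 is sketched below. This is not necessarily the tooling used for the original test: the namespace and deployment names, the image, and the port are hypothetical placeholders, and the commands are only built as argument lists, not executed.

```python
def workload_commands(num_projects, deployments_per_project=2,
                      image="registry.example.com/pause:latest",  # hypothetical image
                      port=8080):                                  # hypothetical port
    """Yield oc command lines that create projects, deployments and services."""
    for p in range(num_projects):
        ns = f"scale-test-{p}"  # hypothetical project name
        yield ["oc", "new-project", ns]
        for d in range(deployments_per_project):
            name = f"app-{d}"
            # Each deployment defaults to one replica, i.e. one pod.
            yield ["oc", "create", "deployment", name, f"--image={image}", "-n", ns]
            # Exposing the deployment creates one service for it.
            yield ["oc", "expose", "deployment", name, f"--port={port}", "-n", ns]


cmds = list(workload_commands(100))
pods = sum(1 for c in cmds if c[1:3] == ["create", "deployment"])
services = sum(1 for c in cmds if c[1] == "expose")
# Per-node figure assumes the scheduler spreads pods evenly over 10 nodes.
print(pods, services, pods // 10)
```

Each list could be passed to `subprocess.run(cmd, check=True)` to actually create the objects; calling `workload_commands(1000)` gives the step 4 workload.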

Actual results:  Nodes will start going NotReady and become unresponsive.   Nodes that are still responsive will show ovn-controller memory usage in excess of 3.2GB


Expected results:

ovn-controller memory usage on a node grows in proportion to the workload on the node, not the number of nodes in the cluster.   A node that can handle 20 pods and services at 10 node scale can handle the same workload at 100 node scale without ovn-controller requiring 20x the memory.

Additional info:

Let me know if must-gather would help or what other logs you might need.

--- Additional comment from Mike Fiedler on 2020-12-08 20:48:05 UTC ---

Deleting the 1000 projects (2000 pods/services) caused ovn-controller RSS to go up to 4.5Gi - 5Gi.
Re-creating the projects caused it to go up again to ~6.2GB.

--- Additional comment from Dan Williams on 2020-12-09 14:46:51 UTC ---

Mike, can you attach the southbound DB and the OVS flow dumps (ovs-ofctl dump-flows br-int) when the ovn-controller memory usage grows beyond 700MB RSS?

--- Additional comment from Numan Siddique on 2020-12-09 14:51:46 UTC ---

When you notice the huge memory usage in ovn-controller, can you please run the command below and see if it reduces the memory usage?

ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false

Please give it a minute or two. When you run the above command, ovn-controller will disable the cache and recompute all the logical flows.
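A minimal sketch of how the before/after comparison could be captured, assuming it runs as root in the same PID namespace as ovn-controller and that `pidof` is available. The helper names are hypothetical; only the `ovs-vsctl` command itself comes from this comment.

```python
import re
import subprocess
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def rss_kb(pid):
    """Read the current resident set size of a process in kB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def measure_cache_disable(wait_seconds=120):
    """Disable the lflow cache and return (RSS before, RSS after) in kB."""
    pid = int(subprocess.check_output(["pidof", "ovn-controller"]).split()[0])
    before = rss_kb(pid)
    subprocess.run(
        ["ovs-vsctl", "set", "open", ".",
         "external_ids:ovn-enable-lflow-cache=false"],
        check=True,
    )
    time.sleep(wait_seconds)  # allow time for the logical flow recompute
    return before, rss_kb(pid)
```

Calling `measure_cache_disable()` on an affected node returns the two readings; the default wait of 120 seconds follows the "minute or two" suggested above.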

Thanks

Comment 1 Dumitru Ceara 2021-01-28 16:26:59 UTC
Code posted upstream for review: http://patchwork.ozlabs.org/project/ovn/list/?series=226887&state=*

