+++ This bug was initially created as a clone of Bug #1905680 +++

Description of problem:

1. On a 10 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~115MB of RSS memory
2. On a 10 node cluster with 100 user projects containing 200 pods and 200 services (20 per node for each), ovn-controller uses ~150MB of RSS
3. On a 100 node cluster with no user projects, just OOTB pods/services, ovn-controller uses ~760MB of RSS memory, roughly 7x step #1
4. On a 100 node cluster with 1000 user projects containing 2000 pods and 2000 services (20 per node for each, as in #2), ovn-controller uses 3.3 GB of RSS, roughly 20x step #2

The implication is that larger instance sizes are required to run the exact same per-node workload on OVN clusters with more nodes. In the test above on AWS, m5.large instances worked fine in a 10 node cluster for 20 pods + 20 services per node, but went NotReady and OOMed due to ovn-controller memory growth for the same workload in a 100 node cluster. Instance sizes had to be doubled to run the workload successfully.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-04-013308

How reproducible: Always

Steps to Reproduce:
1. Create a 10 node cluster in AWS with 3 m5.4xlarge masters and 10 m5.large nodes
2. Create 100 projects, each with 2 deployments, each with 1 pod + 1 svc (20 pods + 20 services per node). Note the ovn-controller memory size on a few nodes.
3. Scale the cluster up to 100 m5.large nodes
4. Create 1000 projects, each with 2 deployments, each with 1 pod + 1 svc (20 pods + 20 services per node).

Actual results:
Nodes start going NotReady and become unresponsive. Nodes that are still responsive show ovn-controller memory usage in excess of 3.2 GB.

Expected results:
ovn-controller memory usage on a node grows in proportion to the workload on that node, not the number of nodes in the cluster. A node that can handle 20 pods and services at 10 node scale should be able to handle the same workload at 100 node scale without ovn-controller requiring 20x the memory.

Additional info:
Let me know if a must-gather would help or what other logs you might need.

--- Additional comment from Mike Fiedler on 2020-12-08 20:48:05 UTC ---

Deleting the 1000 projects (2000 pods/services) caused ovn-controller usage to go up to 4.5-5 GiB RSS. Re-creating the projects caused it to go up again, to ~6.2 GB.

--- Additional comment from Dan Williams on 2020-12-09 14:46:51 UTC ---

Mike, can you attach the southbound DB and the OVS flow dumps (ovs-ofctl dump-flows br-int) when ovn-controller memory usage grows above 700 MB RSS?

--- Additional comment from Numan Siddique on 2020-12-09 14:51:46 UTC ---

When you notice the huge memory usage in ovn-controller, can you please run the command below and see if it reduces the memory?

ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false

Please give it a minute or two. When you run the above command, ovn-controller will disable the logical flow cache and recompute all the logical flows.
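For anyone reproducing this, a rough sketch of how the per-node workload and the RSS numbers above could be generated and collected. The project/deployment names, the image, and the use of "oc debug" are illustrative assumptions, not a record of the exact commands that were run:

  # Create N projects, each with 2 deployments and 2 services (1 pod + 1 svc each).
  # With 100 projects on 10 nodes this gives ~20 pods + 20 services per node.
  for i in $(seq 1 100); do
    oc new-project scale-test-$i
    for j in 1 2; do
      oc -n scale-test-$i create deployment app-$j --image=k8s.gcr.io/pause:3.2   # image is a placeholder
      oc -n scale-test-$i expose deployment app-$j --port=8080
    done
  done

  # Check ovn-controller RSS (in KB) on a node, e.g. via "oc debug node/<node>" and "chroot /host":
  ps -C ovn-controller -o rss,cmd

The same RSS check before and after setting ovn-enable-lflow-cache=false (per the comment above) shows whether the logical flow cache accounts for the growth.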
Code posted upstream for review: http://patchwork.ozlabs.org/project/ovn/list/?series=226887&state=*
The accepted series was http://patchwork.ozlabs.org/project/ovn/list/?series=228804&state=%2A&archive=both