Bug 1884049

Summary: [ovn-controller] memory utilization high across all worker nodes
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Reporter: Joe Talerico <jtaleric>
Assignee: Anil Vishnoi <avishnoi>
QA Contact: Anurag saxena <anusaxen>
CC: aconstan, bbennett, dcbw, mark.d.gray, trozet
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: All
OS: All
Whiteboard: aos-scalability-46
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2020-11-18 20:02:23 UTC

Description Joe Talerico 2020-09-30 20:33:37 UTC
Description of problem:
ovn-controller across all the worker nodes is experiencing high memory utilization.

Size of the deployment and objects:

root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get services -A | wc -l
3005
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get nodes | grep Ready | wc -l
107
root@ip-172-31-68-73: ~/e2e-benchmarking/workloads/network-perf # oc get pods -A | wc -l
14738

ovn-controller memory utilization after the test: see [1] under Actual results.
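For reference, per-node ovn-controller memory can be spot-checked with something like the following (this assumes the stock openshift-ovn-kubernetes namespace and container names of a default OVN-Kubernetes install; <ovnkube-node-pod> is a placeholder):

# RSS of the ovn-controller container on every node
oc adm top pods -n openshift-ovn-kubernetes --containers | grep ovn-controller

# ovn-controller's own memory accounting on one node
oc exec -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller -- \
    ovn-appctl -t ovn-controller memory/show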

Version-Release number of selected component (if applicable):
OCP4.6-Nightly

How reproducible:
N/A

Steps to Reproduce:
1. Deploy a 100-worker-node OCP 4.6 cluster
2. Run the mastervertical 1000 test (1000 projects)


Actual results:
[1] https://snapshot.raintank.io/dashboard/snapshot/A9m9EVRSvBedYqPbNPWMjbRyQAclzH0Q?orgId=2

Expected results:


Additional info:
pprof data https://coreos.slack.com/archives/CU9HKBZKJ/p1601486618158900

Comment 1 Ben Bennett 2020-10-01 15:46:48 UTC
This is a good candidate for a 4.6.z backport once resolved.

Comment 2 Dan Williams 2020-10-12 15:22:02 UTC
@avishnoi is this related to the reject ACL flow explosion?

Comment 3 Anil Vishnoi 2020-11-11 06:46:35 UTC
(In reply to Dan Williams from comment #2)
> @avishnoi is this related to the reject ACL flow explosion?

Part of it, but Numan's patches to reduce the number of flows would also help here. Once all of those patches and the ACL patches are in our nightly, we need to re-run this test and check ovn-controller memory consumption again. Currently I believe the high memory consumption is due to the number of flows installed on each individual worker node (around 2M).
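For a rough check of the per-node flow count, something like the following can be run on a worker node (or inside its OVS/ovnkube-node pod; the exact pod layout varies by release, and the -O option may be needed for OpenFlow version negotiation with br-int):

# count the OpenFlow flows ovn-controller has programmed on br-int
ovs-ofctl -O OpenFlow13 dump-flows br-int | wc -l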

Comment 4 Tim Rozet 2020-11-18 20:02:23 UTC
SBDB and OpenFlow reduction are covered by bug 1859924. Duping this bug to that one; if you still see this after 1859924 is resolved, please reopen.
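Once 1859924 lands, the logical flow count in the SBDB can be re-checked with something like the following (run from a pod with access to the southbound database, e.g. an ovnkube-master pod; assumes the default OVN socket paths):

# count Logical_Flow rows in the southbound DB
ovn-sbctl --no-leader-only list Logical_Flow | grep -c '^_uuid'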

*** This bug has been marked as a duplicate of bug 1859924 ***