Bug 1905680

Summary: ovnkube-node/ovn-controller does not scale - requires 20x the memory for same node workload at 10 and 100 node scale

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Mike Fiedler <mifiedle> |
| Component: | ovn2.13 | Assignee: | OVN Team <ovnteam> |
| Status: | CLOSED DEFERRED | QA Contact: | Jianlin Shi <jishi> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | RHEL 8.0 | CC: | aconstan, anusaxen, avishnoi, ctrautma, dblack, dcbw, jishi, mark.d.gray, mkarg, nusiddiq, ralongi, rsevilla |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1906033 (view as bug list) | Environment: | |
| Last Closed: | 2021-01-27 10:43:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1906033 | | |
Description (Mike Fiedler, 2020-12-08 19:59:55 UTC)
Deleting the 1000 projects (2000 pods/services) caused ovn-controller memory usage to go up to 4.5-5 GiB RSS. Re-creating the projects caused it to grow again, to ~6.2 GB.

---

Mike, can you attach the southbound DB and the OVS flow dumps (`ovs-ofctl dump-flows br-int`) when the ovn-controller memory usage grows above 700 MB RSS? (A collection sketch is appended at the end of this thread.)

When you notice the huge memory usage in ovn-controller, can you please run the command below and see whether it reduces the memory?

    ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false

Please give it some time, a minute or two: when you run this command, ovn-controller disables the caching and recomputes all the logical flows. (A before/after measurement sketch is likewise appended below.) Thanks.

If disabling the cache addresses this issue, I think we can close this BZ. I raised another BZ to add an option to configure the cache limit: https://bugzilla.redhat.com/show_bug.cgi?id=1906033

---

Tried running `ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false` again after deleting projects/pods/services, and ovn-controller RSS remained unchanged. I am running the above command on a node I am watching; let me know if I should be doing it somewhere else.

---

Hi Numan,

As I was doing some tests on a 100 node cluster with the RPMs you mentioned, I also took a look at this BZ. As you suggested, I configured `ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false` in only one ovn-controller pod and restarted it, then I generated some objects (500 services + 500 namespaces + 2500 pods):

    root@ip-172-31-71-55: /tmp # oc get node | grep -c worker
    100
    root@ip-172-31-71-55: /tmp # oc get ns | wc -l
    551
    root@ip-172-31-71-55: /tmp # oc get pod -A | wc -l
    3849
    root@ip-172-31-71-55: /tmp # oc get svc -A | wc -l
    566

This is the pod with lflow caching disabled:

    root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc rsh ovnkube-node-wvmg6
    Defaulting container name to ovn-controller.
    sh-4.4# ovs-vsctl get open . external_ids:ovn-enable-lflow-cache
    "false"

The ten ovnkube-node pods with the lowest memory usage:

    root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods -l app=ovnkube-node | sort -k3 -r | tail -10
    ovnkube-node-vzfx7   4m   755Mi
    ovnkube-node-t7bns   3m   755Mi
    ovnkube-node-q2fh9   4m   754Mi
    ovnkube-node-m88km   4m   754Mi
    ovnkube-node-qh9wl   4m   753Mi
    ovnkube-node-plzjj   3m   753Mi
    ovnkube-node-knj6x   3m   753Mi
    ovnkube-node-86s8j   4m   753Mi
    ovnkube-node-xqs5s   2m   751Mi
    ovnkube-node-wvmg6   2m   395Mi

Container breakdown for the cache-disabled pod:

    root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods ovnkube-node-wvmg6 --containers
    POD                  NAME              CPU(cores)   MEMORY(bytes)
    ovnkube-node-wvmg6   kube-rbac-proxy   0m           16Mi
    ovnkube-node-wvmg6   ovn-controller    0m           331Mi
    ovnkube-node-wvmg6   ovnkube-node      4m           47Mi

And, for comparison, the pod with the second-lowest usage (lflow caching still enabled):

    root@ip-172-31-71-55: ~/e2e-benchmarking/workloads/kube-burner # oc adm top pods ovnkube-node-xqs5s --containers
    POD                  NAME              CPU(cores)   MEMORY(bytes)
    ovnkube-node-xqs5s   kube-rbac-proxy   0m           15Mi
    ovnkube-node-xqs5s   ovn-controller    0m           683Mi
    ovnkube-node-xqs5s   ovnkube-node      4m           51Mi

I can confirm a memory usage reduction after disabling lflow caching. However, we still have to quantify side effects such as higher CPU usage and higher latency from ovn-controller.

---

Closing this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1906033, which adds support for limiting the lflow cache, will handle the memory usage issue. (A sketch of the cache-limit knobs is appended below.)
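
---

As a side note, a minimal collection sketch for the artifacts requested above. It assumes the OpenShift OVN-Kubernetes layout seen in this thread (namespace `openshift-ovn-kubernetes`, `ovnkube-node` pods with an `ovn-controller` container); the node name, SB DB pod, and on-disk DB path are placeholders that vary by version and must be verified first.

```sh
# Hypothetical collection sketch; pod/container names follow the OpenShift
# OVN-Kubernetes layout from this thread -- verify before use.
NS=openshift-ovn-kubernetes
NODE=ip-10-0-0-1.example   # assumption: the node whose ovn-controller is growing

POD=$(oc -n "$NS" get pod -l app=ovnkube-node \
      --field-selector spec.nodeName="$NODE" -o name)

# OpenFlow dump of the integration bridge (note: ovs-ofctl, not ovs-vsctl):
oc -n "$NS" exec -c ovn-controller "${POD#pod/}" -- \
   ovs-ofctl dump-flows br-int > br-int-flows.txt

# Southbound DB snapshot; <sbdb-pod> and the DB path are placeholders,
# since both differ across OVN versions and deployments.
oc -n "$NS" cp <sbdb-pod>:/etc/ovn/ovnsb_db.db ./ovnsb_db.db
```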
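A sketch of the before/after check implied by the comments above, using the pod name from the transcript; it only relies on standard `oc` and `ovs-vsctl` invocations.

```sh
# Before/after sketch for the cache-disable experiment; POD is the pod on
# the node being watched (name taken from the transcript above).
NS=openshift-ovn-kubernetes
POD=ovnkube-node-wvmg6

# Baseline RSS per container:
oc -n "$NS" adm top pod "$POD" --containers

# Disable the logical-flow cache inside the ovn-controller container:
oc -n "$NS" exec -c ovn-controller "$POD" -- \
   ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false

# Give ovn-controller a minute or two to recompute its logical flows,
# then re-check:
sleep 120
oc -n "$NS" adm top pod "$POD" --containers
```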
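Finally, a hypothetical sketch of the cache-limit knobs that the follow-up BZ tracks. The option names below (`ovn-limit-lflow-cache`, `ovn-memlimit-lflow-cache-kb`) and the `lflow-cache/show-stats` command come from later upstream OVN releases and may not exist in a given ovn2.13 build; treat them as assumptions to verify against the installed version.

```sh
# Hypothetical cache-limit configuration (option names from later upstream
# OVN; verify availability in the installed build before relying on them).

# Cap the number of cached logical-flow entries:
ovs-vsctl set open . external_ids:ovn-limit-lflow-cache=10000

# Cap the cache's memory footprint in kilobytes, where supported:
ovs-vsctl set open . external_ids:ovn-memlimit-lflow-cache-kb=524288

# Inspect cache hit/miss statistics, where supported:
ovn-appctl -t ovn-controller lflow-cache/show-stats
```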