Description of problem:

On a 4.10 OpenShift cluster with ~250 worker nodes, we launched around 35-60 pods per node. All ovn-controller instances across the cluster seem to be using a reasonable amount of memory except on two nodes. Once the pods are launched they are not deleted, and the cluster is in a steady state. The two nodes are:

worker000-fc640
worker18

Over time their memory keeps increasing without bound and without any activity happening in the cluster. I suspect a memory leak here.

https://snapshot.raintank.io/dashboard/snapshot/7pRROXWYy0lXwsqar4ln1fwc5dzZEDb6

Looking at memory stats on one of the problem nodes we see:

[root@worker000-fc640 ~]# ovn-appctl -t ovn-controller memory/show
lflow-cache-entries-cache-expr:16030 lflow-cache-entries-cache-matches:9621 lflow-cache-size-KB:65688 ofctrl_desired_flow_usage-KB:37379 ofctrl_installed_flow_usage-KB:28470 ofctrl_sb_flow_ref_usage-KB:9361

Comparing it to a node which is NOT exhibiting this leak, I don't spot many differences.

Node without the leak:

sh-4.4# ovn-appctl -t ovn-controller memory/show
lflow-cache-entries-cache-expr:15804 lflow-cache-entries-cache-matches:9547 lflow-cache-size-KB:64684 ofctrl_desired_flow_usage-KB:37304 ofctrl_installed_flow_usage-KB:28425 ofctrl_sb_flow_ref_usage-KB:9326

Version-Release number of selected component (if applicable):

[kni@e16-h12-b02-fc640 web-burner]$ oc rsh -c ovn-controller ovnkube-node-qj8dq
sh-4.4# rpm -qa | grep ovn
ovn21.09-central-21.09.0-19.el8fdp.x86_64
ovn21.09-vtep-21.09.0-19.el8fdp.x86_64
ovn21.09-21.09.0-19.el8fdp.x86_64
ovn21.09-host-21.09.0-19.el8fdp.x86_64
sh-4.4#

How reproducible:

Only reproducible on some nodes.

Steps to Reproduce:
1. Deploy a large cluster
2. Launch a few pods
3. Remain at steady state and watch ovn-controller memory grow without bound on some nodes

Actual results:

ovn-controller memory on some nodes keeps growing without bound, indicating a memory leak.

Expected results:

Memory should stay within reasonable bounds and not grow at steady state.

Additional info:
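To track the growth over time, a rough sampling loop like the following can be used (this is only a sketch: the pod name is just the one quoted above, the interval is arbitrary, and it assumes pidof is available in the ovn-controller container):

# Sample ovn-controller cache stats and process RSS every 5 minutes on one node
while true; do
    date
    oc rsh -c ovn-controller ovnkube-node-qj8dq \
        ovn-appctl -t ovn-controller memory/show
    oc rsh -c ovn-controller ovnkube-node-qj8dq \
        sh -c 'grep VmRSS /proc/$(pidof ovn-controller)/status'
    sleep 300
done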
Placed the DBs and provided Numan access to them.
Created attachment 1830071 [details] DBs and conf.db of worker node
perf record output shows the pinctrl0 thread is hot on CPU:

# Event count (approx.): 6723064170
#
# Overhead  Command        Shared Object        Symbol
# ........  .............  ...................  ...........................................
#
     1.97%  ovn_pinctrl0   libpthread-2.28.so   [.] __pthread_rwlock_wrlock
     1.84%  ovn_pinctrl0   libpthread-2.28.so   [.] __pthread_rwlock_rdlock
     1.77%  ovn_pinctrl0   libpthread-2.28.so   [.] __pthread_rwlock_unlock
     1.75%  ovn_pinctrl0   [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
     1.62%  ovn_pinctrl0   libc-2.28.so         [.] _int_malloc
     1.53%  ovn_pinctrl0   [kernel.kallsyms]    [k] avc_has_perm
     1.16%  ovn_pinctrl0   [kernel.kallsyms]    [k] _raw_spin_lock
     1.09%  ovn_pinctrl0   libc-2.28.so         [.] malloc
     0.96%  ovn_pinctrl0   libc-2.28.so         [.] __memmove_avx_unaligned_erms
     0.96%  ovn_pinctrl0   libc-2.28.so         [.] _int_free
     0.80%  ovn_pinctrl0   ovn-controller       [.] 0x00000000000b87c1
     0.75%  ovn_pinctrl0   [kernel.kallsyms]    [k] copy_user_generic_unrolled
     0.71%  ovn_pinctrl0   libc-2.28.so         [.] __memset_avx2_unaligned_erms
     0.69%  ovn_pinctrl0   libpthread-2.28.so   [.] __pthread_enable_asynccancel
     0.69%  ovn_pinctrl0   libc-2.28.so         [.] __memcmp_avx2_movbe
     0.67%  ovn_pinctrl0   ovn-controller       [.] 0x000000000011c8fd
     0.65%  ovn_pinctrl0   [kernel.kallsyms]    [k] find_vma
     0.61%  ovn_pinctrl0   ovn-controller       [.] 0x00000000000469bb
     0.61%  ovn_pinctrl0   [kernel.kallsyms]    [k] skb_set_owner_w
     0.60%  ovn_pinctrl0   ovn-controller       [.] 0x00000000000bdfbb
     0.58%  ovn_pinctrl0   ovn-controller       [.] 0x00000000000b87a6
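For reference, a profile of this kind can be captured roughly as follows on the affected node (the sampling duration and options are assumptions, not taken from this report):

# Attach to the running ovn-controller process and record call graphs for 30s
perf record -g -p $(pidof ovn-controller) -- sleep 30
perf report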
As discussed above, the solution is to use OVN's meter functionality to rate-limit packet-ins to ovn-controller, which would be configured by ovn-kubernetes. This should be done for BFD and chk-pkt-len at least.
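For illustration only, the kind of meter involved could be created with ovn-nbctl along these lines. The meter names, rates, and burst values below are made-up placeholders, and how the meters are actually associated with the BFD and chk-pkt-len packet-in paths is handled by ovn-kubernetes (see the PR below), not by hand:

# Hypothetical example: drop BFD packet-ins beyond 200 packets/s (burst 400)
ovn-nbctl meter-add meter-bfd drop 200 pktps 400
# Hypothetical example: same idea for the check-packet-length packet-ins
ovn-nbctl meter-add meter-chk-pkt-len drop 200 pktps 400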
Opened Upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/2752
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069