Description of problem: Under heavy load (around 10k pods, 250 nodes, and 3k services), a number of pods across the nodes never reach the Running state. The ovn-kubernetes CNI attaches the port, but ovn-controller never claims the port or binds it to the chassis. Forcing a recompute on ovn-controller doesn't help. I was also unable to test whether restarting ovn-controller would fix the issue, because the OVN SBDB was at such high CPU usage that the new ovn-controller instance could not connect to it.
To address the high memory usage of ovn-controller (if it is a concern), we can disable lflow caching. ovn-controller caches the lflow-to-oflow translations so that it doesn't have to reparse the match expression of a logical flow again and again. As the number of lflows increases, memory consumption goes up too. At this point there is no support for limiting the size of the cache. To disable caching, run the following on all the nodes where ovn-controller runs: ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false
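Applying that setting to every node could be scripted roughly as follows. This is only a sketch: the node names, the NODES and APPLY variables, and the use of ssh to reach the nodes are assumptions for illustration and are not taken from this report. Only the ovs-vsctl command itself comes from the comment above. By default the script is a dry run that prints what it would execute.

```shell
#!/bin/sh
# Hypothetical node list; replace with the hosts that run ovn-controller.
NODES="${NODES:-node-1 node-2}"

# Print the command that disables the lflow cache in the local
# Open_vSwitch record (the command quoted in the comment above).
lflow_cache_cmd() {
    printf '%s\n' "ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false"
}

# Dry run by default: print the command for each node.
# With APPLY=1 the command is run over ssh instead (assumes ssh access).
disable_lflow_cache() {
    for node in $NODES; do
        if [ "${APPLY:-0}" = "1" ]; then
            ssh "$node" "$(lflow_cache_cmd)"
        else
            printf '%s: %s\n' "$node" "$(lflow_cache_cmd)"
        fi
    done
}

disable_lflow_cache
```

Running it with APPLY=1 and a real NODES list would apply the change. In an OpenShift deployment the nodes may not be reachable this way, so the transport would need to be adapted.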
I think there was a miscommunication. The high RAM usage was for the SBDB, not ovn-controller, so disabling lflow caching won't really help here. Based on my last conversation with Numan, he thinks the issue might be that the ct zones are constantly swapping. We may need to reproduce this bug again in the scale lab and debug ovn-controller further.
Numan, what were the next steps here?
The patch to fix this issue has been submitted for review: https://patchwork.ozlabs.org/project/ovn/patch/20210213073959.1653844-1-numans@ovn.org/ Thanks
Hi Tim, per comment 8, the issue should be fixed in the latest build, 20.12.0-20: http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/20.el8fdp/ As there is no reproducer with an OVN setup, could you help test the build in your environment? Thanks
Since there is no simple reproducer so far, and it is hard for us to set up a large-scale deployment in the test environment, we only ran regression tests and set SanityOnly.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0839
*** Bug 1963064 has been marked as a duplicate of this bug. ***