Bug 1903210 - [OVN SCALE] ovn-controller stops listening/handling port claim events
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Numan Siddique
QA Contact: ying xu
URL:
Whiteboard:
Depends On:
Blocks: 1903265 1934520 1963064
 
Reported: 2020-12-01 16:18 UTC by Tim Rozet
Modified: 2021-12-15 08:11 UTC (History)

Fixed In Version: ovn2.13-20.12.0-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 14:34:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker FD-966 | 0 | None | Closed | CVE-2020-25710, CVE-2020-25709 | 2022-04-12 13:44:48 UTC
Red Hat Product Errata RHBA-2021:0839 | 0 | None | None | None | 2021-03-15 14:34:59 UTC

Description Tim Rozet 2020-12-01 16:18:29 UTC
Description of problem:
Under heavy load (around 10k pods, 250 nodes, and 3k services), a number of pods across the nodes never reach the Running state. We can see that the ovn-kubernetes CNI attaches the port, but ovn-controller never claims the port or binds it to the chassis. Forcing a recompute on ovn-controller doesn't help. I was also unable to test whether restarting ovn-controller would fix the issue, because the OVN SBDB was at such high CPU that the new ovn-controller instance could not connect to it.

Comment 2 Numan Siddique 2020-12-02 05:29:03 UTC
To address the high memory usage of ovn-controller (if it is a concern), we can disable lflow caching.
ovn-controller caches the lflow-to-oflow translations so that it doesn't have to reparse the match expression of a logical flow again and again.

As the number of lflows increases, the memory consumption goes up too. At this point we don't have support for limiting the cache.

To disable the caching, we need to run the following on all the nodes where ovn-controller runs:

    ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=false
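
A minimal sketch of applying the setting above across several nodes. The `ovs-vsctl` command is the one quoted in this comment. Everything else is an assumption for illustration: the helper names `lflow_cache_cmd` and `set_lflow_cache`, the `DRY_RUN` variable, the node names, and the use of `ssh` to reach each node. Adapt the transport to however the nodes are actually managed.

```shell
#!/bin/sh
# Sketch only. Helper names, DRY_RUN, and ssh access are hypothetical;
# only the ovs-vsctl command itself comes from the comment above.

# Print the ovs-vsctl command that sets the lflow cache option.
# $1 is "true" or "false".
lflow_cache_cmd() {
    printf 'ovs-vsctl set open . external_ids:ovn-enable-lflow-cache=%s\n' "$1"
}

# Apply the setting to each node given after the first argument.
# With DRY_RUN=1 (the default here) the commands are only printed.
set_lflow_cache() {
    value=$1
    shift
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $node $(lflow_cache_cmd "$value")"
        else
            ssh "$node" "$(lflow_cache_cmd "$value")"
        fi
    done
}
```

For example, `set_lflow_cache false node-1 node-2` prints the two commands it would run, and `DRY_RUN=0 set_lflow_cache false node-1 node-2` executes them. Passing `true` instead of `false` sets the option back.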

Comment 3 Tim Rozet 2020-12-03 16:02:32 UTC
I think there was a miscommunication. The high RAM was for the SBDB, not ovn-controller, so disabling lflow caching won't really help here. Based on my last conversation with Numan, he thinks the issue might be that the ct zones are constantly swapping. We may need to reproduce this bug again in the scale lab and debug ovn-controller further.

Comment 4 Dan Williams 2021-01-27 21:07:35 UTC
Numan, what were the next steps here?

Comment 8 Numan Siddique 2021-02-13 07:43:12 UTC
The patch to fix this issue has been submitted for review: https://patchwork.ozlabs.org/project/ovn/patch/20210213073959.1653844-1-numans@ovn.org/

Thanks

Comment 10 Jianlin Shi 2021-02-19 06:14:31 UTC
Hi Tim,

Per comment 8, the issue is supposed to be fixed in the latest build, 20.12.0-20: http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn2.13/20.12.0/20.el8fdp/

As there is no reproducer with a plain OVN setup, could you help test the build in your environment? Thanks.

Comment 13 ying xu 2021-03-10 08:20:52 UTC
Since there is still no simple reproducer and it is hard for us to set up a large-scale deployment in the test environment, we only ran regression tests and set this bug to SanityOnly.

Comment 17 errata-xmlrpc 2021-03-15 14:34:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0839

Comment 19 ffernand 2021-05-25 14:04:13 UTC
*** Bug 1963064 has been marked as a duplicate of this bug. ***

