Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1960042

Summary: [scale] northd at 100% and taking > 30sec to process changes
Product: OpenShift Container Platform
Reporter: Joe Talerico <jtaleric>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: aconstan, astoycos, dcbw, dceara, vpickard
Version: 4.8
Target Milestone: ---
Target Release: 4.9.0
Hardware: All
OS: All
Whiteboard: perfscale-ovn
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1962338 (view as bug list)
Environment:
Last Closed: 2021-10-05 17:25:28 UTC
Type: Bug
Bug Depends On: 1962338, 1962818, 1962833    
Bug Blocks:    
Attachments:
- must-gather network
- OVN NBDB where we observe long (30s) poll intervals

Description Joe Talerico 2021-05-12 21:02:19 UTC
Created attachment 1782550 [details]
must-gather network

Description of problem:
OCP 4.8 with OVNKubernetes as the SDN, scaled to 300 nodes. We are seeing
ovn-northd consume an entire core:

      1 root      20   0 1521272   1.4g   7420 R  98.7   1.1 370:31.04 ovn-northd                                                                                                                                                                             

TimR also noted that it is taking 30+ seconds for northd to process changes.


Version-Release number of selected component (if applicable):
4.8

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP4.8
2. Scale to 300 nodes
3. Run clusterdensity 2k

Actual results:

Comment 3 Tim Rozet 2021-05-19 13:25:28 UTC
Created attachment 1784815 [details]
OVN NBDB where we observe long (30s) poll intervals

Comment 4 Dumitru Ceara 2021-05-19 15:33:03 UTC
After initial analysis, a significant portion of the poll interval is spent
building router load balancer logical flows.

1. get_router_load_balancer_ips() is called far too often; we can
precompute those IPs once per iteration instead of calling it for every
logical router port.  An initial test shows that this change reduces
the loop iteration time from ~22s to ~19s.

2. ovn_lflow_add_at() always builds a logical flow record even though
it will be discarded if the logical flow is aggregated into a datapath
group.  We can instead delay the creation of new flow records until
they are really necessary.  This saves a decent amount of allocations
and memory copying.  An initial test shows that this change reduces the
loop iteration time further to ~15s.

3. We can try to change the way load balancer flows are built.
Currently for X routers (or switches) with Y load balancers applied to
them we do:
- for every router:
  - for every load balancer:
    - parse and generate lots of common lb stuff (e.g., VIPs, backends)
    - generate one logical flow per VIP.

I think we can save quite a lot of CPU by changing this to:
- for every load balancer:
  - parse and generate lots of common lb stuff (e.g., VIPs, backends)
  - for every router:
    - generate one logical flow per VIP.

4. I also enabled northd parallelization, which further reduced the
loop iteration times to ~8s, at the cost of northd consuming up to
900% CPU.  However, northd parallelization is a new feature that needs
further testing and needs to be enabled in CI.

I'll open OVN BZs for all items above so we can track the work
independently.

Comment 5 Dan Williams 2021-10-05 17:25:28 UTC
All linked bugs have been addressed in ovn21.09-21.09.0-9.el8fdp which is part of the 4.9.0 release, via https://bugzilla.redhat.com/show_bug.cgi?id=1999852.

Going to dupe this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1999852

*** This bug has been marked as a duplicate of bug 1999852 ***