Bug 1960042

Summary:

[scale] northd at 100% and taking > 30sec to process changes

Product:

OpenShift Container Platform

Reporter:

Joe Talerico <jtaleric>

Component:

Networking

Assignee:

Tim Rozet <trozet>

Networking sub component:

ovn-kubernetes

QA Contact:

Anurag saxena <anusaxen>

Status:

CLOSED DUPLICATE

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

aconstan, astoycos, dcbw, dceara, vpickard

Version:

4.8

Target Milestone:

---

Target Release:

4.9.0

Hardware:

All

OS:

All

Whiteboard:

perfscale-ovn

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1962338

Environment:

Last Closed:

2021-10-05 17:25:28 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1962338, 1962818, 1962833

Bug Blocks:

Attachments:

Description	Flags
must-gather network	none
OVN NBDB where we observe long (30s) poll intervals	none

Description Joe Talerico 2021-05-12 21:02:19 UTC

Created attachment 1782550
must-gather network

Description of problem:
OCP 4.8 with OVNKubernetes as the SDN. With the cluster scaled to 300 nodes, ovn-northd consumes an entire core:

      1 root      20   0 1521272   1.4g   7420 R  98.7   1.1 370:31.04 ovn-northd

TimR also noted that it is taking 30+ seconds for northd to process changes.


Version-Release number of selected component (if applicable):
4.8

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP4.8
2. Scale to 300 nodes
3. Run clusterdensity 2k

Actual results:

Comment 3 Tim Rozet 2021-05-19 13:25:28 UTC

Created attachment 1784815
OVN NBDB where we observe long (30s) poll intervals

Comment 4 Dumitru Ceara 2021-05-19 15:33:03 UTC

After initial analysis, a significant portion of the poll interval is
spent building router load balancer logical flows.

1. get_router_load_balancer_ips() is called far too often; we can
precompute those IPs once per iteration instead of calling it for every
logical router port.  An initial test shows that this change reduces
the loop iteration time from ~22s to ~19s.

2. ovn_lflow_add_at() always builds a logical flow record, even though
the record is discarded if the logical flow is aggregated on a datapath
group.  We can instead delay the creation of new flow records until they
are actually needed.  This saves a significant number of allocations and
memory copies.  An initial test shows that this change reduces the loop
iteration time further to ~15s.

3. We can try to change the way load balancer flows are built.
Currently for X routers (or switches) with Y load balancers applied to
them we do:
- for every router:
  - for every load balancer:
    - parse and generate lots of common lb stuff (e.g., VIPs, backends)
    - generate one logical flow per VIP.

I think we can save quite a lot of CPU by changing this to:
- for every load balancer:
  - parse and generate lots of common lb stuff (e.g., VIPs, backends)
  - for every router:
    - generate one logical flow per VIP.

4. I also enabled northd parallelization, which further reduced the
loop iteration time to ~8s at the cost of northd consuming up to
900% CPU.  However, northd parallelization is a new feature that needs
further testing and needs to be enabled in CI.
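
The loop reordering proposed in item 3 can be sketched roughly as
follows.  This is an illustrative Python model, not the northd C code:
the LoadBalancer and Router classes, parse_lb() and the STATS counter
are hypothetical names made up for this example.  It only shows that
swapping the loop nesting yields the same per-VIP flows while running
the common parsing step once per load balancer instead of once per
(router, load balancer) pair.

```python
from dataclasses import dataclass, field


@dataclass
class LoadBalancer:
    name: str
    # Hypothetical raw form: "vip" -> "backend1,backend2,..."
    vips: dict


@dataclass
class Router:
    name: str
    lb_names: list = field(default_factory=list)


STATS = {"parse_calls": 0}  # counts how often the common parsing runs


def parse_lb(lb):
    """Stand-in for the common per-load-balancer work (VIPs, backends)."""
    STATS["parse_calls"] += 1
    return {vip: backends.split(",") for vip, backends in lb.vips.items()}


def flows_router_major(routers, lbs):
    """Current order: every router re-parses every load balancer it uses."""
    flows = []
    for r in routers:
        for name in r.lb_names:
            for vip, backends in parse_lb(lbs[name]).items():
                flows.append((r.name, vip, tuple(backends)))
    return flows


def flows_lb_major(routers, lbs):
    """Proposed order: parse each load balancer once, then visit its routers."""
    routers_by_lb = {}
    for r in routers:
        for name in r.lb_names:
            routers_by_lb.setdefault(name, []).append(r)
    flows = []
    for name, attached in routers_by_lb.items():
        parsed = parse_lb(lbs[name])  # done once per load balancer
        for r in attached:
            for vip, backends in parsed.items():
                flows.append((r.name, vip, tuple(backends)))
    return flows
```

With X routers that all reference the same Y load balancers, the first
function runs parse_lb() X*Y times and the second runs it Y times; the
number of generated flows is the same in both cases.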

I'll open OVN BZs for all items above so we can track the work
independently.
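
For reference, the precomputation idea from item 1 looks roughly like
this.  Again a simplified Python sketch under assumed data structures:
routers are plain dicts, get_lb_ips() is only a stand-in for
get_router_load_balancer_ips(), and the CALLS counter exists purely to
make the difference visible.

```python
CALLS = {"get_lb_ips": 0}  # how many times the expensive helper ran


def get_lb_ips(router):
    """Stand-in for get_router_load_balancer_ips(): gather the router's VIPs."""
    CALLS["get_lb_ips"] += 1
    return sorted({vip for lb in router["lbs"] for vip in lb})


def per_port(routers):
    """Before: the helper is called for every logical router port."""
    return {(r["name"], p): get_lb_ips(r) for r in routers for p in r["ports"]}


def precomputed(routers):
    """After: compute each router's IPs once per iteration and reuse them."""
    ips = {r["name"]: get_lb_ips(r) for r in routers}
    return {(r["name"], p): ips[r["name"]] for r in routers for p in r["ports"]}
```

Both functions return the same mapping; the second calls the helper
once per router rather than once per router port.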

Comment 5 Dan Williams 2021-10-05 17:25:28 UTC

All linked bugs have been addressed in ovn21.09-21.09.0-9.el8fdp which is part of the 4.9.0 release, via https://bugzilla.redhat.com/show_bug.cgi?id=1999852.

Going to dupe this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1999852

*** This bug has been marked as a duplicate of bug 1999852 ***