2070674 – [GCP] Routes get timed out and nonresponsive after creating 2K service routes

Summary: [GCP] Routes get timed out and nonresponsive after creating 2K service routes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	All
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Jaime Caamaño Ruiz
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:	ovn-perfscale
Duplicates (1):	2078758
Depends On:
Blocks:	2104454

Reported:	2022-03-31 16:22 UTC by Murali Krishnasamy
Modified:	2022-08-10 11:03 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Upon service configuration changes, OVN-Kubernetes spends excessive time checking whether services are already configured in OVN or require updates. Consequence: When there are many services in the system, service configuration changes take noticeably longer to come into effect. Fix: OVN-Kubernetes was optimized so that the time spent checking whether services already have the desired configuration state in OVN is greatly reduced. Result: Reduced latency before service configuration changes come into effect.
Clone Of:
Clones:	2104454
Environment:
Last Closed:	2022-08-10 11:03:06 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovn-kubernetes pull 1110	0	None	Merged	Bug 2070674: improve performance of service sync	2022-06-23 10:59:55 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:03:20 UTC

Description Murali Krishnasamy 2022-03-31 16:22:42 UTC

Description of problem:
When running a benchmark test that creates 2k app pods (nginx) and routes on a GCP OpenShift environment with OVNKubernetes CNO, cluster routes such as console and prometheus become nonresponsive for a few minutes, and the new app routes take longer (at least 13 minutes) to become accessible, so the test fails consistently. This behavior was observed frequently, along with a spike in ovnkube-master CPU utilization to 6 cores.

Version-Release number of selected component (if applicable):
OCP 4.10.5 GA

How reproducible:
Always on GCP

Steps to Reproduce:
1. Deploy a healthy cluster with at least 24 worker nodes (8 CPU, 32G mem) on GCP using OVNKubernetes CNO
2. Run the kube-burner workload to create 2k pods and routes across multiple namespaces
3. Watch the console or prometheus routes time out during the workload, and note that any new application routes take longer to become reachable
4. ovnkube-master CPU utilization increases to 6 cores
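
To put a number on step 3, the time until a route first answers can be measured with a small polling helper. The following is a minimal sketch and not part of the reproduction script: the function names (`time_until_reachable`, `http_probe`), the 900-second timeout, and the 5-second interval are all illustrative assumptions, and the route URL has to be supplied by whoever runs it.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def time_until_reachable(
    probe: Callable[[], bool],
    timeout_s: float = 900.0,   # assumed budget, not an SLA from this bug
    interval_s: float = 5.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds at the first success, or None if the
    timeout passes without a successful probe.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def http_probe(url: str, timeout_s: float = 3.0) -> Callable[[], bool]:
    """Build a probe that is True when url answers with a 2xx/3xx status."""

    def _probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return 200 <= resp.status < 400
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, timeout, or HTTP error status.
            return False

    return _probe
```

Calling `time_until_reachable(http_probe(route_url))` with the URL of a newly created route would return how long that route took to respond, or `None` if it never did within the timeout. The clock and sleep functions are injectable so the helper can be exercised without a live cluster.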

Actual results:
The dataplane test fails during the connectivity check because the routes are unreachable after kube-burner finishes creating them. It turns out that load_balancers are still being added to NBDB about 13 minutes after the services were created. addlogicalports for pods also exceed 1s.
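
This bug's Doc Text attributes the delay to time spent checking whether services already have the desired configuration in OVN. The sketch below only illustrates that kind of check in general terms: it indexes the existing records once and then compares each desired record with a single lookup. It is not the change merged in the linked pull request, and the data model and names (`LoadBalancer`, `plan_updates`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LoadBalancer:
    """Hypothetical, simplified stand-in for a load balancer record."""

    name: str
    # Pairs of (vip, backends); tuples keep the record hashable and comparable.
    vips: Tuple[Tuple[str, Tuple[str, ...]], ...]


def plan_updates(
    desired: Iterable[LoadBalancer],
    existing: Iterable[LoadBalancer],
) -> Tuple[List[LoadBalancer], List[str]]:
    """Return (to_write, to_delete).

    to_write holds desired records that are missing or differ from what
    exists; to_delete holds names that exist but are no longer desired.
    """
    # Build the index once so each desired record costs one dict lookup
    # instead of a scan over every existing record.
    existing_by_name: Dict[str, LoadBalancer] = {lb.name: lb for lb in existing}

    to_write: List[LoadBalancer] = []
    seen = set()
    for lb in desired:
        seen.add(lb.name)
        if existing_by_name.get(lb.name) != lb:
            to_write.append(lb)

    to_delete = sorted(name for name in existing_by_name if name not in seen)
    return to_write, to_delete
```

With this shape, records that already match the desired state produce no write at all, so an update to one service does not imply rewriting the others.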

Expected results:
New routes should become available within the SLA, and this workload should not affect other cluster routes.

Additional info:

Comment 2 Murali Krishnasamy 2022-03-31 16:25:36 UTC

Script used to reproduce: https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/router-perf-v2/ingress-performance.sh

Comment 12 Jaime Caamaño Ruiz 2022-05-27 08:22:43 UTC

*** Bug 2078758 has been marked as a duplicate of this bug. ***

Comment 15 Mike Fiedler 2022-06-02 19:56:09 UTC

*** Bug 2078758 has been marked as a duplicate of this bug. ***

Comment 20 Mike Fiedler 2022-06-17 17:08:10 UTC

Verified on 4.11.0-0.nightly-2022-06-15-222801

- ran workload described here - 2000 pods/routes of mixed termination types
- test ran successfully - no timeout failures
- console and other routed applications remained responsive throughout the test.

Comment 22 errata-xmlrpc 2022-08-10 11:03:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
