Bug 1986216 - [scale] SNO: Slow Pod recovery due to "timed out waiting for OVS port binding"
Summary: [scale] SNO: Slow Pod recovery due to "timed out waiting for OVS port binding"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Andrew Stoycos
QA Contact: yliu1
URL:
Whiteboard:
Depends On: 2002425 2004337 2042494 2107645
Blocks:
 
Reported: 2021-07-27 00:06 UTC by browsell
Modified: 2022-07-15 15:09 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:04:47 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift ovn-kubernetes pull 914 (Merged): Bug 2042494: [4.9] Set the OVS port as transient - last updated 2022-02-09 15:15:31 UTC
Red Hat Issue Tracker FD-1656 - last updated 2021-11-18 18:52:45 UTC
Red Hat Product Errata RHSA-2022:0056 - last updated 2022-03-10 16:05:06 UTC

Comment 25 Andrew Stoycos 2022-02-15 17:07:22 UTC
@amorenoz has done some really great work on this, so I'll try to summarize his most recent findings in terms of both OpenShift and OVS.

For OpenShift 4.10, we can no longer reproduce this issue.

For OpenShift 4.9, we're hoping https://github.com/openshift/ovn-kubernetes/pull/914 resolves the issue, based on these findings from Adrian's testing.
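
For context on the fix itself: per its title, the PR creates the OVS port as transient so that stale ports do not pile up across OVS restarts. A minimal, hedged sketch of what that could look like is below; I'm assuming it boils down to setting other_config:transient=true on the Port record, and the bridge/port names are placeholders, so the PR's exact invocation may differ:

```
import subprocess

# Hedged illustration, not the PR's code: assumes "transient" means setting
# other_config:transient=true on the Port record so OVS startup tooling can
# clean up stale ports; bridge and port names are placeholders.
def add_transient_port(bridge: str, port: str) -> None:
    subprocess.run(
        ["ovs-vsctl", "--may-exist", "add-port", bridge, port,
         "--", "set", "port", port, "other_config:transient=true"],
        check=True,
    )

add_transient_port("br-int", "veth-example")
```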

Taken straight from Adrian regarding his test in OVS to reproduce the issue there:

```
The test is:
   - Add 254 interfaces
   - For each interface, add about 2k flows (maybe this is too much?). If the insertion fails (ovs-ofproto returns != 0), wait 0.5 seconds and retry.

Round 1: Just run the test normally with a clean ovs:
Mean: 0.013 s
Max: 0.216 s

Round 2: Do not clean the previous 254 interfaces and repeat the test:
Mean: 12.0 s
Max: 42s

So, the main culprit right now seems to be the number of stale ports. They are all re-evaluated on each run of the bridge loop. So, vswitchd reads around 10 new Interfaces, loops over the 254 stale ones, each one generating a context switch due to the ioctl, which makes it process ofp-flows without the benefit of bulk processing...
```  
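
For reference, a rough sketch of how a reproduction along those lines could be scripted is below; the bridge name, flow contents, use of internal ports, and bulk ovs-ofctl add-flows are my assumptions, not Adrian's actual script:

```
#!/usr/bin/env python3
# Rough reproduction sketch of the test described above; everything specific
# (bridge name, flow match fields, bulk insertion) is assumed, not taken from
# Adrian's script.
import os
import statistics
import subprocess
import tempfile
import time

BRIDGE = "br-test"        # assumed pre-existing test bridge
NUM_PORTS = 254
FLOWS_PER_PORT = 2000

def sh(*cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode

def add_port_and_flows(index):
    """Add one internal port plus ~2k flows; retry on failure; return seconds taken."""
    port = f"p{index}"
    sh("ovs-vsctl", "--may-exist", "add-port", BRIDGE, port,
       "--", "set", "Interface", port, "type=internal")
    # Write the flows to a file and insert them with a single ovs-ofctl call.
    with tempfile.NamedTemporaryFile("w", suffix=".flows", delete=False) as f:
        for n in range(FLOWS_PER_PORT):
            f.write(f"table=0,priority=10,in_port={port},reg0={n},actions=drop\n")
        flow_file = f.name
    start = time.monotonic()
    while sh("ovs-ofctl", "add-flows", BRIDGE, flow_file) != 0:
        time.sleep(0.5)  # back off and retry, as in the described test
    elapsed = time.monotonic() - start
    os.unlink(flow_file)
    return elapsed

if __name__ == "__main__":
    durations = [add_port_and_flows(i) for i in range(1, NUM_PORTS + 1)]
    print(f"mean={statistics.mean(durations):.3f}s max={max(durations):.3f}s")
```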

Also of note, the above delay only occurs on the RT kernel, so I think we should have some other way to track the continued OVS work exploring other issues caused by the kernel datapath + RT + CPU pinning and their effects on OVS.

I will re-assign this bug to ovn-kubernetes / myself and push it to ON_QA so that QE can verify the issue is no longer reproducible on both SNO 4.9 and SNO 4.10.

Thanks, 
Andrew

Comment 31 errata-xmlrpc 2022-03-10 16:04:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

