Bug 2072710

Summary: Perfscale - pods time out waiting for OVS port binding (ovn-installed)
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Mohit Sheth <msheth>
Assignee: Surya Seetharaman <surya>
QA Contact: Mike Fiedler <mifiedle>
CC: anusaxen, cglombek, mifiedle, rsevilla, wking
Whiteboard: perfscale-ovn
Type: Bug
Last Closed: 2022-08-10 11:04:00 UTC

Description Mohit Sheth 2022-04-06 19:58:10 UTC
Description of problem:
While running the router test (1600 pods, each backed by a service and a route) on a 120-node bare-metal cluster, we see that pods are not able to come up and are stuck in the ContainerCreating state with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-burner-fa0990f2-6sssg_benchmark-operator_5d59d617-a691-41a1-bf0c-29dcc35a9de4_0(b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5): error adding pod benchmark-operator_kube-burner-fa0990f2-6sssg to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [benchmark-operator/kube-burner-fa0990f2-6sssg/5d59d617-a691-41a1-bf0c-29dcc35a9de4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[benchmark-operator/kube-burner-fa0990f2-6sssg b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5] [benchmark-operator/kube-burner-fa0990f2-6sssg b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:1a:0c [10.131.26.12/23]

Upon looking at the SBDB logs we see:
05T19:29:38.402Z|39040|timeval|WARN|Unreasonably long 12975ms poll interval (12725ms user, 168ms system)
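
For context, the "ovn-installed" marker the CNI is waiting on is an external-id that ovn-controller sets on the pod's OVS interface once the port binding completes. A rough way to check it directly on the affected node (a sketch only; it assumes ovn-kubernetes' namespace_podname iface-id naming, and <interface-name> is a placeholder for whatever the first command returns):

  # find the OVS interface created for the stuck pod (assumed iface-id naming)
  ovs-vsctl --columns=name,external_ids find Interface external_ids:iface-id=benchmark-operator_kube-burner-fa0990f2-6sssg
  # ovn-installed should read "true" once ovn-controller has bound the port
  ovs-vsctl get Interface <interface-name> external_ids:ovn-installed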

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-03-27-140854

How reproducible:
Not sure

Steps to Reproduce:
1. Run a scale workload that creates pods, services, and routes at 20 QPS (a rough shell sketch follows).
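
Not the exact harness (the original run drove this with a kube-burner router workload, per the pod names above), but a minimal shell approximation of the object churn, assuming oc access and a placeholder namespace/image; the sleep paces creation at roughly 20 objects/sec:

  # hypothetical sketch: 1600 pods, each exposed by a service and a route
  oc new-project scale-test   # placeholder namespace
  for i in $(seq 1 1600); do
    oc run "burner-$i" --image=quay.io/openshift/origin-hello-openshift --port=8080   # pod
    oc expose pod "burner-$i" --port=8080                                             # service
    oc expose service "burner-$i"                                                     # route
    sleep 0.15   # 3 objects per iteration -> ~20 objects/sec, ignoring command latency
  done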

Actual results:
Pods are stuck in ContainerCreating with the above error.

Expected results:
All the pods should be up and running

Comment 5 Mike Fiedler 2022-05-09 20:32:15 UTC
@msheth Any chance your team can verify this on 4.11?

Comment 6 Mohit Sheth 2022-05-10 13:50:13 UTC
Hey, I have not come across this in our CI for a while.
Marking it verified, thank you.

Comment 8 Surya Seetharaman 2022-06-21 15:31:35 UTC
Note that the actual fix is via https://github.com/openshift/cluster-network-operator/pull/1494;
the first fix linked in the bug was wrong, my bad.

Comment 9 errata-xmlrpc 2022-08-10 11:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069