Bug 2072710

Summary: Perfscale - pods time out waiting for OVS port binding (ovn-installed)
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Mohit Sheth <msheth>
Assignee: Surya Seetharaman <surya>
QA Contact: Mike Fiedler <mifiedle>
CC: anusaxen, cglombek, mifiedle, rsevilla, wking
Whiteboard: perfscale-ovn
Type: Bug
Last Closed: 2022-08-10 11:04:00 UTC

Description Mohit Sheth 2022-04-06 19:58:10 UTC
Description of problem:
While running the router test (1600 pods, each backed by a service and a route) on a 120-node bare-metal cluster, we see that pods are not able to come up and are stuck in the ContainerCreating state with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-burner-fa0990f2-6sssg_benchmark-operator_5d59d617-a691-41a1-bf0c-29dcc35a9de4_0(b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5): error adding pod benchmark-operator_kube-burner-fa0990f2-6sssg to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [benchmark-operator/kube-burner-fa0990f2-6sssg/5d59d617-a691-41a1-bf0c-29dcc35a9de4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[benchmark-operator/kube-burner-fa0990f2-6sssg b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5] [benchmark-operator/kube-burner-fa0990f2-6sssg b1f02d91f89801bf668a832ec5e008ee0e94f50924586753ee049cd60a8ffda5] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:1a:0c [10.131.26.12/23]

Upon looking at the SBDB logs we see:
05T19:29:38.402Z|39040|timeval|WARN|Unreasonably long 12975ms poll interval (12725ms user, 168ms system)
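
For context, the "ovn-installed" marker the CNI is waiting on is an external-id that ovn-controller sets on the pod's OVS interface once the port binding completes. A rough way to check it directly on the affected node (a sketch only; it assumes ovn-kubernetes' namespace_podname iface-id naming, and <interface-name> is a placeholder for whatever the first command returns):

  # find the OVS interface created for the stuck pod (assumed iface-id naming)
  ovs-vsctl --columns=name,external_ids find Interface external_ids:iface-id=benchmark-operator_kube-burner-fa0990f2-6sssg
  # ovn-installed should read "true" once ovn-controller has bound the port
  ovs-vsctl get Interface <interface-name> external_ids:ovn-installed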

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-03-27-140854

How reproducible:
Not sure

Steps to Reproduce:
1. Run a scale workload that creates pods, services, and routes at 20 QPS (a rough shell sketch follows).
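
Not the exact harness (the original run drove this with a kube-burner router workload, per the pod names above), but a minimal shell approximation of the object churn, assuming oc access and a placeholder namespace/image; the sleep paces creation at roughly 20 objects/sec:

  # hypothetical sketch: 1600 pods, each exposed by a service and a route
  oc new-project scale-test   # placeholder namespace
  for i in $(seq 1 1600); do
    oc run "burner-$i" --image=quay.io/openshift/origin-hello-openshift --port=8080   # pod
    oc expose pod "burner-$i" --port=8080                                             # service
    oc expose service "burner-$i"                                                     # route
    sleep 0.15   # 3 objects per iteration -> ~20 objects/sec, ignoring command latency
  done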

Actual results:
Pods are stuck in ContainerCreating with the above error.

Expected results:
All the pods should be up and running

Comment 5 Mike Fiedler 2022-05-09 20:32:15 UTC
@msheth Any chance your team can verify this on 4.11?

Comment 6 Mohit Sheth 2022-05-10 13:50:13 UTC
Hey, I have not come across this in our CI for a while.
Marking it verified, thank you.

Comment 8 Surya Seetharaman 2022-06-21 15:31:35 UTC
Note that the actual fix is via https://github.com/openshift/cluster-network-operator/pull/1494;
the first fix linked in the bug was wrong, my bad.

Comment 9 errata-xmlrpc 2022-08-10 11:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069