Bug 2005598

Summary: [4.9] Failed to configure pod interface: timed out waiting for OVS port binding
Product: OpenShift Container Platform
Reporter: Qiujie Li <qili>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
CC: bbennett, mifiedle, mkennell, pkumar, shishika, vpickard, zzhao
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-01-10 16:50:03 UTC
Type: Bug
Regression: ---

Comment 2 Tim Rozet 2021-09-20 19:23:44 UTC
From the must-gather it's hard to tell exactly what happened, as the ovnkube-master logs have rotated, so I don't see the add for http-perf-122 -n http-scale-passthrough. On the nodes it looks like this pod has landed on different nodes. When you repeat the test to trigger the problem, do you wait for ovnkube-master to finish deleting all the pods before you start the next test? For example, instead of just doing oc delete namespace http-scale-passthrough and waiting for it to return, also run oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow and wait for all the pods to finish getting deleted before starting the next test.
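
To make that concrete, a rough sketch of the sequence I mean (pod selection and the grep pattern are only illustrative; adjust them for your cluster):

  oc delete namespace http-scale-passthrough
  # find an ovnkube-master pod (pick the current leader if you know which one it is)
  oc get pods -n openshift-ovn-kubernetes | grep ovnkube-master
  # follow its logs and watch the delete events for the test namespace
  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow | grep http-scale-passthrough
  # only start the next test iteration once no more deletes show up for the namespace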

To me this looks like the scale issues identified in https://bugzilla.redhat.com/show_bug.cgi?id=1959352

I can see it taking around 5 seconds to annotate the pod in ovnkube-master for some other pods:

2021-09-17T15:53:47.025928199Z I0917 15:53:47.025886       1 pods.go:251] [http-scale-reencrypt/http-perf-303] addLogicalPort took 5.000645366s
2021-09-17T15:53:47.188129212Z I0917 15:53:47.188067       1 pods.go:251] [http-scale-reencrypt/http-perf-358] addLogicalPort took 5.085611168s
2021-09-17T15:53:47.243243616Z I0917 15:53:47.243197       1 pods.go:251] [http-scale-reencrypt/http-perf-281] addLogicalPort took 5.042022155s
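
For reference, lines like these can be pulled out of the master logs with something along the lines of the following (the pod name is a placeholder):

  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master | grep "addLogicalPort took"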


It would also be helpful to get the must-gather while the affected pods still exist, so that we can see when they were scheduled on the node.
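
A minimal sketch of what to capture while the affected pods are still around (namespace and pod name here are just the ones from this report):

  # where the affected pods were scheduled, and their scheduling events
  oc get pods -n http-scale-passthrough -o wide
  oc describe pod -n http-scale-passthrough http-perf-122
  # standard must-gather, taken while the pods still exist
  oc adm must-gather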

Comment 3 zhaozhanqi 2021-09-22 07:48:43 UTC
Thanks to Qiujie for reporting this issue. This looks like the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=2003558

@Tim Please see https://bugzilla.redhat.com/show_bug.cgi?id=2003558#c17. There is a live cluster kubeconfig and a must-gather available for debugging, thanks.

Comment 4 zhaozhanqi 2021-09-22 07:51:19 UTC
And also this one: https://bugzilla.redhat.com/show_bug.cgi?id=1997205

Comment 19 bowredhat 2022-02-14 22:09:33 UTC
Was there ever a resolution on this? We ran into this issue on a 4.8.24 cluster, and then on a cluster after an upgrade from 4.9.17 to 4.9.18.

Comment 20 Anurag saxena 2022-11-01 13:24:29 UTC
@

Comment 22 Qiujie Li 2022-11-02 07:34:20 UTC
@anusaxen Added.