From the must-gather it's hard to tell exactly what happened, as the ovnkube-master logs have rotated, so I don't see the add for http-perf-122 -n http-scale-passthrough. On the nodes it looks like this pod has been rescheduled across different nodes.

When you repeat the test to trigger the problem, do you wait for ovnkube-master to finish deleting all the pods before you start the next test? For example, instead of just doing

  oc delete namespace http-scale-passthrough

and waiting for it to return, also run

  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow
  # wait for all the pods to finish getting deleted before starting the next test

To me this looks like the scale issues identified in https://bugzilla.redhat.com/show_bug.cgi?id=1959352. I can see it taking 5 seconds to annotate the pod in ovnkube-master for some other pods:

2021-09-17T15:53:47.025928199Z I0917 15:53:47.025886 1 pods.go:251] [http-scale-reencrypt/http-perf-303] addLogicalPort took 5.000645366s
2021-09-17T15:53:47.188129212Z I0917 15:53:47.188067 1 pods.go:251] [http-scale-reencrypt/http-perf-358] addLogicalPort took 5.085611168s
2021-09-17T15:53:47.243243616Z I0917 15:53:47.243197 1 pods.go:251] [http-scale-reencrypt/http-perf-281] addLogicalPort took 5.042022155s

It would also be helpful to get the must-gather while the affected pods still exist, so that we can see when they were scheduled on the node.
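For repeat runs, a minimal shell sketch of the suggested wait-before-next-test workflow might look like the following. The label selector (app=ovnkube-master) and the grep patterns are assumptions based on the log lines above, not a confirmed procedure; substitute the actual master pod name from your cluster.

  # Sketch only: assumes the ovnkube-master pods carry the label app=ovnkube-master.
  MASTER_POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
    -o jsonpath='{.items[0].metadata.name}')

  # Kick off the namespace deletion without blocking on it.
  oc delete namespace http-scale-passthrough --wait=false

  # Follow the master logs and watch for the namespace's pod deletions to
  # finish before starting the next test run (Ctrl-C once they have drained).
  oc logs -n openshift-ovn-kubernetes "$MASTER_POD" -c ovnkube-master --follow \
    | grep --line-buffered 'http-scale-passthrough'

  # Slow pod annotations can then be spotted by filtering for the
  # addLogicalPort timing lines shown above:
  oc logs -n openshift-ovn-kubernetes "$MASTER_POD" -c ovnkube-master \
    | grep 'addLogicalPort took'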
Thanks. Qiujie reported this issue; it looks like the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=2003558. @Tim, please see https://bugzilla.redhat.com/show_bug.cgi?id=2003558#c17: a live cluster kubeconfig and must-gather are available there for debugging, thanks.
And also this one: https://bugzilla.redhat.com/show_bug.cgi?id=1997205
Was there ever a resolution on this? We ran into this issue on a 4.8.24 cluster, and then on a cluster after upgrading from 4.9.17 to 4.9.18.
@anusaxen Added.