Bug 2005598

Summary: [4.9] Failed to configure pod interface: timed out waiting for OVS port binding
Product: OpenShift Container Platform
Reporter: Qiujie Li <qili>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
CC: bbennett, mifiedle, mkennell, pkumar, shishika, vpickard, zzhao
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-01-10 16:50:03 UTC
Type: Bug
Regression: ---

Comment 2 Tim Rozet 2021-09-20 19:23:44 UTC
From the must-gather it's hard to tell exactly what happened, as the ovnkube-master logs have rotated, so I don't see the add for http-perf-122 -n http-scale-passthrough. On the nodes it looks like this pod has landed on different nodes. When you repeat the test to trigger the problem, do you wait for ovnkube-master to finish deleting all the pods before you start the next test? For example, instead of just doing oc delete namespace http-scale-passthrough and waiting for it to return, also run oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow and wait for all the pods to finish getting deleted before starting the next test.
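
To make that concrete, a rough sketch of the sequence I mean (pod selection and the grep pattern are only illustrative; adjust them for your cluster):

  oc delete namespace http-scale-passthrough
  # find an ovnkube-master pod (pick the current leader if you know which one it is)
  oc get pods -n openshift-ovn-kubernetes | grep ovnkube-master
  # follow its logs and watch the delete events for the test namespace
  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow | grep http-scale-passthrough
  # only start the next test iteration once no more deletes show up for the namespace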

To me this looks like the scale issues identified in https://bugzilla.redhat.com/show_bug.cgi?id=1959352

I can see it taking around 5 seconds to annotate the pod in ovnkube-master for some other pods:

2021-09-17T15:53:47.025928199Z I0917 15:53:47.025886       1 pods.go:251] [http-scale-reencrypt/http-perf-303] addLogicalPort took 5.000645366s
2021-09-17T15:53:47.188129212Z I0917 15:53:47.188067       1 pods.go:251] [http-scale-reencrypt/http-perf-358] addLogicalPort took 5.085611168s
2021-09-17T15:53:47.243243616Z I0917 15:53:47.243197       1 pods.go:251] [http-scale-reencrypt/http-perf-281] addLogicalPort took 5.042022155s
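
For reference, lines like these can be pulled out of the master logs with something along the lines of the following (the pod name is a placeholder):

  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master | grep "addLogicalPort took"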


It would also be helpful to get the must-gather while the affected pods still exist, so that we can see when they were scheduled on the node.
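
A minimal sketch of what to capture while the affected pods are still around (namespace and pod name here are just the ones from this report):

  # where the affected pods were scheduled, and their scheduling events
  oc get pods -n http-scale-passthrough -o wide
  oc describe pod -n http-scale-passthrough http-perf-122
  # standard must-gather, taken while the pods still exist
  oc adm must-gather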

Comment 3 zhaozhanqi 2021-09-22 07:48:43 UTC
Thanks to Qiujie for reporting this issue. This looks like the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=2003558

@Tim Please see https://bugzilla.redhat.com/show_bug.cgi?id=2003558#c17. There is a live cluster kubeconfig and a must-gather available for debugging, thanks.

Comment 4 zhaozhanqi 2021-09-22 07:51:19 UTC
And also this one: https://bugzilla.redhat.com/show_bug.cgi?id=1997205

Comment 19 bowredhat 2022-02-14 22:09:33 UTC
Was there ever a resolution on this? We ran into this issue on a 4.8.24 cluster, and then on a cluster after an upgrade from 4.9.17 to 4.9.18.

Comment 20 Anurag saxena 2022-11-01 13:24:29 UTC
@

Comment 22 Qiujie Li 2022-11-02 07:34:20 UTC
@anusaxen Added.