From the must-gather it's hard to tell exactly what happened, as the ovnkube-master logs have rotated, so I don't see the add for http-perf-122 -n http-scale-passthrough. On the nodes it looks like this pod has been rescheduled across different nodes.

When you repeat the test to trigger the problem, do you wait for ovnkube-master to finish deleting all the pods before you start the next test? For example, instead of just doing

  oc delete namespace http-scale-passthrough

and waiting for it to return, also run

  oc logs -n openshift-ovn-kubernetes <master> -c ovnkube-master --follow
  # wait for all the pods to finish getting deleted before starting the next test

To me this looks like the scale issues identified in https://bugzilla.redhat.com/show_bug.cgi?id=1959352. I can see it taking 5 seconds to annotate the pod in ovnkube-master for some other pods:

2021-09-17T15:53:47.025928199Z I0917 15:53:47.025886 1 pods.go:251] [http-scale-reencrypt/http-perf-303] addLogicalPort took 5.000645366s
2021-09-17T15:53:47.188129212Z I0917 15:53:47.188067 1 pods.go:251] [http-scale-reencrypt/http-perf-358] addLogicalPort took 5.085611168s
2021-09-17T15:53:47.243243616Z I0917 15:53:47.243197 1 pods.go:251] [http-scale-reencrypt/http-perf-281] addLogicalPort took 5.042022155s

It would also be helpful to get the must-gather while the affected pods still exist, so that we can see when they were scheduled on the node.
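For repeat runs, a minimal shell sketch of the suggested wait-before-next-test workflow might look like the following. The label selector (app=ovnkube-master) and the grep patterns are assumptions based on the log lines above, not a confirmed procedure; substitute the actual master pod name from your cluster.

  # Sketch only: assumes the ovnkube-master pods carry the label app=ovnkube-master.
  MASTER_POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
    -o jsonpath='{.items[0].metadata.name}')

  # Kick off the namespace deletion without blocking on it.
  oc delete namespace http-scale-passthrough --wait=false

  # Follow the master logs and watch for the namespace's pod deletions to
  # finish before starting the next test run (Ctrl-C once they have drained).
  oc logs -n openshift-ovn-kubernetes "$MASTER_POD" -c ovnkube-master --follow \
    | grep --line-buffered 'http-scale-passthrough'

  # Slow pod annotations can then be spotted by filtering for the
  # addLogicalPort timing lines shown above:
  oc logs -n openshift-ovn-kubernetes "$MASTER_POD" -c ovnkube-master \
    | grep 'addLogicalPort took'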
Thanks. Qiujie reported this issue; it looks like the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=2003558. @Tim, please see https://bugzilla.redhat.com/show_bug.cgi?id=2003558#c17: a live cluster kubeconfig and must-gather are available there for debugging, thanks.
And also this one: https://bugzilla.redhat.com/show_bug.cgi?id=1997205
Was there ever a resolution on this? We ran into this issue on a 4.8.24 cluster, and then on a cluster after upgrading from 4.9.17 to 4.9.18.
@anusaxen Added.