Bug 1968009
Summary: | OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | jamo luhrsen <jluhrsen>
Component: | Networking | Assignee: | Tim Rozet <trozet>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | achernet, bbennett, dcbw, ealcaniz, mmahmoud, stbenjam, trozet, vrutkovs
Version: | 4.8 | Flags: | trozet: needinfo-
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-08-20 19:13:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (jamo luhrsen, 2021-06-04 17:33:27 UTC)
This is most likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1959200. Once that fix is merged we can verify it.

Comment 2 (Dan Williams):

Just a drive-by quick investigation... picking one example, "redhat-marketplace-kn6v8":

~14:31:00 - the ovnkube-master leader is held by master-1 until 14:42:11. We have no logs of this because must-gather didn't get them; the first logs we have from master-1 are at 14:45:38:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wk4fl_ovnkube-master.log
This is probably the pre-upgrade ovnkube-master pod which, of course, must-gather doesn't get logs for :( I don't think we can debug much further, because we don't know what that ovnkube-master is doing during this time...

14:31:49 - pod scheduled:
"message": "Successfully assigned openshift-marketplace/redhat-marketplace-kn6v8 to ci-op-37wt74n6-8929c-srngm-worker-c-frb96",

14:31:50 - kubelet on node worker-c-frb96 starts the pod sandbox:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/nodes/ci-op-37wt74n6-8929c-srngm-worker-c-frb96/
Jun 04 14:31:50.278687 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:31:50.274051727Z" level=info msg="Running pod sandbox: openshift-marketplace/redhat-marketplace-kn6v8/POD" id=89fc87d9-be79-43ef-bed9-2ad4c027efe8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox

14:32:01 - CRI-O gets the request:
Jun 04 14:32:01.011199 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:01.011127668Z" level=info msg="Got pod network &{Name:redhat-marketplace-kn6v8 Namespace:openshift-marketplace ID:b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1 NetNS:/var/run/netns/c6303f1f-39c3-4e02-8460-d805a9e50e99 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"

14:32:01 - ovnkube-node gets the CNI ADD:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovnkube-node.log

14:32:21 - the CRI-O ADD request times out:
Jun 04 14:32:21.528729 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:21.528625889Z" level=error msg="Error adding network: [openshift-marketplace/redhat-marketplace-kn6v8:ovn-kubernetes]: error adding container to network \"ovn-kubernetes\": CNI request failed with status 400: '[openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] [openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding\n'"

14:33:03 - ovn-controller claims the port:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovn-controller.log

The big question is why ovn-controller on the scheduled node didn't claim the port until 14:33:03, but we can't figure that out because:
1) we don't have ovnkube-master logs from the leader during the period the pod was scheduled
2) we don't have more verbose logging from nbdb, sbdb, or northd that would show when ovnkube-master got around to adding the pod LSP
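For context on the timeout in the CRI-O log above: ovn-controller sets external_ids:ovn-installed="true" on the pod's OVS interface once it has claimed the logical port and installed the flows, and the CNI ADD waits for that marker before returning. The Go sketch below illustrates that kind of poll-with-timeout; it is not the actual ovn-kubernetes code, and the pollOVNInstalled helper, the 200 ms poll interval, and the example interface name are assumptions made for the illustration.

```go
// Minimal sketch of the wait that produces "timed out while waiting for OVS
// port binding": poll the pod's OVS interface until ovn-controller marks it
// with external_ids:ovn-installed="true", or give up at the deadline.
// Illustrative only; not the actual ovn-kubernetes implementation.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func pollOVNInstalled(ctx context.Context, iface string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	tick := time.NewTicker(200 * time.Millisecond) // assumed poll interval
	defer tick.Stop()

	for {
		select {
		case <-ctx.Done():
			// The failure mode reported in the CRI-O log above.
			return fmt.Errorf("timed out while waiting for OVS port binding on %s", iface)
		case <-tick.C:
			// "ovs-vsctl get Interface <name> external_ids:ovn-installed" prints
			// "true" once ovn-controller has bound the port.
			out, err := exec.CommandContext(ctx, "ovs-vsctl", "--timeout=15", "--if-exists",
				"get", "Interface", iface, "external_ids:ovn-installed").CombinedOutput()
			if err == nil && strings.Contains(string(out), "true") {
				return nil
			}
		}
	}
}

func main() {
	// Hypothetical host-side interface name for the example pod.
	if err := pollOVNInstalled(context.Background(), "b3362becec6f41a", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

In the failing run the marker did not appear until ovn-controller claimed the port at 14:33:03, well past the CNI request window, which is why CRI-O reported the 400 error at 14:32:21.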
Comment 3 (Vadim Rutkovsky):

(In reply to Dan Williams from comment #2)
> This is probably the pre-upgrade ovnkube-master pod which, of course, must-gather doesn't get logs for :(

Upgrade logs are stored at Loki; here are the ovn-kube master logs from before the upgrade (at initial setup):
https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622811450000%22,%221622811750000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400795131786825728%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20%7C%20pod_name%3D~%5C%22ovnkube-master.*%5C%22%22%7D%5D
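As an aside on the "big question" in comment #2: whether and when any chassis claimed the port can also be read directly from the OVN southbound database, since the Port_Binding row records the claiming chassis. The sketch below is a hedged illustration, assuming ovn-sbctl can reach the SB DB from wherever it runs (for example inside an sbdb/ovnkube-master pod) and that the logical port follows ovn-kubernetes' "<namespace>_<podname>" naming; the portBinding helper itself is hypothetical.

```go
// Sketch: ask the OVN southbound DB which chassis, if any, has claimed the
// pod's logical port. Assumes ovn-sbctl is on PATH and can reach the SB DB.
package main

import (
	"fmt"
	"os/exec"
)

func portBinding(namespace, pod string) (string, error) {
	// ovn-kubernetes names logical switch ports "<namespace>_<podname>".
	lport := fmt.Sprintf("%s_%s", namespace, pod)
	out, err := exec.Command("ovn-sbctl", "--timeout=15",
		"--columns=logical_port,chassis", "find", "Port_Binding",
		"logical_port="+lport).CombinedOutput()
	return string(out), err
}

func main() {
	out, err := portBinding("openshift-marketplace", "redhat-marketplace-kn6v8")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	// An empty chassis column means no ovn-controller has claimed the port yet.
	fmt.Print(out)
}
```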
(In reply to Vadim Rutkovsky from comment #3)
> Upgrade logs are stored at Loki, here are the ovn-kube master logs before upgrade (at initial setup)

I don't suppose we can ever get those into must-gather somehow? Thanks for the Loki link... though that only returns ~900 lines, while I'd expect quite a few more. The timestamps there don't correspond to anything in must-gather either; they appear to end a full hour+ (12:59:xx) before the upgraded pods start?

Comment 5 (Tim Rozet):

Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for 1959200.

(In reply to Tim Rozet from comment #5)
> Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for 1959200.

tl;dr, I think this is resolved.

So, just looking for the symptom, the "OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding" message seen in failures of "pods should successfully create sandboxes by other":

1. the last time it was seen in a periodic 4.9 job was 10 days ago [0]
2. the downstream merge [1] happened 9 days ago

One thing I noticed, though I'm not sure how important it is: one PR [2] that landed for 1959200 doesn't seem to have made it into the downstream merge. Three are mentioned in 1959200 and only two are there:

❯ git lg | rg 'pass Kube API auth via|cni: cancel old pod sandbox add|Adds back checking OF flows for CNI'
7dd78fc7 2021-07-17 cni: pass Kube API auth via cnishim response, not CNI config file [Dan Williams]
22bed6a1 2021-05-12 Adds back checking OF flows for CNI [Tim Rozet]

I know you know, but all/some of these things still need to be backported. We still see the same symptoms currently in 4.8 [3].

[0] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[1] https://github.com/openshift/ovn-kubernetes/pull/609
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2275
[3] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Did the message change slightly?
I see "timed out waiting for OVS port binding (ovn-installed)" on e2e-aws-ovn jobs, example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432 Testgrid: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn Yes, the message changed recently via https://github.com/ovn-org/ovn-kubernetes/pull/2275. Previously that message was printed for many reasons, a good number of them unrelated to OVN actually setting ovn-installed. For example, waiting too long for short-lived pods that had already been torn down and recreated by the apiserver (happens in the StatefulSet tests). The messages should be much more accurate now and point to the actual problem. If we do see messages about ovn-installed now, it's likely thats the real culprit. But taking one run there: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432 At least one instance of ovn-installed failure for the myphp-3-build pod is because the database transferred RAFT leadership to snapshot, and northd didn't reconnect to the right DB in time; that is a known problem and will be fixed in https://github.com/openshift/ovn-kubernetes/pull/630. I suspect many more occurrences are due to this. There are two possible error messages now: "timed out waiting for annotations" "timed out waiting for OVS port binding" According to ci search, there are still instances of the first one: https://search.ci.openshift.org/?search=timed+out+waiting+for+annotations&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job But I dont see any instances of the port binding: https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job The timed out waiting for annotations is most likely ovnkube-master performance related, which has a bug and a bunch of fixes that just went in: https://github.com/openshift/ovn-kubernetes/pull/667 https://bugzilla.redhat.com/show_bug.cgi?id=1959352 Duping this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1959200 *** This bug has been marked as a duplicate of bug 1959200 *** |