Description of problem:

Recent ovn-upgrade jobs are failing the "pods should successfully create sandboxes by other" test case, and the logs point to OVS port bindings not happening in time. The upgrade job has a handful of other persistent test case failures that may be related: they indicate problems with routes not being ready and the apiserver pod not being available (see job link below), which are possibly all due to network plumbing not working in a timely fashion. Instead of filing new bugs for those issues at this time, I'm hoping to get this one resolved and then check whether those problems are gone.

Example:

ns/openshift-network-diagnostics pod/network-check-source-7bdc6fbb47-2tm25 node/ci-op-37wt74n6-8929c-srngm-worker-b-5jtpb - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-source-7bdc6fbb47-2tm25_openshift-network-diagnostics_22fd62e9-5799-4542-b530-b6491c6caa53_0(c25b2493850c2231ce9cab9e9ee9f0c1df6916b7c2608ff4f084bd48c733d3cd): [openshift-network-diagnostics/network-check-source-7bdc6fbb47-2tm25:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-network-diagnostics/network-check-source-7bdc6fbb47-2tm25 c25b2493850c2231ce9cab9e9ee9f0c1df6916b7c2608ff4f084bd48c733d3cd] [openshift-network-diagnostics/network-check-source-7bdc6fbb47-2tm25 c25b2493850c2231ce9cab9e9ee9f0c1df6916b7c2608ff4f084bd48c733d3cd] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

which is from this job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728

Full ovnkube pod log with the same message:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-wt8b2_ovnkube-node.log

I saw https://bugzilla.redhat.com/show_bug.cgi?id=1959200, which matched the "OVS port binding" description of this bug, but reading the details it seems it might only be scale-related, which is not the case for this upgrade job failure. It looks like the PR for that bug is on its way in, so soon we may be able to easily check whether it helps.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This is most likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1959200. Once that fix is merged, we can verify it.
Just a drive-by quick investigation... picking one example "redhat-marketplace-kn6v8":

~14:31:00 - ovnkube-master leader is held by master-1 until 14:42:11
We have no logs of this because must-gather didn't get them; the first logs we have of master-1 are at 14:45:38:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wk4fl_ovnkube-master.log

This is probably the pre-upgrade ovnkube-master pod which, of course, must-gather doesn't get logs for :(

I don't think we can debug much further, because we don't know what that ovnkube-master is doing during this time...

14:31:49 - pod scheduled
"message": "Successfully assigned openshift-marketplace/redhat-marketplace-kn6v8 to ci-op-37wt74n6-8929c-srngm-worker-c-frb96",

14:31:50 - kubelet on node worker-c-frb96 starts pod sandbox
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/nodes/ci-op-37wt74n6-8929c-srngm-worker-c-frb96/
Jun 04 14:31:50.278687 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:31:50.274051727Z" level=info msg="Running pod sandbox: openshift-marketplace/redhat-marketplace-kn6v8/POD" id=89fc87d9-be79-43ef-bed9-2ad4c027efe8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox

14:32:01 - CRIO gets the request
Jun 04 14:32:01.011199 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:01.011127668Z" level=info msg="Got pod network &{Name:redhat-marketplace-kn6v8 Namespace:openshift-marketplace ID:b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1 NetNS:/var/run/netns/c6303f1f-39c3-4e02-8460-d805a9e50e99 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"

14:32:01 - ovnkube-node gets CNI ADD
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovnkube-node.log

14:32:21 - CRIO ADD request times out
Jun 04 14:32:21.528729 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:21.528625889Z" level=error msg="Error adding network: [openshift-marketplace/redhat-marketplace-kn6v8:ovn-kubernetes]: error adding container to network \"ovn-kubernetes\": CNI request failed with status 400: '[openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] [openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding\n'"

14:33:03 - ovn-controller claims port
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovn-controller.log

The big question is why ovn-controller on the scheduled node didn't claim the port until 14:33:03, but we can't figure that out because:
1) we don't have ovnkube-master logs from the leader during the period the pod was scheduled
2) we don't have more verbose logging from nbdb or sbdb or northd that would show when ovnkube-master got around to adding the pod LSP
(In reply to Dan Williams from comment #2)
> Just a drive-by quick investigation... picking one example
> "redhat-marketplace-kn6v8":
>
> ~14:31:00 ovnkube-master leader is held by master-1 until 14:42:11
> we have no logs of this because must-gather didn't get them; the first logs
> we have of master-1 are at 14:45:38:
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wk4fl_ovnkube-master.log
>
> This is probably the pre-upgrade ovnkube-master pod which, of course,
> must-gather doesn't get logs for :(

Upgrade logs are stored in Loki; here are the ovn-kube master logs from before the upgrade (at initial setup):
https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622811450000%22,%221622811750000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400795131786825728%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20%7C%20pod_name%3D~%5C%22ovnkube-master.*%5C%22%22%7D%5D

> I don't think we can debug much further, because we don't know what that
> ovnkube-master is doing during this time...
(In reply to Vadim Rutkovsky from comment #3)
> Upgrade logs are stored at Loki, here are the ovn-kube master logs before
> upgrade (at initial setup):
> https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622811450000%22,%221622811750000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400795131786825728%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20%7C%20pod_name%3D~%5C%22ovnkube-master.*%5C%22%22%7D%5D

I don't suppose we can ever get those into must-gather somehow?

Thanks for the Loki link... though that only returns ~900 lines, while I'd expect quite a few more. The timestamps there don't correspond to anything in must-gather either; they appear to end a full hour+ (12:59:xx) before the upgraded pods start?
Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for 1959200.
(In reply to Tim Rozet from comment #5)
> Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for
> 1959200.

tl;dr, I think this is resolved.

So, just looking for the symptom of the "OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding" message seen in failures of "pods should successfully create sandboxes by other":

1. the last time it was seen in a periodic job was 10 days ago [0]
2. the downstream merge [1] happened 9 days ago

One thing I noticed, though I'm not sure how important it is: one PR [2] that landed for 1959200 doesn't seem to have made it into the downstream merge. Three commits are mentioned in 1959200, but only two are there:

❯ git lg | rg 'pass Kube API auth via|cni: cancel old pod sandbox add|Adds back checking OF flows for CNI'
7dd78fc7 2021-07-17 cni: pass Kube API auth via cnishim response, not CNI config file [Dan Williams]
22bed6a1 2021-05-12 Adds back checking OF flows for CNI [Tim Rozet]

I know you know, but all/some of these still need to be backported. We still see the same symptoms currently in 4.8 [3].

[0] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[1] https://github.com/openshift/ovn-kubernetes/pull/609
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2275
[3] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Did the message change slightly? I see "timed out waiting for OVS port binding (ovn-installed)" on e2e-aws-ovn jobs, example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432

Testgrid: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn
Yes, the message changed recently via https://github.com/ovn-org/ovn-kubernetes/pull/2275. Previously that message was printed for many reasons, a good number of them unrelated to OVN actually setting ovn-installed. For example, waiting too long for short-lived pods that had already been torn down and recreated by the apiserver (which happens in the StatefulSet tests). The messages should be much more accurate now and point to the actual problem. If we do see messages about ovn-installed now, it's likely that's the real culprit.

But taking one run there:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432

At least one instance of an ovn-installed failure, for the myphp-3-build pod, is because the database transferred RAFT leadership in order to snapshot, and northd didn't reconnect to the right DB in time; that is a known problem and will be fixed in https://github.com/openshift/ovn-kubernetes/pull/630. I suspect many more occurrences are due to this.
There are two possible error messages now:

"timed out waiting for annotations"
"timed out waiting for OVS port binding"

According to CI search, there are still instances of the first one:
https://search.ci.openshift.org/?search=timed+out+waiting+for+annotations&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

But I don't see any instances of the port binding one:
https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The "timed out waiting for annotations" case is most likely ovnkube-master performance related, which has a bug and a bunch of fixes that just went in:
https://github.com/openshift/ovn-kubernetes/pull/667
https://bugzilla.redhat.com/show_bug.cgi?id=1959352

Duping this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1959200
*** This bug has been marked as a duplicate of bug 1959200 ***