Bug 1968009
Summary: | OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | jamo luhrsen <jluhrsen>
Component: | Networking | Assignee: | Tim Rozet <trozet>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | achernet, bbennett, dcbw, ealcaniz, mmahmoud, stbenjam, trozet, vrutkovs
Version: | 4.8 | Flags: | trozet: needinfo-
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-08-20 19:13:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (jamo luhrsen, 2021-06-04 17:33:27 UTC)
This is most likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1959200. Once that fix is merged we can verify it.

Comment 2 (Dan Williams):

Just a drive-by quick investigation... picking one example, "redhat-marketplace-kn6v8":

~14:31:00 - the ovnkube-master leader is held by master-1 until 14:42:11. We have no logs of this because must-gather didn't get them; the first logs we have from master-1 are at 14:45:38:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-master-wk4fl_ovnkube-master.log
This is probably the pre-upgrade ovnkube-master pod which, of course, must-gather doesn't get logs for :( I don't think we can debug much further, because we don't know what that ovnkube-master is doing during this time...

14:31:49 - pod scheduled:
"message": "Successfully assigned openshift-marketplace/redhat-marketplace-kn6v8 to ci-op-37wt74n6-8929c-srngm-worker-c-frb96",

14:31:50 - kubelet on node worker-c-frb96 starts the pod sandbox:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/nodes/ci-op-37wt74n6-8929c-srngm-worker-c-frb96/
Jun 04 14:31:50.278687 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:31:50.274051727Z" level=info msg="Running pod sandbox: openshift-marketplace/redhat-marketplace-kn6v8/POD" id=89fc87d9-be79-43ef-bed9-2ad4c027efe8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox

14:32:01 - CRI-O gets the request:
Jun 04 14:32:01.011199 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:01.011127668Z" level=info msg="Got pod network &{Name:redhat-marketplace-kn6v8 Namespace:openshift-marketplace ID:b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1 NetNS:/var/run/netns/c6303f1f-39c3-4e02-8460-d805a9e50e99 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"

14:32:01 - ovnkube-node gets the CNI ADD:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovnkube-node.log

14:32:21 - the CRI-O ADD request times out:
Jun 04 14:32:21.528729 ci-op-37wt74n6-8929c-srngm-worker-c-frb96 crio[1598]: time="2021-06-04 14:32:21.528625889Z" level=error msg="Error adding network: [openshift-marketplace/redhat-marketplace-kn6v8:ovn-kubernetes]: error adding container to network \"ovn-kubernetes\": CNI request failed with status 400: '[openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] [openshift-marketplace/redhat-marketplace-kn6v8 b3362becec6f41a4b4ccbabf78205fa134967ae51a9bb9b20d7e764fac1ed6c1] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding\n'"

14:33:03 - ovn-controller claims the port:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400795131786825728/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-5kwgt_ovn-controller.log

The big question is why ovn-controller on the scheduled node didn't claim the port until 14:33:03, but we can't figure that out because:
1) we don't have ovnkube-master logs from the leader during the period the pod was scheduled
2) we don't have more verbose logging from nbdb, sbdb, or northd that would show when ovnkube-master got around to adding the pod LSP
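For context on the timeout in the CRI-O log above: ovn-controller sets external_ids:ovn-installed="true" on the pod's OVS interface once it has claimed the logical port and installed the flows, and the CNI ADD waits for that marker before returning. The Go sketch below illustrates that kind of poll-with-timeout; it is not the actual ovn-kubernetes code, and the pollOVNInstalled helper, the 200 ms poll interval, and the example interface name are assumptions made for the illustration.

```go
// Minimal sketch of the wait that produces "timed out while waiting for OVS
// port binding": poll the pod's OVS interface until ovn-controller marks it
// with external_ids:ovn-installed="true", or give up at the deadline.
// Illustrative only; not the actual ovn-kubernetes implementation.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func pollOVNInstalled(ctx context.Context, iface string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	tick := time.NewTicker(200 * time.Millisecond) // assumed poll interval
	defer tick.Stop()

	for {
		select {
		case <-ctx.Done():
			// The failure mode reported in the CRI-O log above.
			return fmt.Errorf("timed out while waiting for OVS port binding on %s", iface)
		case <-tick.C:
			// "ovs-vsctl get Interface <name> external_ids:ovn-installed" prints
			// "true" once ovn-controller has bound the port.
			out, err := exec.CommandContext(ctx, "ovs-vsctl", "--timeout=15", "--if-exists",
				"get", "Interface", iface, "external_ids:ovn-installed").CombinedOutput()
			if err == nil && strings.Contains(string(out), "true") {
				return nil
			}
		}
	}
}

func main() {
	// Hypothetical host-side interface name for the example pod.
	if err := pollOVNInstalled(context.Background(), "b3362becec6f41a", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

In the failing run the marker did not appear until ovn-controller claimed the port at 14:33:03, well past the CNI request window, which is why CRI-O reported the 400 error at 14:32:21.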
Comment 3 (Vadim Rutkovsky):

(In reply to Dan Williams from comment #2)
> This is probably the pre-upgrade ovnkube-master pod which, of course, must-gather doesn't get logs for :(

Upgrade logs are stored at Loki; here are the ovn-kube master logs from before the upgrade (at initial setup):
https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622811450000%22,%221622811750000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400795131786825728%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20%7C%20pod_name%3D~%5C%22ovnkube-master.*%5C%22%22%7D%5D
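As an aside on the "big question" in comment #2: whether and when any chassis claimed the port can also be read directly from the OVN southbound database, since the Port_Binding row records the claiming chassis. The sketch below is a hedged illustration, assuming ovn-sbctl can reach the SB DB from wherever it runs (for example inside an sbdb/ovnkube-master pod) and that the logical port follows ovn-kubernetes' "<namespace>_<podname>" naming; the portBinding helper itself is hypothetical.

```go
// Sketch: ask the OVN southbound DB which chassis, if any, has claimed the
// pod's logical port. Assumes ovn-sbctl is on PATH and can reach the SB DB.
package main

import (
	"fmt"
	"os/exec"
)

func portBinding(namespace, pod string) (string, error) {
	// ovn-kubernetes names logical switch ports "<namespace>_<podname>".
	lport := fmt.Sprintf("%s_%s", namespace, pod)
	out, err := exec.Command("ovn-sbctl", "--timeout=15",
		"--columns=logical_port,chassis", "find", "Port_Binding",
		"logical_port="+lport).CombinedOutput()
	return string(out), err
}

func main() {
	out, err := portBinding("openshift-marketplace", "redhat-marketplace-kn6v8")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	// An empty chassis column means no ovn-controller has claimed the port yet.
	fmt.Print(out)
}
```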
(In reply to Vadim Rutkovsky from comment #3)
> Upgrade logs are stored at Loki, here are the ovn-kube master logs before upgrade (at initial setup)

I don't suppose we can ever get those into must-gather somehow? Thanks for the Loki link... though that only returns ~900 lines, while I'd expect quite a few more. The timestamps there don't correspond to anything in must-gather either; they appear to end a full hour+ (12:59:xx) before the upgraded pods start?

Comment 5 (Tim Rozet):

Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for 1959200.

(In reply to Tim Rozet from comment #5)
> Jamo, can you check the jobs for 4.9 upgrade now? We had fixes go in for 1959200.

tl;dr, I think this is resolved.

So, just looking for the symptom, the "OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding" message seen in failures of "pods should successfully create sandboxes by other":

1. the last time it was seen in a periodic 4.9 job was 10 days ago [0]
2. the downstream merge [1] happened 9 days ago

One thing I noticed, though I'm not sure how important it is: one PR [2] that landed for 1959200 doesn't seem to have made it into the downstream merge. Three are mentioned in 1959200 and only two are there:

❯ git lg | rg 'pass Kube API auth via|cni: cancel old pod sandbox add|Adds back checking OF flows for CNI'
7dd78fc7 2021-07-17 cni: pass Kube API auth via cnishim response, not CNI config file [Dan Williams]
22bed6a1 2021-05-12 Adds back checking OF flows for CNI [Tim Rozet]

I know you know, but all/some of these things still need to be backported. We still see the same symptoms currently in 4.8 [3].

[0] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[1] https://github.com/openshift/ovn-kubernetes/pull/609
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2275
[3] https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Did the message change slightly?
I see "timed out waiting for OVS port binding (ovn-installed)" on e2e-aws-ovn jobs, example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432 Testgrid: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn Yes, the message changed recently via https://github.com/ovn-org/ovn-kubernetes/pull/2275. Previously that message was printed for many reasons, a good number of them unrelated to OVN actually setting ovn-installed. For example, waiting too long for short-lived pods that had already been torn down and recreated by the apiserver (happens in the StatefulSet tests). The messages should be much more accurate now and point to the actual problem. If we do see messages about ovn-installed now, it's likely thats the real culprit. But taking one run there: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1420661062511890432 At least one instance of ovn-installed failure for the myphp-3-build pod is because the database transferred RAFT leadership to snapshot, and northd didn't reconnect to the right DB in time; that is a known problem and will be fixed in https://github.com/openshift/ovn-kubernetes/pull/630. I suspect many more occurrences are due to this. There are two possible error messages now: "timed out waiting for annotations" "timed out waiting for OVS port binding" According to ci search, there are still instances of the first one: https://search.ci.openshift.org/?search=timed+out+waiting+for+annotations&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job But I dont see any instances of the port binding: https://search.ci.openshift.org/?search=timed+out+while+waiting+for+OVS+port+binding&maxAge=336h&context=1&type=bug%2Bjunit&name=periodic.*4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job The timed out waiting for annotations is most likely ovnkube-master performance related, which has a bug and a bunch of fixes that just went in: https://github.com/openshift/ovn-kubernetes/pull/667 https://bugzilla.redhat.com/show_bug.cgi?id=1959352 Duping this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1959200 *** This bug has been marked as a duplicate of bug 1959200 *** |