+++ This bug was initially created as a clone of Bug #1886922 +++

test: "[sig-network] pods should successfully create sandboxes by getting pod" is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D+pods+should+successfully+create+sandboxes+by+getting+pod

FIXME: Replace this paragraph with a particular job URI from the search results to ground discussion. A given test may fail for several reasons, and this bug should be scoped to one of those reasons. Ideally you'd pick a job showing the most common reason, but since that's hard to determine, you may also choose to pick a job at random. Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log.

--- Additional comment from Jan Chaloupka on 2020-10-12 12:27:34 UTC ---

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1315588345421631488

kubelet logs from the masters contain:

```
Oct 12 11:06:54.725413 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: I1012 11:06:54.725392 1536 kubelet.go:1680] Creating a mirror pod for static pod "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(edb4583065f51a0ee864e663b2289f3a)"
Oct 12 11:06:54.737018 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: E1012 11:06:54.736316 1536 kubelet.go:1682] Failed creating a mirror pod for "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(edb4583065f51a0ee864e663b2289f3a)": pods "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0" is forbidden: a mirror pod may not reference service accounts
...
Oct 12 11:10:40.771726 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: E1012 11:10:40.771514 1536 pod_workers.go:191] Error syncing pod f9f51de44362b05394788110d92dcade ("recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)"), skipping: failed to "CreatePodSandbox" for "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)" with CreatePodSandboxError: "CreatePodSandbox for pod \"recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra_f9f51de44362b05394788110d92dcade_0(ba794b471acdc719df6c91675df75d1a70ebebaf90e6b6513873359b50769c13): Multus: [openshift-infra/recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0]: error getting pod: pods \"recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0\" not found"
...
```

--- Additional comment from Jan Chaloupka on 2020-10-12 12:35:38 UTC ---

There are other "similar" tests failing which might be caused by the same underlying issue:
- [sig-network] pods should successfully create sandboxes by other
- [sig-network] pods should successfully create sandboxes by reading container
- [sig-network] pods should successfully create sandboxes by writing network status

--- Additional comment from Jan Chaloupka on 2020-10-12 12:38:13 UTC ---

- [sig-network] pods should successfully create sandboxes by writing child

--- Additional comment from Aniket Bhat on 2020-10-12 14:41:21 UTC ---

Assigning to Tim.
--- Additional comment from Eric Paris on 2020-10-12 15:00:26 UTC ---

This bug has set a target release without specifying a severity. As part of triage, when determining the importance of bugs, a severity should be specified. Since these bugs have not been properly triaged, we are removing the target release. Teams will need to add a severity before setting the target release again.

--- Additional comment from Tim Rozet on 2020-10-12 19:25:33 UTC ---

There looks to be a problem with the test itself: it categorizes "error getting pod: pods "pod1" not found" as a failure, when really that usually happens because the pod was deleted right after it was created, so sandbox creation can't find it anymore. An example here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/301/pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry/1315650856250511360

ns/e2e-gc-8508 pod/pod1 node/ip-10-0-186-62.us-west-2.compute.internal - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod1_e2e-gc-8508_ef0dae73-c25e-4216-a1ab-3a0d06f1c5d4_0(ef9f1f04d668e4b76df25d24b6816f3dce2d44ebef00ff523caf2623e092af3e): Multus: [e2e-gc-8508/pod1]: error getting pod: pods "pod1" not found

I1012 14:33:05.508991 1 scheduler.go:597] "Successfully bound pod to node" pod="e2e-gc-8508/pod1" node="ip-10-0-186-62.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=3

ovnkube adds the port, then sees a delete 1 second later:

I1012 14:33:06.305650 1 pods.go:78] Deleting pod: e2e-gc-8508/pod1

We should fix the test to remove these false positives, then recheck CI afterwards. David, can you create a bug/fix for this test and link it to this bug, please?
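For illustration, here is a minimal sketch (in Go, not the actual openshift/origin test code) of the filtering idea described above: treat a "pods ... not found" sandbox error as noise when the event fired after the pod had already been deleted. The sandboxEvent type and isFalsePositive helper are hypothetical names used only for this sketch.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

type sandboxEvent struct {
	Message   string
	Timestamp time.Time
}

// isFalsePositive returns true when the failure message is the benign
// "pods ... not found" case and the pod had already been deleted before
// the event fired.
func isFalsePositive(ev sandboxEvent, podDeletedAt *time.Time) bool {
	if podDeletedAt == nil || ev.Timestamp.Before(*podDeletedAt) {
		return false // pod still existed; this is a real sandbox failure
	}
	return strings.Contains(ev.Message, "error getting pod: pods") &&
		strings.Contains(ev.Message, "not found")
}

func main() {
	deleted := time.Now().Add(-2 * time.Second)
	ev := sandboxEvent{
		Message:   `Multus: [e2e-gc-8508/pod1]: error getting pod: pods "pod1" not found`,
		Timestamp: time.Now(),
	}
	fmt.Println("ignore event:", isFalsePositive(ev, &deleted))
}
```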
--- Additional comment from Alexander Constantinescu on 2020-10-26 15:33:00 UTC ---

--- Additional comment from Emily Moss on 2020-11-07 00:01:05 UTC ---

Build watcher saw this today on 4.7 CI.

--- Additional comment from Tim Rozet on 2020-11-09 21:17:07 UTC ---

David updated the test case:
https://github.com/openshift/origin/commit/06a30a752a2362ca60ae17d584565d3ce92e14e0#diff-a8882c52a9bd2ef6a4b8294fb52ec3342d37aa809581a438c3fabd02bcc63467

and now I see some more legitimate failures that need to be investigated:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/317/pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry/1325860268961435648/

23.80 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webserver-57656d96fb-8ckmc_e2e-deployment-9222_7312d362-35aa-404a-ab7a-dca63ab41d1e_0(fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68): [e2e-deployment-9222/webserver-57656d96fb-8ckmc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-deployment-9222/webserver-57656d96fb-8ckmc] failed to configure pod interface: timed out waiting for pod flows for pod: webserver-57656d96fb-8ckmc, error: timed out waiting for the condition

I see in ovnkube-master:

I1109 19:00:35.952244 1 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'
I1109 19:00:36.082358 1 pods.go:272] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] addLogicalPort took 90.186464ms
I1109 19:00:40.313403 1 pods.go:78] Deleting pod: e2e-deployment-9222/webserver-57656d96fb-8ckmc
I1109 19:00:40.313416 1 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'

The pod was added and deleted in less than 5 seconds, but we see on the CNI side:

I1109 19:00:39.783451 3500 cniserver.go:147] Waiting for ADD result for pod e2e-deployment-9222/webserver-57656d96fb-8ckmc
I1109 19:00:39.783475 3500 cni.go:147] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] dispatching pod network request &{ADD e2e-deployment-9222 webserver-57656d96fb-8ckmc fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68 /var/run/netns/108d17e4-4686-4d3e-8bac-b713a047cc5e eth0 0xc00067ff00}
I1109 19:00:41.585640 3500 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'
I1109 19:01:01.524480 3500 cni.go:157] [e2e-deployment-9222/webserver-85f6b8449-f9zxm] CNI request &{DEL e2e-deployment-9222 webserver-85f6b8449-f9zxm ed235ef6f42896a855c0bee6467a988b0775f0d28f089440b628fad8974a1606 /var/run/netns/1be4344f-79f1-4493-ad2f-8d2f092c052a eth0 0xc000646900}, result "", err <nil>
I1109 19:01:01.813063 3500 cni.go:157] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] CNI request &{ADD e2e-deployment-9222 webserver-57656d96fb-8ckmc fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68 /var/run/netns/108d17e4-4686-4d3e-8bac-b713a047cc5e eth0 0xc00067ff00}, result "", err failed to configure pod interface: timed out waiting for pod flows for pod: webserver-57656d96fb-8ckmc, error: timed out waiting for the condition

For some reason the delete came in 23 seconds later... investigating.

--- Additional comment from Tim Rozet on 2020-11-11 14:58:35 UTC ---

So it turns out that in OVN we can have a series of events like this:

1. A pod add happens (ovnkube-node is waiting for the annotation from ovnkube-master).
2. ovnkube-master annotates the pod.
3. ovnkube-node begins the CNI cmdAdd.
4. A pod delete happens (ovnkube-master deletes the LSP).
5. ovnkube-node, in cmdAdd, is stuck waiting 20 seconds for an OpenFlow flow to show up for the now-deleted pod.

This results in a failed-to-add-pod time of over 20 seconds. Really, ovnkube-node should be able to detect that the pod was deleted and cancel the wait for flows. Similarly, we also wait on the annotation for some time; we could cancel that wait as well.
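To make the cancellation idea concrete, here is a minimal sketch, assuming Go (the language ovn-kubernetes is written in) but not taken from the actual ovn-kubernetes code; waitForPodFlows, flowsPresent, and cancelOnPodDelete are hypothetical names. The point is that a pod-delete handler can cancel a shared context so an in-flight CNI ADD stops waiting instead of consuming the full timeout:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPodFlows polls until the pod's flows appear, the context is
// cancelled (e.g. because the pod was deleted), or the timeout expires.
func waitForPodFlows(ctx context.Context, flowsPresent func() bool, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for pod flows: %w", ctx.Err())
		case <-ticker.C:
			if flowsPresent() {
				return nil
			}
		}
	}
}

func main() {
	// A pod-delete event handler would call cancelOnPodDelete so the
	// in-flight CNI ADD stops waiting instead of blocking for 20 seconds.
	ctx, cancelOnPodDelete := context.WithCancel(context.Background())

	go func() {
		time.Sleep(1 * time.Second) // simulate the delete arriving shortly after the add
		cancelOnPodDelete()
	}()

	err := waitForPodFlows(ctx, func() bool { return false }, 20*time.Second)
	fmt.Println("result:", err) // returns after ~1s rather than the full 20s
}
```

The same pattern would apply to the annotation wait mentioned above: share a per-pod context between the add and delete handlers and cancel it on delete.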
--- Additional comment from Fabiano Franz on 2020-11-11 16:54:57 UTC ---

Correlated - [sig-network] pods should successfully create sandboxes by not timing out

--- Additional comment from Antonio Ojea on 2020-11-13 13:54:25 UTC ---

--- Additional comment from on 2020-11-27 07:47:09 UTC ---

Still happens frequently in the last 24 hours:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-aws-mirrors-4.7
e.g.: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-mirrors-4.7/1332044148731023360

--- Additional comment from Tim Rozet on 2020-12-16 02:09:47 UTC ---

The new failures look like they have nothing to do with ovn-kubernetes. I can see this happens:

Dec 12 05:55:13.781773 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.781182 1683 kubelet.go:1834] SyncLoop (ADD, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:13.807757 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.801436 1683 kubelet.go:1841] SyncLoop (UPDATE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:13.862733 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.862183 1683 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")
Dec 12 05:55:14.045725 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.045445 1683 kubelet.go:1850] SyncLoop (DELETE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.058721 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.058498 1683 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")
Dec 12 05:55:14.090579 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.089034 1683 kubelet.go:1850] SyncLoop (DELETE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.096983 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.096951 1683 kubelet.go:1844] SyncLoop (REMOVE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.100720 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.097149 1683 kubelet.go:2038] Failed to delete pod "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)", err: pod not found
Dec 12 05:55:14.299745 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.299337 1683 operation_generator.go:663] MountVolume.SetUp succeeded for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")
Dec 12 05:55:14.584995 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.581377 1683 kuberuntime_manager.go:425] No sandbox for pod "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)" can be found. Need to start a new one
Dec 12 05:55:14.590742 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:14.589448461Z" level=info msg="Running pod sandbox: e2e-pods-5892/pod-submit-status-2-4/POD" id=9eb340e4-0a7b-419b-a2b0-025802037287 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 12 05:55:17.936748 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:17.934805119Z" level=info msg="Got pod network &{Name:pod-submit-status-2-4 Namespace:e2e-pods-5892 ID:f42209c3922dcdfe5692971a9ac3fc47532efb7690b2fdd9b4e061a1efd3e2ef NetNS:/var/run/netns/cd7e6f4b-1650-45ab-ba60-3126086c6a32 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Dec 12 05:55:18.009733 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:18.005337584Z" level=error msg="Error adding network: Multus: [e2e-pods-5892/pod-submit-status-2-4]: error getting pod: pods \"pod-submit-status-2-4\" not found"

tl;dr: the pod is deleted right after it is created, and the kube runtime still tries to make the sandbox anyway.
At that point crio calls Multus, and Multus cannot find the pod, so it errors out. ovn-kubernetes never even gets a CNI ADD for the pod. This looks like either expected behavior (in which case the test needs to be tweaked) or a bug in kubelet.
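For anyone wanting to reproduce the benign variant locally, here is a rough sketch of the create-then-delete pattern these e2e tests exercise. This is a hypothetical reproducer, not the actual e2e test code; it assumes client-go, a reachable cluster, and a default kubeconfig, and the pod name and sleep duration are arbitrary:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "quick-delete", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "pause", Image: "registry.k8s.io/pause:3.9"}},
		},
	}

	if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// Delete before the kubelet has had a chance to set up the sandbox.
	time.Sleep(300 * time.Millisecond)
	if err := client.CoreV1().Pods("default").Delete(ctx, "quick-delete", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("created and deleted pod; check namespace events for FailedCreatePodSandBox noise")
}
```

If the timing races the way described above, the namespace events may show a FailedCreatePodSandBox event with the same "error getting pod: pods ... not found" message even though nothing is actually wrong.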
Seems to be cropping up again.
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi/1347487017306427392
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.7/1347586669322178560
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.7/1347592202766782464
Build watcher: This bug is linked to the top four failing jobs for release 4.7, https://sippy.ci.openshift.org/?release=4.7#TopFailingTestsWithABug
On oVirt, the "[sig-network] pods should successfully create...." tests are always failing, at least one of them each run. I know they are ignored in terms of whether the job passes or not, but it is still annoying. I saw this is the only bug that is open, but do we know why? Is it a real problem that needs to be fixed, or just a problem in the test? For reference:
https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+&maxAge=336h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.7&maxMatches=5&maxBytes=20971520&groupBy=job
Some of those errors from CI in the last comment look like this:

ns/e2e-pods-4806 pod/pod-submit-status-2-0 node/ovirt12-xw2l5-worker-0-9z6pb - 21.03 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Tracking that at https://github.com/kubernetes/kubernetes/issues/98142 / https://bugzilla.redhat.com/show_bug.cgi?id=1915085

That is different than the errors linked in the initial report, fwiw, which look like:

ns/e2e-deployment-5578 pod/webserver-9569696c8-tgks7 node/ci-op-2wxd6sh9-56515-7jz4t-worker-6xxnz - 21.99 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webserver-9569696c8-tgks7_e2e-deployment-5578_dbb1018e-5a39-4710-94bb-b193ae9718ec_0(c407d51fd9da8c86c345bd42c067a222ccdd2513ac48ff9b03309e26bd743204): [e2e-deployment-5578/webserver-9569696c8-tgks7:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'pods "webserver-9569696c8-tgks7" not found
This is another test that involves a lot of pods being created and then deleted quickly. I suspect this is a manifestation of the race condition in #1915085. Will track there. *** This bug has been marked as a duplicate of bug 1915085 ***
I see this was closed as a duplicate, but https://bugzilla.redhat.com/show_bug.cgi?id=1915085 is verified and CI is still failing (see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.7/1361939592273465344). Reopening the BZ.
Following comment #24, creating and deleting a static pod prior to start works without errors.

4.8.0-0.nightly-2021-03-22-104536
Should this be closed now?
I see in the linked gcp-rt job that there is a long delay (about 20 seconds) before the kube runtime tries to create the sandbox:

Mar 30 09:14:49.322899 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz hyperkube[1554]: I0330 09:14:49.322891 1554 kubelet.go:1920] SyncLoop (ADD, "api"): "webserver-fc8b59899-w2tvg_e2e-deployment-427(d5e58a4b-c3ac-4bf4-b6b0-13e8cb4142bf)"
Mar 30 09:15:09.297827 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz crio[1524]: time="2021-03-30 09:15:09.297415658Z" level=info msg="Running pod sandbox: e2e-deployment-427/webserver-fc8b59899-w2tvg/POD" id=2d15b70d-461a-46ec-8a99-ee0e9d9a6249
Mar 30 09:15:09.574818 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz crio[1524]: time="2021-03-30 09:15:09.574201273Z" level=error msg="Error adding network: Multus: [e2e-deployment-427/webserver-fc8b59899-w2tvg]: error getting pod: pods \"webserver-fc8b59899-w2tvg\" not found"

The question is why it takes the kube runtime so long to start the sandbox. Ryan took a look at this with me and we think there is some type of network outage when trying to reach the API service. Either way, the gcp-rt job looks pretty unstable with a lot of failures, and on the other gcp nightly job I don't see the failure. I think it's OK to close the bug for now and reopen it if we see it again.
*** Bug 1927102 has been marked as a duplicate of this bug. ***
Checking some upgrade jobs from 4.7 to 4.8 today, I see a lot of "[sig-network] pods should successfully create sandboxes by other" failures, which link to this BZ. Recent examples:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377172439389179904
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377172439347236864
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377218187266887680
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377257801742553088
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377233268348293120

Reopening with higher priority. Is this the same issue being observed?
Hi Yu Qi Zhang, I just looked at the CI runs linked and none of the pods affected are static pods. Updating the title to clarify.
Hi, Thanks for clarifying. I see you've also moved this to verified. Is that associated with any PRs merging or a bump in upstream kube?
Hi Yu Qi, Yes, see the linked PRs. This BZ was previously in VERIFIED when you moved it back to NEW. I returned it to its previous status.
I've spun out a new bug 1948066 for Jerry's 4.7->4.8 update issue from comment 29.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438