Bug 1886922 - [sig-network] pods should successfully create sandboxes by getting pod
Summary: [sig-network] pods should successfully create sandboxes by getting pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1891519 1895534
Depends On:
Blocks: 1908378 1929674
 
Reported: 2020-10-09 18:09 UTC by Jing Zhang
Modified: 2021-03-05 17:30 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1908378
Environment:
[sig-network] pods should successfully create sandboxes by getting pod
Last Closed: 2021-02-24 15:24:43 UTC
Target Upstream Version:
Embargoed:




Links
Github ovn-org/ovn-kubernetes pull 1831 (closed): CNI: quit waiting for pod flows for obsoleted ADD (last updated 2021-02-17 09:22:41 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:25:21 UTC)

Description Jing Zhang 2020-10-09 18:09:54 UTC
test:
[sig-network] pods should successfully create sandboxes by getting pod 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D+pods+should+successfully+create+sandboxes+by+getting+pod


FIXME: Replace this paragraph with a particular job URI from the search results to ground discussion.  A given test may fail for several reasons, and this bug should be scoped to one of those reasons.  Ideally you'd pick a job showing the most-common reason, but since that's hard to determine, you may also choose to pick a job at random.  Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log

Comment 1 Jan Chaloupka 2020-10-12 12:27:34 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci/1315588345421631488

kubelet logs from masters contain:
```
Oct 12 11:06:54.725413 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: I1012 11:06:54.725392    1536 kubelet.go:1680] Creating a mirror pod for static pod "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(edb4583065f51a0ee864e663b2289f3a)"
Oct 12 11:06:54.737018 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: E1012 11:06:54.736316    1536 kubelet.go:1682] Failed creating a mirror pod for "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(edb4583065f51a0ee864e663b2289f3a)": pods "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0" is forbidden: a mirror pod may not reference service accounts
...
Oct 12 11:10:40.771726 ci-op-smi1zdxn-a7421-k5w9w-master-0 hyperkube[1536]: E1012 11:10:40.771514    1536 pod_workers.go:191] Error syncing pod f9f51de44362b05394788110d92dcade ("recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)"), skipping: failed to "CreatePodSandbox" for "recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)" with CreatePodSandboxError: "CreatePodSandbox for pod \"recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra(f9f51de44362b05394788110d92dcade)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0_openshift-infra_f9f51de44362b05394788110d92dcade_0(ba794b471acdc719df6c91675df75d1a70ebebaf90e6b6513873359b50769c13): Multus: [openshift-infra/recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0]: error getting pod: pods \"recyler-pod-ci-op-smi1zdxn-a7421-k5w9w-master-0\" not found"
...
```

Comment 2 Jan Chaloupka 2020-10-12 12:35:38 UTC
There are other "similar" tests failing which might be caused by the same underlying issue:
- [sig-network] pods should successfully create sandboxes by other
- [sig-network] pods should successfully create sandboxes by reading container
- [sig-network] pods should successfully create sandboxes by writing network status

Comment 3 Jan Chaloupka 2020-10-12 12:38:13 UTC
- [sig-network] pods should successfully create sandboxes by writing child

Comment 4 Aniket Bhat 2020-10-12 14:41:21 UTC
Assigning to Tim.

Comment 6 Tim Rozet 2020-10-12 19:25:33 UTC
There looks to be a problem with the test itself: it categorizes "error getting pod: pods "pod1" not found" as a failure, when really that usually happens because the pod was deleted right after it was created, so sandbox creation can't find it anymore. An example here:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/301/pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry/1315650856250511360

ns/e2e-gc-8508 pod/pod1 node/ip-10-0-186-62.us-west-2.compute.internal - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod1_e2e-gc-8508_ef0dae73-c25e-4216-a1ab-3a0d06f1c5d4_0(ef9f1f04d668e4b76df25d24b6816f3dce2d44ebef00ff523caf2623e092af3e): Multus: [e2e-gc-8508/pod1]: error getting pod: pods "pod1" not found

I1012 14:33:05.508991       1 scheduler.go:597] "Successfully bound pod to node" pod="e2e-gc-8508/pod1" node="ip-10-0-186-62.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=3

ovnkube adds the port, then sees a delete 1 second later:
I1012 14:33:06.305650       1 pods.go:78] Deleting pod: e2e-gc-8508/pod1

We should fix the test to remove false positives, then recheck the CI afterwards. David, can you create a bug/fix for this test and link it to this bug please?
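As an illustration only (hypothetical helper, not the actual openshift/origin test code), a filter like the following could treat the "error getting pod: ... not found" message as the benign create-then-delete race rather than a real sandbox failure:

```
package sandbox

import (
	"regexp"

	corev1 "k8s.io/api/core/v1"
)

// podNotFoundRace matches the Multus error emitted when the pod was deleted
// between scheduling and CNI ADD, e.g.
//   error getting pod: pods "pod1" not found
var podNotFoundRace = regexp.MustCompile(`error getting pod: pods ".*" not found`)

// isBenignSandboxFailure reports whether a FailedCreatePodSandBox event
// should be ignored by the test because it reflects a pod that was deleted
// right after creation, not a networking problem.
func isBenignSandboxFailure(ev *corev1.Event) bool {
	if ev.Reason != "FailedCreatePodSandBox" {
		return false
	}
	return podNotFoundRace.MatchString(ev.Message)
}
```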

Comment 7 Alexander Constantinescu 2020-10-26 15:33:00 UTC
*** Bug 1891519 has been marked as a duplicate of this bug. ***

Comment 9 Tim Rozet 2020-11-09 21:17:07 UTC
David updated the test case:
https://github.com/openshift/origin/commit/06a30a752a2362ca60ae17d584565d3ce92e14e0#diff-a8882c52a9bd2ef6a4b8294fb52ec3342d37aa809581a438c3fabd02bcc63467


and now I see some more legit failures that need to be investigated:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/317/pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry/1325860268961435648/

23.80 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webserver-57656d96fb-8ckmc_e2e-deployment-9222_7312d362-35aa-404a-ab7a-dca63ab41d1e_0(fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68): [e2e-deployment-9222/webserver-57656d96fb-8ckmc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-deployment-9222/webserver-57656d96fb-8ckmc] failed to configure pod interface: timed out waiting for pod flows for pod: webserver-57656d96fb-8ckmc, error: timed out waiting for the condition

I see in ovnkube-master:
I1109 19:00:35.952244       1 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'
I1109 19:00:36.082358       1 pods.go:272] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] addLogicalPort took 90.186464ms
I1109 19:00:40.313403       1 pods.go:78] Deleting pod: e2e-deployment-9222/webserver-57656d96fb-8ckmc
I1109 19:00:40.313416       1 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'


The pod was added and deleted in less than 5 seconds, but we see on the CNI side:

I1109 19:00:39.783451    3500 cniserver.go:147] Waiting for ADD result for pod e2e-deployment-9222/webserver-57656d96fb-8ckmc
I1109 19:00:39.783475    3500 cni.go:147] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] dispatching pod network request &{ADD e2e-deployment-9222 webserver-57656d96fb-8ckmc fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68 /var/run/netns/108d17e4-4686-4d3e-8bac-b713a047cc5e eth0 0xc00067ff00}
I1109 19:00:41.585640    3500 informer.go:294] Successfully synced 'e2e-deployment-9222/webserver-57656d96fb-8ckmc'

I1109 19:01:01.524480    3500 cni.go:157] [e2e-deployment-9222/webserver-85f6b8449-f9zxm] CNI request &{DEL e2e-deployment-9222 webserver-85f6b8449-f9zxm ed235ef6f42896a855c0bee6467a988b0775f0d28f089440b628fad8974a1606 /var/run/netns/1be4344f-79f1-4493-ad2f-8d2f092c052a eth0 0xc000646900}, result "", err <nil>
I1109 19:01:01.813063    3500 cni.go:157] [e2e-deployment-9222/webserver-57656d96fb-8ckmc] CNI request &{ADD e2e-deployment-9222 webserver-57656d96fb-8ckmc fa2324e91cfb4a05e3773d9ec7db826fe4781574eca6879c02bf88833317df68 /var/run/netns/108d17e4-4686-4d3e-8bac-b713a047cc5e eth0 0xc00067ff00}, result "", err failed to configure pod interface: timed out waiting for pod flows for pod: webserver-57656d96fb-8ckmc, error: timed out waiting for the condition

For some reason the delete came in 23 seconds later...investigating

Comment 10 Tim Rozet 2020-11-11 14:58:35 UTC
So it turns out that in OVN we can have a series of events like this:
1. pod add happens (ovnkube-node is waiting for annotation from ovnkube-master)
2. ovnkube-master annotates pod
3. ovnkube-node begins cmdAdd CNI
4. pod delete happens (ovnkube-master deletes the LSP)
5. ovnkube-node in cmdAdd is stuck waiting 20 seconds for the OpenFlow flows to show up for the now-deleted pod

This results in a pod-add failure time of over 20 seconds. Really, ovnkube-node should be able to detect that the pod was deleted and cancel the wait for flows. Similarly, we also wait for the annotation for some time; we could cancel that wait as well.
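A minimal sketch of the cancellation idea, assuming a hypothetical checkFlows probe and inflightAdds map (this is not the actual ovn-kubernetes change; see PR 1831 in the Links section for the real fix):

```
package cni

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPodFlows polls until checkFlows reports that the pod's OpenFlow
// flows are installed, or until ctx is cancelled (pod deleted, obsoleted
// ADD, or overall timeout). checkFlows is a hypothetical probe function.
func waitForPodFlows(ctx context.Context, podName string, checkFlows func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, 20*time.Second)
	defer cancel()

	if err := wait.PollImmediateUntil(200*time.Millisecond, checkFlows, ctx.Done()); err != nil {
		return fmt.Errorf("canceled or timed out waiting for pod flows for pod %s: %v", podName, err)
	}
	return nil
}

// On a pod delete event (or a newer ADD for the same pod), the handler would
// call the cancel func stored for any in-flight ADD of that pod, e.g.:
//
//	if cancel, ok := inflightAdds[podKey]; ok {
//	    cancel() // unblocks waitForPodFlows immediately
//	}
```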

Comment 11 Fabiano Franz 2020-11-11 16:54:57 UTC
Correlated

- [sig-network] pods should successfully create sandboxes by not timing out

Comment 12 Antonio Ojea 2020-11-13 13:54:25 UTC
*** Bug 1895534 has been marked as a duplicate of this bug. ***

Comment 14 Tim Rozet 2020-12-16 02:09:47 UTC
The new failures look like they have nothing to do with ovn-kubernetes. I can see this happens:
Dec 12 05:55:13.781773 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.781182    1683 kubelet.go:1834] SyncLoop (ADD, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:13.807757 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.801436    1683 kubelet.go:1841] SyncLoop (UPDATE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"

Dec 12 05:55:13.862733 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:13.862183    1683 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")

Dec 12 05:55:14.045725 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.045445    1683 kubelet.go:1850] SyncLoop (DELETE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.058721 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.058498    1683 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")
Dec 12 05:55:14.090579 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.089034    1683 kubelet.go:1850] SyncLoop (DELETE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.096983 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.096951    1683 kubelet.go:1844] SyncLoop (REMOVE, "api"): "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)"
Dec 12 05:55:14.100720 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.097149    1683 kubelet.go:2038] Failed to delete pod "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)", err: pod not found
Dec 12 05:55:14.299745 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.299337    1683 operation_generator.go:663] MountVolume.SetUp succeeded for volume "default-token-dw6sj" (UniqueName: "kubernetes.io/secret/756b44df-f705-445a-9e24-bf1e126b6c5a-default-token-dw6sj") pod "pod-submit-status-2-4" (UID: "756b44df-f705-445a-9e24-bf1e126b6c5a")
Dec 12 05:55:14.584995 ip-10-0-168-106 hyperkube[1683]: I1212 05:55:14.581377    1683 kuberuntime_manager.go:425] No sandbox for pod "pod-submit-status-2-4_e2e-pods-5892(756b44df-f705-445a-9e24-bf1e126b6c5a)" can be found. Need to start a new one
Dec 12 05:55:14.590742 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:14.589448461Z" level=info msg="Running pod sandbox: e2e-pods-5892/pod-submit-status-2-4/POD" id=9eb340e4-0a7b-419b-a2b0-025802037287 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 12 05:55:17.936748 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:17.934805119Z" level=info msg="Got pod network &{Name:pod-submit-status-2-4 Namespace:e2e-pods-5892 ID:f42209c3922dcdfe5692971a9ac3fc47532efb7690b2fdd9b4e061a1efd3e2ef NetNS:/var/run/netns/cd7e6f4b-1650-45ab-ba60-3126086c6a32 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Dec 12 05:55:18.009733 ip-10-0-168-106 crio[1648]: time="2020-12-12 05:55:18.005337584Z" level=error msg="Error adding network: Multus: [e2e-pods-5892/pod-submit-status-2-4]: error getting pod: pods \"pod-submit-status-2-4\" not found"

tl;dr: the pod is deleted right after it is created, and the kube runtime still tries to create the sandbox anyway. At that point crio calls multus, multus cannot find the pod, and it errors out. ovn-kubernetes never even gets to CNI ADD for the pod. This looks like either expected behavior (in which case the test needs to be tweaked) or a bug in kubelet.
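As a hedged sketch only (hypothetical helper, not actual kubelet code), the kind of guard that would avoid this race on the node side is to re-check that the pod is still wanted immediately before creating the sandbox:

```
package kuberuntime

import (
	corev1 "k8s.io/api/core/v1"
)

// podStillWanted illustrates the race in the logs above: kubelet receives
// ADD, then DELETE/REMOVE for the same pod within a few hundred
// milliseconds, yet still calls RunPodSandbox. Re-checking the pod's
// desired state right before sandbox creation would let the sync loop skip
// the CNI ADD that Multus is bound to fail. isDeleted is a hypothetical
// lookup against the pod manager's view of deleted pod UIDs.
func podStillWanted(pod *corev1.Pod, isDeleted func(uid string) bool) bool {
	if pod == nil || pod.DeletionTimestamp != nil {
		return false
	}
	return !isDeleted(string(pod.UID))
}
```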

Comment 15 Tim Rozet 2020-12-16 15:04:11 UTC
From the ovn-kubernetes perspective this bug is resolved: outstanding CNI ADDs for older sandboxes should now be canceled when a new CNI ADD arrives for the same pod. Cloned this bug to the node team to investigate the kubelet side of these failures:
https://bugzilla.redhat.com/show_bug.cgi?id=1908378

Comment 26 errata-xmlrpc 2021-02-24 15:24:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

