Bug 1956309

Summary: FailedCreatePodSandBox failed to create container veth: failed to make veth pair: operation not permitted
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: NetworkingAssignee: Surya Seetharaman <surya>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WORKSFORME Docs Contact:
Severity: high    
Priority: high CC: aconstan, anbhat, aos-bugs, obulatov, pehunt, pmuller
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: TechnicalReleaseBlocker
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-25 06:34:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Clayton Coleman 2021-05-03 12:56:02 UTC
: [sig-network] pods should successfully create sandboxes by other expand_less 	0s
1 failures to create the sandbox

ns/e2e-persistent-local-volumes-test-7134 pod/pod-a5752853-17d6-40c3-b890-d25af8acbfa0 node/ip-10-0-190-119.ec2.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-a5752853-17d6-40c3-b890-d25af8acbfa0_e2e-persistent-local-volumes-test-7134_4b3ff219-de7b-4336-abf2-f06313cdac06_0(fa8eeac03896cdf03c4f067c917397e030add4150d78a9629491987cbf262ee6): [e2e-persistent-local-volumes-test-7134/pod-a5752853-17d6-40c3-b890-d25af8acbfa0:openshift-sdn]: error adding container to network "openshift-sdn": failed to create container veth: failed to make veth pair: operation not permitted

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1388595670943797248

About 2% of CI runs have this failure in it in 4.8/master which is fairly high.

https://search.ci.openshift.org/?search=failed+to+make+veth+pair%3A+operation+not+permitted&maxAge=48h&context=1&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Peter Hunt 2021-05-03 19:14:37 UTC
I'm not sure yet if this is a blocker, I've set it on the wrong bug, as well as attached the wrong PR and set the wrong status, sorry for the noise

Comment 3 Peter Hunt 2021-05-04 18:51:29 UTC
I am naively guessing it's coming from https://github.com/containernetworking/plugins/blob/e27c48b391539f5536918b9c379c59ef2793cb0d/pkg/ip/link_linux.go#L36
though it's definitely coming from the sdn plugin somehow

Tossing over to networking

Comment 5 Surya Seetharaman 2021-05-14 15:05:39 UTC
Investigating the CI failure mentioned in #1: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1388595670943797248

SDN logs:

I0501 21:48:43.550057    2637 pod.go:544] CNI_DEL e2e-persistent-local-volumes-test-7134/pod-a5752853-17d6-40c3-b890-d25af8acbfa0
I0501 21:48:43.603407    2637 pod.go:544] CNI_DEL e2e-persistent-local-volumes-test-7134/pod-a5752853-17d6-40c3-b890-d25af8acbfa0
I0501 21:48:54.363427    2637 pod.go:508] CNI_ADD e2e-persistent-local-volumes-test-7134/pod-a5752853-17d6-40c3-b890-d25af8acbfa0 got IP 10.130.2.43, ofport 44
I0501 21:48:56.173815    2637 proxier.go:824] "syncProxyRules complete" elapsed="186.024749ms"
I0501 21:48:56.244799    2637 proxier.go:854] "Syncing iptables rules"
I0501 21:48:56.417458    2637 proxier.go:824] "syncProxyRules complete" elapsed="172.746104ms"
I0501 21:48:57.431283    2637 pod.go:544] CNI_DEL e2e-persistent-local-volumes-test-7134/pod-a5752853-17d6-40c3-b890-d25af8acbfa0

These suggest, the CNI_DEL was issued before the CNI_ADD request could get processed, possibly a creation timeout ? So DEL happens before ADD. and then the add fails.

Multus logs are not useful because they are not on debug mode.


Also looking at other CI failures: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.8/1393181488270282752

 [sig-network] pods should successfully create sandboxes by other expand_less	0s
1 failures to create the sandbox

ns/e2e-configmap-2364 pod/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc node/ci-op-xby2gqnn-3283a-2kc76-worker-c-s2wzc - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc_e2e-configmap-2364_d9fa8ca7-92b9-43fa-a537-65ac881283a8_0(5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1): [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] failed to configure pod interface: failed to make veth pair: operation not permitted
'

This is deff not a SDN bug.

Comment 6 Surya Seetharaman 2021-05-17 10:21:38 UTC
Looking into the OVN-k8s logs, things look different:

I0514 13:11:41.058777    3165 cni.go:213] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] ADD starting CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1]
I0514 13:11:41.072232    3165 cni.go:223] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] ADD finished CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1], result "", err failed to configure pod interface: failed to make veth pair: operation not permitted
I0514 13:11:41.094925    3165 cni.go:213] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] DEL starting CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1]
I0514 13:11:41.130893    3165 healthcheck.go:142] Opening healthcheck "openshift-ingress/router-e2e-test-router-h2spec-ghwdb" on port 30317
I0514 13:11:41.137232    3165 cni.go:223] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] DEL finished CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1], result "", err <nil>
I0514 13:11:41.259022    3165 cni.go:213] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] DEL starting CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1]
I0514 13:11:41.286788    3165 cni.go:223] [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1] DEL finished CNI request [e2e-configmap-2364/pod-configmaps-7a4d568c-8859-4cea-a14d-fab2aaf5afbc 5a670dfca585f2d4cb206655ead549c491584c1017c47c8787786f5ece3e7db1], result "", err <nil>

It seems like CNI_ADD failed with that error and only then the delete is triggered ? Maybe for the SDN run, since I cannot see the initial sandbox request, maybe the logs got rotated and I missed the first failure...

Comment 7 Surya Seetharaman 2021-05-17 10:41:56 UTC
Seems to be similar/linked to https://github.com/containers/buildah/issues/2628?

Comment 8 Surya Seetharaman 2021-05-21 11:48:30 UTC
https://search.ci.openshift.org/?search=failed+to+make+veth+pair%3A+operation+not+permitted&maxAge=48h&context=1&type=bug%2Bjunit&name=4%5C.8%7Cmaster&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The error seems to have gone away magically ? I don't know what changed. I'll keep an eye out for the next 24hrs. If no more failures come I'll close the bug.

Comment 9 Surya Seetharaman 2021-05-25 06:34:10 UTC
Unless I am missing something, the issue seems to have gotten resolved on its own (based on the search results)? Perhaps it was a kernel version on the CI nodes, not sure. However the failures seem to have gone away. Closing this bug for now. Feel free to (re)open a new one if it still persists.