Description of problem:
We've noticed some orphaned crio network namespaces hanging around on a setup, which prevented Kuryr-Kubernetes from plugging interfaces for other containers due to IP/VLAN ID conflicts. This is what we saw in crio regarding the offending interfaces:

Aug 18 10:30:04 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-18 10:30:04.946146953Z" level=error msg="Error deleting network: Multus: [openshift-marketplace/community-operators-64k69]: error reading the delegates: open /var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725: no such file or directory"
Aug 18 10:30:04 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-18 10:30:04.946195165Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/community-operators-64k69]: error reading the delegates: open /var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725: no such file or directory"

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Seems to be related to some race condition; happens quite rarely.

Steps to Reproduce (more like how I got this):
1. Install OpenShift 4.6 on OpenStack with Kuryr.
2. Let it sit for ~24h.

Actual results:
One kuryr-cni pod is crashlooping and some pods are stuck in ContainerCreating. In the kuryr-cni logs you can see that it gets NetlinkError 17 - file exists - meaning that some other namespace already uses the VLAN ID that kuryr-cni tries to set. This is due to orphaned netns'es.

Expected results:
Pods should run just fine.

Additional info:
Here's the full crio log (600 MBs): https://drive.google.com/file/d/13wOLbyGECka8UAuKsETyzxqIyKU8t7g3/view?usp=sharing
I am seeing a whole lot of:

```
Aug 17 21:08:26 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-17 21:08:26.928756325Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_redhat-marketplace-s9s82_openshift-marketplace_8bdec5f4-b417-4374-90bc-cd365f22f2ee_0(5d0bdafa7e112977413c7092c9cac95ab78431f7c261a5e7c9570e82749c258b): delegateDel: error invoking DelegateDel - \"kuryr-cni\": error in getting result from DelNetwork: Looks like http://localhost:5036/delNetwork cannot be reached. Is kuryr-daemon running?: Post \"http://localhost:5036/delNetwork\": EOF" id=04efa482-f265-4348-a735-8516c8cb7c77 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
```

what does this mean?
(In reply to Peter Hunt from comment #1)
> I am seeing a whole lot of:
> ```
> Aug 17 21:08:26 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-17
> 21:08:26.928756325Z" level=error msg="Error stopping network on cleanup:
> failed to destroy network for pod sandbox
> k8s_redhat-marketplace-s9s82_openshift-marketplace_8bdec5f4-b417-4374-90bc-
> cd365f22f2ee_0(5d0bdafa7e112977413c7092c9cac95ab78431f7c261a5e7c9570e82749c25
> 8b): delegateDel: error invoking DelegateDel - \"kuryr-cni\": error in
> getting result from DelNetwork: Looks like http://localhost:5036/delNetwork
> cannot be reached. Is kuryr-daemon running?: Post
> \"http://localhost:5036/delNetwork\": EOF"
> id=04efa482-f265-4348-a735-8516c8cb7c77
> name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
> ```
>
> what does this mean?

Right, I could have explained that. Basically all CNI requests go to the kuryr-daemon process, which knows which VIFs map to which pods. If there's a persistent issue with plugging VIFs, the kuryr-daemon livenessProbe fails and the process is restarted in a crashloop (which has sometimes helped recover from weird issues). This log is from those moments - when kuryr-daemon is not running.

But the root issue here is the orphaned network namespace, which holds the interface with the conflicting VLAN ID that raises the NetlinkError in kuryr-daemon. Now that I think about it, we *may* be able to improve that by making sure that on CNI DEL we always clean up those interfaces - right now there might be cases where we don't. Anyway, I think this wasn't happening in 4.5, so there has to be a change somewhere.
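For illustration, this is roughly the kind of "always clean up on CNI DEL" behaviour I mean - a sketch only, not Kuryr's actual DEL path, assuming pyroute2 and a made-up `cleanup_pod_interface` helper:

```
import os

from pyroute2 import NetNS
from pyroute2.netlink.exceptions import NetlinkError


def cleanup_pod_interface(netns_path, ifname='eth0'):
    """Best-effort removal of the pod's VLAN interface from its netns on DEL."""
    if not os.path.exists(netns_path):
        # Namespace is already gone, so it cannot hold a conflicting VLAN ID.
        return
    ns = NetNS(os.path.basename(netns_path), flags=0)  # flags=0: never create
    try:
        idx = ns.link_lookup(ifname=ifname)
        if idx:
            ns.link('del', index=idx[0])
    except NetlinkError:
        # Interface was already removed; keep DEL idempotent.
        pass
    finally:
        ns.close()
```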
to begin finding the issue, I've opened https://github.com/cri-o/cri-o/pull/4102. That will tell us if we're leaking a namespace in cases where the pod failed to create and we're cleaning up. I am not sure that's the case we're seeing, but it's a nice to have nonetheless.

It is unclear where you're seeing that we have orphaned network namespaces. I can't find any reporting that we're failing to clean up a namespace.

One thing I just thought about is: in 4.6 we've moved the managed namespace removal from StopPodSandbox to RemovePodSandbox requests. However, the kubelet calls RemovePodSandbox much later than StopPodSandbox. Is it possible kuryr is trying to reuse VLAN IDs after a pod is stopped but before it's removed?
(In reply to Peter Hunt from comment #3)
> to begin finding the issue, I've opened
> https://github.com/cri-o/cri-o/pull/4102. That will tell us if we're leaking
> a namespace in cases where the pod failed to create and we're cleaning up. I
> am not sure that's the case we're seeing, but it's a nice to have
> nonetheless.
>
> It is unclear where you're seeing that we have orphaned network namespaces.
> I can't find any reporting that we're failing to clean up a namespace.

How about those logs?

Aug 18 10:30:04 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-18 10:30:04.946146953Z" level=error msg="Error deleting network: Multus: [openshift-marketplace/community-operators-64k69]: error reading the delegates: open /var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725: no such file or directory"
Aug 18 10:30:04 ostest-wlfr4-worker-lc6sp crio[1505]: time="2020-08-18 10:30:04.946195165Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/community-operators-64k69]: error reading the delegates: open /var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725: no such file or directory"

When debugging this my impression was that this happened for the conflicting namespaces - and those namespaces stayed there.

> One thing I just thought about is: in 4.6 we've moved the managed namespace
> removal from StopPodSandbox to RemovePodSandbox requests. However, the
> kubelet calls RemovePodSandbox much later than StopPodSandbox. Is it
> possible kuryr is trying to reuse VLAN IDs after a pod is stopped but before
> it's removed?

Wow, nice thinking! Basically with Kuryr that's driven by the K8s API (I know, that's bad), so once the pod resource is deleted from the K8s API, we can reuse the VLAN ID. So what you describe totally fits as the culprit.
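To make the suspected race concrete, here's a purely hypothetical sketch (not Kuryr's real allocator) of why a K8s-API-driven view of "free" VLAN IDs can disagree with what is actually free on the node while a sandbox is stopped but not yet removed:

```
def free_vlan_ids(all_ids, ids_used_by_pods_in_api, ids_still_held_in_netns):
    # What a purely K8s-API-driven view considers free...
    api_free = all_ids - ids_used_by_pods_in_api
    # ...versus what is actually safe to plug on this node right now.
    actually_free = api_free - ids_still_held_in_netns
    return api_free, actually_free


# Pod B was deleted from the API (so its VLAN 11 looks free), but its sandbox
# is only stopped, not removed, and its netns still holds a vlan device with ID 11.
api_free, actually_free = free_vlan_ids(
    all_ids={10, 11, 12},
    ids_used_by_pods_in_api={10},
    ids_still_held_in_netns={11},
)
assert 11 in api_free and 11 not in actually_free  # the window where EEXIST happens
```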
it is unlikely it will be fixed this sprint, bumping to next one

> When debugging this my impression was that this happened for the conflicting namespaces - and those namespaces stayed there.

AFAIK, `/var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725` is not a namespace. The network namespaces cri-o creates are of the form `/var/run/netns/$RANDOM_ID`. I am not sure what that multus file is.
(In reply to Peter Hunt from comment #5)
> it is unlikely it will be fixed this sprint, bumping to next one
>
> > When debugging this my impression was that this happened for the conflicting namespaces - and those namespaces stayed there.
>
> AFAIK,
> `/var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725` is not a
> namespace. The network namespaces cri-o creates are of the form
> `/var/run/netns/$RANDOM_ID`. I am not sure what that multus file is

I only matched that log through the pod name; nevertheless, I was able to find conflicting orphaned namespaces in /var/run/netns. It might be due to Kuryr's behaviour though.
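A quick way to spot them is something along these lines - a standalone debugging sketch (not Kuryr or cri-o code) that walks /var/run/netns and reports any namespace holding a vlan device with a given VLAN ID, assuming pyroute2:

```
import os

from pyroute2 import NetNS

NETNS_DIR = '/var/run/netns'


def find_namespaces_with_vlan(vlan_id):
    """Return (netns name, interface name) pairs that carry the given VLAN ID."""
    hits = []
    for name in os.listdir(NETNS_DIR):
        ns = NetNS(name, flags=0)  # flags=0: only open existing namespaces
        try:
            for link in ns.get_links():
                linkinfo = link.get_attr('IFLA_LINKINFO')
                if linkinfo is None or linkinfo.get_attr('IFLA_INFO_KIND') != 'vlan':
                    continue
                data = linkinfo.get_attr('IFLA_INFO_DATA')
                if data and data.get_attr('IFLA_VLAN_ID') == vlan_id:
                    hits.append((name, link.get_attr('IFLA_IFNAME')))
        finally:
            ns.close()
    return hits
```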
Seems like we are able to work around this in Kuryr, so moving it there. It's most likely not cri-o's fault, as we probably assumed an implementation detail was part of the API.
Failed on 4.6.0-0.nightly-2020-09-05-015624 over OSP16.1 RHOS-16.1-RHEL-8-20200821.n.0 with OVN.

The issue is observed during upgrade from 4.5.8 to 4.6.0-0.nightly-2020-09-05-015624

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        False         69s     Cluster version is 4.5.8

$ oc get pods -n openshift-kuryr -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                        NOMINATED NODE   READINESS GATES
kuryr-cni-6sfjr                     1/1     Running   0          39m   10.196.3.94    ostest-xcbsx-master-0       <none>           <none>
kuryr-cni-g5jk6                     1/1     Running   0          39m   10.196.0.33    ostest-xcbsx-master-2       <none>           <none>
kuryr-cni-kp8cr                     1/1     Running   1          19m   10.196.1.58    ostest-xcbsx-worker-cb5kt   <none>           <none>
kuryr-cni-qgc7s                     1/1     Running   1          19m   10.196.2.90    ostest-xcbsx-worker-8sc92   <none>           <none>
kuryr-cni-tnqbg                     1/1     Running   0          19m   10.196.2.16    ostest-xcbsx-worker-ctm65   <none>           <none>
kuryr-cni-tsj6x                     1/1     Running   0          39m   10.196.3.204   ostest-xcbsx-master-1       <none>           <none>
kuryr-controller-6f999575cd-qsxdz   1/1     Running   2          39m   10.196.3.94    ostest-xcbsx-master-0       <none>           <none>

$ oc adm upgrade --to-image="registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-05-015624" --allow-explicit-upgrade --force=true

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        True          113m    Unable to apply 4.6.0-0.nightly-2020-09-05-015624: the cluster operator monitoring is degraded

$ oc get clusteroperators
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-05-015624   True        False         False      33m
cloud-credential                           4.6.0-0.nightly-2020-09-05-015624   True        False         False      160m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-05-015624   True        False         False      146m
config-operator                            4.6.0-0.nightly-2020-09-05-015624   True        False         False      146m
console                                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      81m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      132m
dns                                        4.5.8                               True        False         False      151m
etcd                                       4.6.0-0.nightly-2020-09-05-015624   True        False         False      151m
image-registry                             4.6.0-0.nightly-2020-09-05-015624   True        False         False      133m
ingress                                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      82m
insights                                   4.6.0-0.nightly-2020-09-05-015624   True        False         False      148m
kube-apiserver                             4.6.0-0.nightly-2020-09-05-015624   True        False         False      151m
kube-controller-manager                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      150m
kube-scheduler                             4.6.0-0.nightly-2020-09-05-015624   True        False         False      150m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-05-015624   True        False         False      133m
machine-api                                4.6.0-0.nightly-2020-09-05-015624   True        False         False      144m
machine-approver                           4.6.0-0.nightly-2020-09-05-015624   True        False         False      149m
machine-config                             4.5.8                               True        False         False      144m
marketplace                                4.6.0-0.nightly-2020-09-05-015624   True        False         False      83m
monitoring                                 4.6.0-0.nightly-2020-09-05-015624   False       True          True       63m
network                                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      154m
node-tuning                                4.6.0-0.nightly-2020-09-05-015624   True        False         False      83m
openshift-apiserver                        4.6.0-0.nightly-2020-09-05-015624   True        False         False      83m
openshift-controller-manager               4.6.0-0.nightly-2020-09-05-015624   True        False         False      148m
openshift-samples                          4.6.0-0.nightly-2020-09-05-015624   True        False         False      82m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-05-015624   True        False         False      151m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-05-015624   True        False         False      151m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-05-015624   True        False         False      82m
service-ca                                 4.6.0-0.nightly-2020-09-05-015624   True        False         False      153m
storage                                    4.6.0-0.nightly-2020-09-05-015624   True        False         False      74m
oc get pods -n openshift-kuryr -o wide
NAME                                READY   STATUS             RESTARTS   AGE   IP             NODE                        NOMINATED NODE   READINESS GATES
kuryr-cni-2sxpr                     1/1     Running            0          86m   10.196.2.16    ostest-xcbsx-worker-ctm65   <none>           <none>
kuryr-cni-5l4r6                     0/1     CrashLoopBackOff   19         82m   10.196.2.90    ostest-xcbsx-worker-8sc92   <none>           <none>
kuryr-cni-rqzhs                     1/1     Running            17         84m   10.196.3.94    ostest-xcbsx-master-0       <none>           <none>
kuryr-cni-vcqff                     1/1     Running            0          81m   10.196.3.204   ostest-xcbsx-master-1       <none>           <none>
kuryr-cni-zdfr7                     1/1     Running            0          77m   10.196.0.33    ostest-xcbsx-master-2       <none>           <none>
kuryr-cni-zq8gj                     1/1     Running            0          79m   10.196.1.58    ostest-xcbsx-worker-cb5kt   <none>           <none>
kuryr-controller-76c88885d6-jgxfv   1/1     Running            0          86m   10.196.3.94    ostest-xcbsx-master-0       <none>           <none>

kuryr-cni-5l4r6 is restarting all the time due to NetlinkError: 58

2020-09-07 10:07:37.272 53 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0b8e8709-f43d-4173-ad99-41fb7afcbd39', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '438baf2b8de5daeb7efc0fce2a20dd922a224a800d211ee9e85ac502634c0a21', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-marketplace;K8S_POD_NAME=redhat-operators-thvcs;K8S_POD_INFRA_CONTAINER_ID=438baf2b8de5daeb7efc0fce2a20dd922a224a800d211ee9e85ac502634c0a21'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')

openshift-marketplace pods are hung on ContainerCreating:

$ oc get pods -A -o wide | grep -v -e Running -e Completed
NAMESPACE               NAME                                   READY   STATUS              RESTARTS   AGE     IP            NODE                        NOMINATED NODE   READINESS GATES
openshift-kuryr         kuryr-cni-5l4r6                        0/1     CrashLoopBackOff    17         70m     10.196.2.90   ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-marketplace   certified-operators-bcgvf              0/1     ContainerCreating   0          8m35s   <none>        ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-marketplace   certified-operators-frbv9              0/1     ContainerCreating   0          23m     <none>        ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-marketplace   redhat-marketplace-5tlnn               0/1     ContainerCreating   0          23m     <none>        ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-marketplace   redhat-marketplace-l4lsp               0/1     ContainerCreating   0          8m35s   <none>        ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-marketplace   redhat-operators-qxmkr                 0/1     ContainerCreating   0          8m35s   <none>        ostest-xcbsx-worker-8sc92   <none>           <none>
openshift-monitoring    prometheus-operator-79b6894bf6-5wl6z   0/2     ContainerCreating   0          69m     <none>        ostest-xcbsx-master-0       <none>           <none>

Below errors are observed in CRIO on worker ostest-xcbsx-worker-8sc92:

Sep 07 10:14:13 ostest-xcbsx-worker-8sc92 crio[1634]: 2020-09-07T10:14:13Z [error] Multus: [openshift-marketplace/redhat-marketplace-tnbmp]: error reading the delegates: open /var/lib/cni/multus/99cbc7f33a04c3fe9a350a135fde92590da00fce0828dcdcad2f7a28ddc380c2: no such file or directory

Attaching CNI and CRIO logs.
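For context on that NetlinkError, here's a minimal standalone sketch (unrelated to Kuryr's own code) of how the same (17, 'File exists') surfaces when a second VLAN sub-interface with an already-used VLAN ID is created on the same parent device - which is exactly what an orphaned namespace still holding the ID causes. It assumes pyroute2 and a parent interface named eth0:

```
import errno

from pyroute2 import IPRoute
from pyroute2.netlink.exceptions import NetlinkError

with IPRoute() as ip:
    parent = ip.link_lookup(ifname='eth0')[0]
    ip.link('add', ifname='vlan-a', kind='vlan', link=parent, vlan_id=20)
    try:
        # Same VLAN ID on the same parent, even under a different name:
        ip.link('add', ifname='vlan-b', kind='vlan', link=parent, vlan_id=20)
    except NetlinkError as e:
        assert e.code == errno.EEXIST  # pyroute2 reports (17, 'File exists')
```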
Created attachment 1713947 [details]
kuryr-cni logs crashing due to NetlinkError after OCP upgrade
Created attachment 1713950 [details]
output of journalctl | grep crio > crio_logs.txt in the affected worker
Created attachment 1714026 [details]
occurrence after 24h running

Hit the issue after keeping the cluster running for 24 hours. Attaching the logs under kuryr_cni_Crashloopback_after24h_4.6.0-0.nightly-2020-09-05-015624.tgz
(In reply to Peter Hunt from comment #5)
> it is unlikely it will be fixed this sprint, bumping to next one
>
> > When debugging this my impression was that this happened for the conflicting namespaces - and those namespaces stayed there.
>
> AFAIK,
> `/var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725` is not a
> namespace. The network namespaces cri-o creates are of the form
> `/var/run/netns/$RANDOM_ID`. I am not sure what that multus file is

For what it's worth, `/var/lib/cni/multus/904b0f1e18d0b07b9f58dfb1cb8e8ddc80690edefbd87e1f0275363377a6b725` is a cache file created by Multus via libcni; Multus uses this information to perform a delete.
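For anyone hitting the "error reading the delegates" messages above, a quick way to peek at one of those per-container cache files is something like the sketch below (assuming, as these caches typically are, that the file is JSON describing the delegate networks attached on ADD - which is why a missing file breaks the later DEL):

```
import json
import sys

# Usage: python3 dump_multus_cache.py /var/lib/cni/multus/<container id>
path = sys.argv[1]
try:
    with open(path) as f:
        print(json.dumps(json.load(f), indent=2))
except FileNotFoundError:
    print('cache file is gone - Multus cannot reconstruct the delegates for DEL')
```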
Saw it with 4.6.0-0.nightly-2020-09-22-213802
The culprit of this may have been related to https://bugzilla.redhat.com/show_bug.cgi?id=1869606, so lowering the severity to medium, as we are not hitting this at the moment with the fix for that one in place. Feel free to raise it again if you hit it again.
Issue observed on 4.6.0-0.nightly-2020-10-02-001427, which includes the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1884139, so that fix is not helping.
Verified on 4.7.0-0.nightly-2020-10-24-042953 over OSP16.1 with OVN-Octavia (RHOS-16.1-RHEL-8-20201007.n.0)

After 2 days of cluster up, no kuryr-cni pod went to CrashLoopBackOff:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-24-042953   True        False         2d7h    Cluster version is 4.7.0-0.nightly-2020-10-24-042953

$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-6wbpb                     1/1     Running   2          2d6h
kuryr-cni-8rnvz                     1/1     Running   8          2d6h
kuryr-cni-cw6xq                     1/1     Running   0          2d6h
kuryr-cni-hjnzr                     1/1     Running   0          2d6h
kuryr-cni-rx7tn                     1/1     Running   2          2d6h
kuryr-cni-sfw8h                     1/1     Running   0          2d6h
kuryr-controller-7ccfc75cf4-5qzpt   1/1     Running   0          2d6h

NP and conformance tests run with expected results.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633