Description of problem:

Seeing "NetlinkError: (17, 'File exists')" and "OSError: [Errno 2] No such file or directory" in the Kuryr CNI logs after upgrading from 3.11.306 to 3.11.394.

Version-Release number of selected component (if applicable):
3.11.394

How reproducible:
Very difficult to reproduce. I upgraded the cluster, checked it after a few days, and found one kuryr-cni pod in CrashLoopBackOff.

Steps to Reproduce:
1. Start from a cluster on version 3.11.306, using CRI-O (not Docker).
2. Upgrade the cluster.
3. Deploy a few applications and wait.

Actual results:
The Kuryr CNI pods report these errors regularly, and one of them entered CrashLoopBackOff.

Expected results:
These errors should not occur.
Created attachment 1762836 [details] infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt
Created attachment 1762837 [details] infra-node-1_kuryr-cni-ds-9nlht.txt
Created attachment 1762838 [details] infra-node-0_kuryr-cni-ds-wz84t.txt
Created attachment 1762839 [details] infra-node-2_kuryr-cni-ds-hs659.txt
Created attachment 1762840 [details] kuryr-controller-59d5d4fd5b-jrtxh.txt
Created attachment 1762841 [details] master-0_kuryr-cni-ds-xhwxr.txt
Created attachment 1762842 [details] master-1_kuryr-cni-ds-7kjpc.txt
Created attachment 1762844 [details] master-2_kuryr-cni-ds-p9chs.txt
Created attachment 1762845 [details] app-node-0_kuryr-cni-ds-qlz67.txt
Created attachment 1762846 [details] app-node-1_kuryr-cni-ds-sw9mr.txt
Created attachment 1762847 [details] app-node-2_kuryr-cni-ds-pz924.txt
Created attachment 1762848 [details] app-node-3_kuryr-cni-ds-4v8pz.txt
Attached the following logs:

app-node-0_kuryr-cni-ds-qlz67.txt
app-node-1_kuryr-cni-ds-sw9mr.txt
app-node-2_kuryr-cni-ds-pz924.txt
app-node-3_kuryr-cni-ds-4v8pz.txt
infra-node-0_kuryr-cni-ds-wz84t.txt
infra-node-1_kuryr-cni-ds-9nlht.txt
infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt
infra-node-2_kuryr-cni-ds-hs659.txt
kuryr-controller-59d5d4fd5b-jrtxh.txt
master-0_kuryr-cni-ds-xhwxr.txt
master-1_kuryr-cni-ds-7kjpc.txt
master-2_kuryr-cni-ds-p9chs.txt

Note: infra-node-1 was initially found in CrashLoopBackOff, so I captured its logs before restarting the pod, and then captured them again afterwards. This is what makes this BZ difficult to replicate.
Seeing errors like this (from infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt):

2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested Traceback (most recent call last):
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 81, in connect
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     iface.net_ns_fd = utils.convert_netns(netns)
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     self.commit()
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/interfaces.py", line 651, in commit
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     raise newif
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested NetlinkError: (17, 'File exists')
2021-03-11 00:50:25.704 41 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request.
CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/proc/3981/ns/net', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_NAME=prometheus-k8s-0;K8S_POD_INFRA_CONTAINER_ID=3fbae9ebfc6a8a3ef0d23c0728c0435e306edc9b35bba267be011f3c2ff23ee4', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'3fbae9ebfc6a8a3ef0d23c0728c0435e306edc9b35bba267be011f3c2ff23ee4'}: NetlinkError: (17, 'File exists')

And errors like this (from infra-node-2_kuryr-cni-ds-hs659.txt):

2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vif = self.plugin.add(params)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 74, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vifs = self._do_work(params, b_base.connect, timeout)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 179, in _do_work
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 119, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     driver.connect(vif, ifname, netns, container_id)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 61, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     with b_base.get_ipdb(netns) as c_ipdb:
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 70, in get_ipdb
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns))
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netns/nslink.py", line 167, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetNS, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/iproute/linux.py", line 127, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(RTNL_API, self).__init__(*argv, **kwarg)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netlink/nlsocket.py", line 301, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetlinkMixin, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/remote/__init__.py", line 176, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     raise init['error']
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service OSError: [Errno 2] No such file or directory: '/host_proc/104348/ns/net'
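The two tracebacks point at two distinct races during an ADD request: the pod interface already exists in the target namespace (NetlinkError 17, i.e. EEXIST, likely left over from an earlier interrupted ADD), and the target netns path under /host_proc has vanished because the container was already deleted (OSError ENOENT). A minimal standard-library sketch of how a CNI binding step could tolerate both races; the function name, return values, and error mapping are illustrative only, not Kuryr's actual code:

```python
import errno
import os


def connect_iface(netns_path, create_iface):
    """Hypothetical sketch: tolerate the two races seen in the logs above.

    `create_iface` stands in for the netlink operation that creates/moves
    the pod interface; it may raise OSError with errno EEXIST.
    """
    # Race (b): /proc/<pid>/ns/net may be gone by the time the ADD request
    # is processed (container deleted mid-request) -- fail fast and cleanly
    # instead of letting the netlink layer raise a bare ENOENT.
    if not os.path.exists(netns_path):
        raise RuntimeError(
            "netns %s vanished; container likely deleted" % netns_path)
    try:
        create_iface()
    except OSError as exc:
        # Race (a): a previous, interrupted ADD already created the
        # interface -- treat EEXIST as success so retries are idempotent.
        if exc.errno == errno.EEXIST:
            return "already-bound"
        raise
    return "bound"
```

The key design point is idempotency: a CNI ADD that is retried after a partial failure should converge rather than crash the daemon.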
Note that after restarting the CNI pod on infra-node-1, it has restarted 30 times in the last 20 hours:

[openshift@master-2 ~]$ oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE                                          NOMINATED NODE
kuryr-cni-ds-4v8pz                  2/2     Running   733        3d    XX.XX.192.169   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-7kjpc                  2/2     Running   2          7d    XX.XX.192.152   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-9nlht                  2/2     Running   30         20h   XX.XX.192.138   infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-hs659                  2/2     Running   4          7d    XX.XX.192.206   infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-p9chs                  2/2     Running   2          7d    XX.XX.192.111   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-pz924                  2/2     Running   295        7d    XX.XX.192.7     app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-qlz67                  2/2     Running   5          7d    XX.XX.192.23    app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-sw9mr                  2/2     Running   121        7d    XX.XX.192.70    app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-wz84t                  2/2     Running   2          7d    XX.XX.192.197   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-xhwxr                  2/2     Running   4          7d    XX.XX.192.104   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-controller-59d5d4fd5b-jrtxh   1/1     Running   505        3d    XX.XX.192.206   infra-node-2.openshift-mydev2-mydc.mydomain   <none>

I'm guessing this is not the expected behaviour?
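The restart counts in the listing above are the quickest signal of which pods are churning. A small stand-alone sketch that screens plain `oc get pods` output for abnormal RESTARTS values (the threshold of 10 is an arbitrary choice for illustration):

```python
def flag_restarting_pods(oc_output, threshold=10):
    """Return (name, restarts) pairs for pods whose RESTARTS exceeds `threshold`.

    Expects plain `oc get pods` / `kubectl get pods` text output, where the
    columns are NAME READY STATUS RESTARTS AGE ...
    """
    flagged = []
    lines = oc_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue  # ignore malformed/blank lines
        name, restarts = cols[0], cols[3]
        if restarts.isdigit() and int(restarts) > threshold:
            flagged.append((name, int(restarts)))
    return flagged
```

Fed the table above, this would flag kuryr-cni-ds-4v8pz (733), kuryr-cni-ds-9nlht (30), kuryr-cni-ds-pz924 (295), kuryr-cni-ds-sw9mr (121), and kuryr-controller-59d5d4fd5b-jrtxh (505).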
Hello, Any idea which release this might make it into? Cheers, Mohammad
Checked with OSP13 and OCP v3.11.420: Tempest tests passed, I haven't seen the Netlink errors in the CNI pod logs, and the CNI pods are ready. The cluster has been up for ~12 hours.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.420 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1147