Bug 1937991 - Seeing "NetlinkError: (17, 'File exists')" and "OSError: [Errno 2] No such file or directory" in Kuryr cni logs after upgrade from 3.11.306 -> 3.11.394
Summary: Seeing "NetlinkError: (17, 'File exists')" and "OSError: [Errno 2] No such fi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: ---
: 3.11.z
Assignee: Michał Dulko
QA Contact: Itzik Brown
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-11 21:19 UTC by Mohammad
Modified: 2021-04-21 20:21 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-21 20:21:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt (48.78 KB, text/plain)
2021-03-11 21:23 UTC, Mohammad
no flags Details
infra-node-1_kuryr-cni-ds-9nlht.txt (48.21 KB, text/plain)
2021-03-11 21:23 UTC, Mohammad
no flags Details
infra-node-0_kuryr-cni-ds-wz84t.txt (220.69 KB, text/plain)
2021-03-11 21:24 UTC, Mohammad
no flags Details
infra-node-2_kuryr-cni-ds-hs659.txt (341.62 KB, text/plain)
2021-03-11 21:24 UTC, Mohammad
no flags Details
kuryr-controller-59d5d4fd5b-jrtxh.txt (96.19 KB, text/plain)
2021-03-11 21:25 UTC, Mohammad
no flags Details
master-0_kuryr-cni-ds-xhwxr.txt (29.42 KB, text/plain)
2021-03-11 21:29 UTC, Mohammad
no flags Details
master-1_kuryr-cni-ds-7kjpc.txt (390.17 KB, text/plain)
2021-03-11 21:30 UTC, Mohammad
no flags Details
master-2_kuryr-cni-ds-p9chs.txt (371.33 KB, text/plain)
2021-03-11 21:30 UTC, Mohammad
no flags Details
app-node-0_kuryr-cni-ds-qlz67.txt (2.64 MB, text/plain)
2021-03-11 21:31 UTC, Mohammad
no flags Details
app-node-1_kuryr-cni-ds-sw9mr.txt (137.51 KB, text/plain)
2021-03-11 21:32 UTC, Mohammad
no flags Details
app-node-2_kuryr-cni-ds-pz924.txt (49.31 KB, text/plain)
2021-03-11 21:34 UTC, Mohammad
no flags Details
app-node-3_kuryr-cni-ds-4v8pz.txt (31.69 KB, text/plain)
2021-03-11 21:34 UTC, Mohammad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 482 0 None open Bug 1937991: CNI: Lookup offending interface on NetlinkError 2021-03-16 17:06:33 UTC
Red Hat Product Errata RHBA-2021:1147 0 None None None 2021-04-21 20:21:40 UTC

Description Mohammad 2021-03-11 21:19:13 UTC
Description of problem:

Seeing "NetlinkError: (17, 'File exists')" and "OSError: [Errno 2] No such file or directory" in Kuryr cni logs after upgrade from 3.11.306 -> 3.11.394

Version-Release number of selected component (if applicable): 3.11.394


How reproducible: This is very difficult to reproduce, I upgrade the cluster and checked it after a few days, and found one kuryr-cni pod in CrashLoopBackOff


Steps to Reproduce:
1. Original cluster on version 3.11.306
2. Using CRI-O (Not Docker)
3. Upgrade the cluster, deploy a few applications and wait

Actual results:

Kuryr CNI pods report these errors regularly, and one of them was in CrashLoopBackOff

Expected results:

This should never happen.

Comment 1 Mohammad 2021-03-11 21:23:07 UTC
Created attachment 1762836 [details]
infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt

Comment 2 Mohammad 2021-03-11 21:23:42 UTC
Created attachment 1762837 [details]
infra-node-1_kuryr-cni-ds-9nlht.txt

Comment 3 Mohammad 2021-03-11 21:24:18 UTC
Created attachment 1762838 [details]
infra-node-0_kuryr-cni-ds-wz84t.txt

Comment 4 Mohammad 2021-03-11 21:24:47 UTC
Created attachment 1762839 [details]
infra-node-2_kuryr-cni-ds-hs659.txt

Comment 5 Mohammad 2021-03-11 21:25:22 UTC
Created attachment 1762840 [details]
kuryr-controller-59d5d4fd5b-jrtxh.txt

Comment 6 Mohammad 2021-03-11 21:29:31 UTC
Created attachment 1762841 [details]
master-0_kuryr-cni-ds-xhwxr.txt

Comment 7 Mohammad 2021-03-11 21:30:27 UTC
Created attachment 1762842 [details]
master-1_kuryr-cni-ds-7kjpc.txt

Comment 8 Mohammad 2021-03-11 21:30:58 UTC
Created attachment 1762844 [details]
master-2_kuryr-cni-ds-p9chs.txt

Comment 9 Mohammad 2021-03-11 21:31:36 UTC
Created attachment 1762845 [details]
app-node-0_kuryr-cni-ds-qlz67.txt

Comment 10 Mohammad 2021-03-11 21:32:32 UTC
Created attachment 1762846 [details]
app-node-1_kuryr-cni-ds-sw9mr.txt

Comment 11 Mohammad 2021-03-11 21:34:14 UTC
Created attachment 1762847 [details]
app-node-2_kuryr-cni-ds-pz924.txt

Comment 12 Mohammad 2021-03-11 21:34:55 UTC
Created attachment 1762848 [details]
app-node-3_kuryr-cni-ds-4v8pz.txt

Comment 13 Mohammad 2021-03-11 21:36:37 UTC
Attached the following logs:

app-node-0_kuryr-cni-ds-qlz67.txt
app-node-1_kuryr-cni-ds-sw9mr.txt
app-node-2_kuryr-cni-ds-pz924.txt
app-node-3_kuryr-cni-ds-4v8pz.txt
infra-node-0_kuryr-cni-ds-wz84t.txt
infra-node-1_kuryr-cni-ds-9nlht.txt
infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt
infra-node-2_kuryr-cni-ds-hs659.txt
kuryr-controller-59d5d4fd5b-jrtxh.txt
master-0_kuryr-cni-ds-xhwxr.txt
master-1_kuryr-cni-ds-7kjpc.txt
master-2_kuryr-cni-ds-p9chs.txt

Notice, infra-node-1 was initially found in CrashLoopBackOff, so I captured the logs before restarting the pod, and then I captured the logs again.

Which makes this BZ difficult to replicate.

Comment 14 Mohammad 2021-03-11 21:42:03 UTC
Seeing errors like this (from infra-node-1-kuryr-cni-ds-v246c[CrashLoopBackOff].txt):

2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested Traceback (most recent call last):
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 81, in connect
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     iface.net_ns_fd = utils.convert_netns(netns)
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     self.commit()
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/interfaces.py", line 651, in commit
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested     raise newif
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested NetlinkError: (17, 'File exists')
2021-03-11 00:50:25.702 41 ERROR kuryr_kubernetes.cni.binding.nested ^[[00m
2021-03-11 00:50:25.704 41 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/proc/3981/ns/net', 'CNI_PATH': u'/opt/cni/bin/:/opt/kuryr-cni/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_NAME=prometheus-k8s-0;K8S_POD_INFRA_CONTAINER_ID=3fbae9ebfc6a8a3ef0d23c0728c0435e306edc9b35bba267be011f3c2ff23ee4', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'3fbae9ebfc6a8a3ef0d23c0728c0435e306edc9b35bba267be011f3c2ff23ee4'}: NetlinkError: (17, 'File exists')

And errors like this (from infra-node-2_kuryr-cni-ds-hs659.txt):

2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vif = self.plugin.add(params)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 74, in add
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     vifs = self._do_work(params, b_base.connect, timeout)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 179, in _do_work
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 119, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     driver.connect(vif, ifname, netns, container_id)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 61, in connect
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     with b_base.get_ipdb(netns) as c_ipdb:
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 70, in get_ipdb
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns))
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netns/nslink.py", line 167, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetNS, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/iproute/linux.py", line 127, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(RTNL_API, self).__init__(*argv, **kwarg)
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/netlink/nlsocket.py", line 301, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     super(NetlinkMixin, self).__init__()
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python2.7/site-packages/pyroute2/remote/__init__.py", line 176, in __init__
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service     raise init['error']
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service OSError: [Errno 2] No such file or directory: '/host_proc/104348/ns/net'
2021-03-10 08:30:12.153 42192 ERROR kuryr_kubernetes.cni.daemon.service ^[[00m

Comment 15 Mohammad 2021-03-11 21:44:42 UTC
Notice after restarting the cni pod on infra-node-1, the pod has restarted 30 times in the last 20 hours:

[openshift@master-2 ~]$ oc get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP                NODE                                           NOMINATED NODE
kuryr-cni-ds-4v8pz                  2/2       Running   733        3d        XX.XX.192.169   app-node-3.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-7kjpc                  2/2       Running   2          7d        XX.XX.192.152   master-1.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-9nlht                  2/2       Running   30         20h       XX.XX.192.138   infra-node-1.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-hs659                  2/2       Running   4          7d        XX.XX.192.206   infra-node-2.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-p9chs                  2/2       Running   2          7d        XX.XX.192.111   master-2.openshift-mydev2-mydc.mydomain       <none>
kuryr-cni-ds-pz924                  2/2       Running   295        7d        XX.XX.192.7     app-node-2.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-qlz67                  2/2       Running   5          7d        XX.XX.192.23    app-node-0.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-sw9mr                  2/2       Running   121        7d        XX.XX.192.70    app-node-1.openshift-mydev2-mydc.mydomain     <none>
kuryr-cni-ds-wz84t                  2/2       Running   2          7d        XX.XX.192.197   infra-node-0.openshift-mydev2-mydc.mydomain   <none>
kuryr-cni-ds-xhwxr                  2/2       Running   4          7d        XX.XX.192.104   master-0.openshift-mydev2-mydc.mydomain       <none>
kuryr-controller-59d5d4fd5b-jrtxh   1/1       Running   505        3d        XX.XX.192.206   infra-node-2.openshift-mydev2-mydc.mydomain   <none>

I'm guessing this is not preferable behaviour?

Comment 18 Mohammad 2021-04-07 01:30:03 UTC
Hello,

Any idea which release this might make it into?

Cheers,
Mohammad

Comment 20 Itzik Brown 2021-04-18 10:49:14 UTC
Check with
OSP13
OCP v3.11.420
Tempest tests passed
I haven't seen the Netlink errors in the CNI pods logs and CNI pods are ready
Cluster is up for ~12 hours

Comment 22 errata-xmlrpc 2021-04-21 20:21:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.420 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1147


Note You need to log in before you can comment on or make changes to this bug.