Bug 1779163 - NetlinkError: (17, 'File exists') when kuryr-daemon gets restarted in a wrong moment
Summary: NetlinkError: (17, 'File exists') when kuryr-daemon gets restarted in a wrong...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
: 4.4.z
Assignee: Michał Dulko
QA Contact: GenadiC
URL:
Whiteboard:
Depends On: 1810574
Blocks: 1786298
TreeView+ depends on / blocked
 
Reported: 2019-12-03 12:46 UTC by Michał Dulko
Modified: 2020-07-06 20:47 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1786298 1810574 (view as bug list)
Environment:
Last Closed: 2020-07-06 20:47:16 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 119 None closed Bug 1779163: Nested CNI: Look for leftover ifaces to remove 2020-07-21 14:02:46 UTC
Github openshift kuryr-kubernetes pull 182 None closed [release-4.4] Bug 1779163: Nested CNI: Remove interfaces on DEL requests 2020-07-21 14:02:46 UTC
Github openshift kuryr-kubernetes pull 246 None closed [release-4.4] Bug 1779163: CNI: Confirm pods in cache before connecting 2020-07-21 14:02:46 UTC
Github openshift kuryr-kubernetes pull 280 None closed Bug 1779163: CNI: Don't wait for missing pods on DEL 2020-07-21 14:02:46 UTC
Launchpad 1854928 None None None 2019-12-03 12:47:37 UTC
OpenStack gerrit 697069 'None' MERGED Nested CNI: Look for leftover ifaces to remove 2020-07-21 14:02:46 UTC
OpenStack gerrit 702672 None MERGED Gather debug info when NetlinkError EEXIST happens 2020-07-21 14:02:45 UTC
Red Hat Product Errata RHBA-2020:2786 None None None 2020-07-06 20:47:33 UTC

Description Michał Dulko 2019-12-03 12:46:44 UTC
Description of problem:
It's possible kuryr-daemon gets restarted when the interface for the pod was created, moved to container netns, but not yet returned to kubelet. Such a CNI request will get retried by kubelet, but now with the interface already existing, kuryr-daemon will be unable to recover from the problem.

Version-Release number of selected component (if applicable):


How reproducible: Randomly


Steps to Reproduce:
1. Happens randomly when kuryr-daemon gets restarted by healthchecks. It should be possible to simulate it by injecting an interface with a name and vlan id that is supposed to be used.

Actual results:
kuryr-daemon keeps getting restarting producing following logs:
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/proc/32678/ns/net', 'CNI_PATH': u'/var/lib/c
ni/bin:/opt/multus/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=e2e-container-runtime-9428;K8S_POD_NAME=terminate-cmd-rpnaa348808-f024-11e9-93b6-801844f3e3fc;K8S_POD_INFRA_CONTAINER_ID=d33ea2aade0ae1a72f
6a1fbebb5e945b3ac0deabaf990e534252dbb6e34045cd', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'd33ea2aade0ae1a72f6a1fbebb5e945b3ac0deabaf990e534252dbb6e34045cd'}: NetlinkError: (17, 'File exists')
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service vif = self.plugin.add(params)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 51, in add
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service vifs = self._do_work(params, b_base.connect)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 138, in _do_work
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service container_id=params.CNI_CONTAINERID)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 100, in connect
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service driver.connect(vif, ifname, netns, container_id)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 56, in connect
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service iface.net_ns_fd = utils.convert_netns(netns)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service self.commit()
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/interfaces.py", line 651, in commit
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service raise newif
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service NetlinkError: (17, 'File exists')
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service

Expected results:
kuryr-daemon recovers and proceeds with port binding.


Additional info:

Comment 3 Jon Uriarte 2020-03-18 13:02:42 UTC
Verified in 4.4.0-0.nightly-2020-03-16-124852 on top of OSP 13.0.11 2020-03-10.1 puddle.

After running kubernetes Network Policy tests and conformance tests, I couldn't find the error described in the BZ
in any of the kuryr cni pods.

Comment 4 Michał Dulko 2020-03-18 14:53:41 UTC
I feel obliged to write that as the bug is really hard and random to reproduce it's possible we'll see it again and will need to reopen this. Right now we're almost sure we found the root cause though.

Comment 5 Jon Uriarte 2020-04-07 13:32:57 UTC
Moving back to Assigned as it was reproduced in 4.5:

Failed QA in OCP 4.5.0-0.nightly-2020-04-07-040639 on OSP 16 RHOS_TRUNK-16.0-RHEL-8-20200403.n.1 (with OVS).

The installer fails waiting for the cluster to initialize, and one kury-cni pod remains in crashloop:

$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-7sbvw                   1/1     Running   0          5h11m
kuryr-cni-bg2hj                   1/1     Running   57         5h23m
kuryr-cni-nlxld                   1/1     Running   4          5h43m
kuryr-cni-sc922                   1/1     Running   2          5h43m
kuryr-cni-vfb7n                   1/1     Running   2          5h43m
kuryr-cni-vsvnc                   1/1     Running   0          5h11m
kuryr-controller-9dff6456-7kjw9   1/1     Running   1          5h43m


ERROR kuryr_kubernetes.cni.binding.nested [-] Creation of pod interface failed, most likely due to duplicated VLAN id. This will probably cause kuryr-daemon to crashloop. Trying to gat
 pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested Traceback (most recent call last):
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 86, in connect
ERROR kuryr_kubernetes.cni.binding.nested     iface.net_ns_fd = utils.convert_netns(netns) 
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
ERROR kuryr_kubernetes.cni.binding.nested     self.commit()
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 650, in commit
ERROR kuryr_kubernetes.cni.binding.nested     raise newif
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 589, in commit
ERROR kuryr_kubernetes.cni.binding.nested     self.nl.link('add', **request)
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1163, in link
ERROR kuryr_kubernetes.cni.binding.nested     msg_flags=msg_flags)
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 373, in nlm_request
ERROR kuryr_kubernetes.cni.binding.nested     return tuple(self._genlm_request(*argv, **kwarg))
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 864, in nlm_request
ERROR kuryr_kubernetes.cni.binding.nested     callback=callback):
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 376, in get
ERROR kuryr_kubernetes.cni.binding.nested     return tuple(self._genlm_get(*argv, **kwarg))
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 701, in get
ERROR kuryr_kubernetes.cni.binding.nested     raise msg['header']['error']
ERROR kuryr_kubernetes.cni.binding.nested pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested 
ERROR kuryr_kubernetes.cni.binding.nested [-] List of host interfaces: {1: {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, 'qdisc': 'no 
queue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs': 65535, 'g
so_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (('::1', 128)
, ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ('0.0.0.0',)}, 'lo': {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536,
'qdisc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs
': 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (
('::1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ('0.0.0.0',)}, 2: {'address': 'fa:16:3e:ff:5e:0f', 'broadcast': 'ff:ff:ff:ff:ff:ff', 'ifname': 'ens3', 'mt
u': 1450, 'qdisc': 'fq_codel', 'txqlen': 1000, 'operstate': 'UP', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 2, 'proto_down': 0, 'gso_ma
x_segs': 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 1, 'carrier_down_count': 1, 'index': 2, 'flags': 69699, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipad
dr': (('fe80::a777:cb9f:a473:bc9a', 64), ('10.196.1.28', 16)), 'ports': (), 'family': 0, 'ifi_type': 1, 'state': 'up', 'neighbours': ('10.196.0.12', '10.196.3.50', '10.196.0.102', '10.196.0.11', '224.0.0.22', '1
0.196.0.1', '10.196.0.5', '224.0.0.18', '10.196.0.7', '10.196.1.206', '224.0.0.251')}, 'ens3': {'address': 'fa:16:3e:ff:5e:0f', 'broadcast': 'ff:ff:ff:ff:ff:ff', 'ifname': 'ens3', 'mtu': 1450, 'qdisc': 'fq_codel
', 'txqlen': 1000, 'operstate': 'UP', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 2, 'proto_down': 0, 'gso_max_segs': 65535, 'gso_max_siz
e': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 1, 'carrier_down_count': 1, 'index': 2, 'flags': 69699, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (('fe80::a777:cb9f:a473:bc9a', 64), ('10.196.1.28', 16)), 'ports': (), 'family': 0, 'ifi_type': 1, 'state': 'up', 'neighbours': ('10.196.0.12', '10.196.3.50', '10.196.0.102', '10.196.0.11', '224.0.0.22', '10.196.0.1', '10.196.0.5', '2
24.0.0.18', '10.196.0.7', '10.196.1.206', '224.0.0.251')}}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested [-] List of pod namespace interfaces: {1: {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, 'qd
isc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs':
65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': ((':
:1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ()}, 'lo': {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, '
qdisc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs'
: 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': ((
'::1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ()}}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/proc/134124/ns/net', 'CNI_PATH': '/opt/multus/
bin:/var/lib/cni/bin', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '97359d9fb8a375cb2d0dcb86baa6981366363d04f912b910a3c3b4308baf1396', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_
NAME=thanos-querier-6ccdccb756-znrtz;K8S_POD_INFRA_CONTAINER_ID=97359d9fb8a375cb2d0dcb86baa6981366363d04f912b910a3c3b4308baf1396'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')

Comment 9 rlobillo 2020-06-30 10:05:30 UTC
Verified on OCP4.4.0-0.nightly-2020-06-29-071755 over OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

Installation and NP tests run OK.

(overcloud) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                                   READY   STATUS    RESTARTS   AGE
kuryr-cni-5kmxg                        1/1     Running   1          23h
kuryr-cni-62wzm                        1/1     Running   1          23h
kuryr-cni-fzp99                        1/1     Running   1          22h
kuryr-cni-mlqsr                        1/1     Running   0          22h
kuryr-cni-ndbzp                        1/1     Running   1          23h
kuryr-cni-vt7dw                        1/1     Running   2          22h
kuryr-controller-795b7b94c5-mrmg6      1/1     Running   0          23h
kuryr-dns-admission-controller-ndznd   1/1     Running   0          23h
kuryr-dns-admission-controller-w4dgn   1/1     Running   0          23h
kuryr-dns-admission-controller-zxtdj   1/1     Running   0          23h

No NetlinkErrors observed:

$ for i in $(oc get pods -n openshift-kuryr -l app=kuryr-cni -o NAME ); do echo "# $i"; oc logs -n openshift-kuryr $i | grep "NetlinkError"; done
# pod/kuryr-cni-5kmxg
# pod/kuryr-cni-62wzm
# pod/kuryr-cni-fzp99
# pod/kuryr-cni-mlqsr
# pod/kuryr-cni-ndbzp
# pod/kuryr-cni-vt7dw

One of the CNI pods (pod/kuryr-cni-fzp99) restarted once due to cri-o race condition when clearing the namespace. However, the pod recovered by itself and functionality is not affected.

020-06-29 12:22:54.057 5209 ERROR kuryr_kubernetes.cni.binding.nested [-] Creation of pod interface failed due to VLAN ID (vlan_info={'kind': 'vlan', 'vlan_id': 3469}) conflict. Probably the CRI had not cleaned up the network namespace of deleted pods. This should not be a permanent issue but may cause restart of kuryr-cni pod.: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')

Comment 11 errata-xmlrpc 2020-07-06 20:47:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786


Note You need to log in before you can comment on or make changes to this bug.