Bug 1786298 - NetlinkError: (17, 'File exists') when kuryr-daemon gets restarted in a wrong moment
Summary: NetlinkError: (17, 'File exists') when kuryr-daemon gets restarted in a wrong...
Keywords:
Status: MODIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
: 4.3.z
Assignee: Michał Dulko
QA Contact: GenadiC
URL:
Whiteboard:
Depends On: 1779163
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-12-24 09:45 UTC by Luis Tomas Bolivar
Modified: 2020-07-10 07:01 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1779163
Environment:
Last Closed: 2020-04-01 19:10:45 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 120 None closed [release-4.3] Bug 1786298: Nested CNI: Look for leftover ifaces to remove 2020-07-09 06:18:51 UTC
Github openshift kuryr-kubernetes pull 183 None closed [release-4.3] Bug 1786298: Nested CNI: Remove interfaces on DEL requests 2020-07-09 06:18:51 UTC
Github openshift kuryr-kubernetes pull 292 None closed Bug 1786298: Bunch of kuryr-daemon NetlinkError fixes 2020-07-07 06:47:12 UTC
Red Hat Product Errata RHBA-2020:0930 None None None 2020-04-01 19:12:20 UTC

Description Luis Tomas Bolivar 2019-12-24 09:45:23 UTC
+++ This bug was initially created as a clone of Bug #1779163 +++

Description of problem:
It's possible kuryr-daemon gets restarted when the interface for the pod was created, moved to container netns, but not yet returned to kubelet. Such a CNI request will get retried by kubelet, but now with the interface already existing, kuryr-daemon will be unable to recover from the problem.

Version-Release number of selected component (if applicable):


How reproducible: Randomly


Steps to Reproduce:
1. Happens randomly when kuryr-daemon gets restarted by healthchecks. It should be possible to simulate it by injecting an interface with a name and vlan id that is supposed to be used.

Actual results:
kuryr-daemon keeps getting restarting producing following logs:
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': u'eth0', 'CNI_NETNS': u'/proc/32678/ns/net', 'CNI_PATH': u'/var/lib/c
ni/bin:/opt/multus/bin', 'CNI_ARGS': u'IgnoreUnknown=1;K8S_POD_NAMESPACE=e2e-container-runtime-9428;K8S_POD_NAME=terminate-cmd-rpnaa348808-f024-11e9-93b6-801844f3e3fc;K8S_POD_INFRA_CONTAINER_ID=d33ea2aade0ae1a72f
6a1fbebb5e945b3ac0deabaf990e534252dbb6e34045cd', 'CNI_COMMAND': u'ADD', 'CNI_CONTAINERID': u'd33ea2aade0ae1a72f6a1fbebb5e945b3ac0deabaf990e534252dbb6e34045cd'}: NetlinkError: (17, 'File exists')
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service vif = self.plugin.add(params)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 51, in add
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service vifs = self._do_work(params, b_base.connect)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 138, in _do_work
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service container_id=params.CNI_CONTAINERID)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/base.py", line 100, in connect
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service driver.connect(vif, ifname, netns, container_id)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 56, in connect
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service iface.net_ns_fd = utils.convert_netns(netns)
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service self.commit()
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python2.7/site-packages/pyroute2/ipdb/interfaces.py", line 651, in commit
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service raise newif
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service NetlinkError: (17, 'File exists')
2019-10-16 14:53:58.195 384 ERROR kuryr_kubernetes.cni.daemon.service

Expected results:
kuryr-daemon recovers and proceeds with port binding.


Additional info:

Comment 2 Luis Tomas Bolivar 2020-01-02 08:03:18 UTC
*** Bug 1786218 has been marked as a duplicate of this bug. ***

Comment 3 Itzik Brown 2020-01-05 08:49:21 UTC
Hi,
Please add information about how to verify.

Comment 4 Luis Tomas Bolivar 2020-01-07 08:08:05 UTC
Taking a look at the kuryr cni pod logs, while for instance running the OCP QE tests. And then check that those errors are now not there even if the kuryr-cni pods gets restarted

Comment 5 Chao Yang 2020-01-08 09:24:10 UTC
Hi

I reproduced this issue on

4.3.0-0.nightly-2020-01-06-185654

2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/proc/1909009/ns/net', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '4de18b6bcd163bea42cae9f619052080aad749474228630827cd3edaa2bf532b', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-kuryr;K8S_POD_NAME=iscsi-target;K8S_POD_INFRA_CONTAINER_ID=4de18b6bcd163bea42cae9f619052080aad749474228630827cd3edaa2bf532b'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 81, in add
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     vif = self.plugin.add(params)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 51, in add
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     vifs = self._do_work(params, b_base.connect)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 155, in _do_work
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 143, in connect
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     driver.connect(vif, ifname, netns, container_id)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 70, in connect
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     iface.net_ns_fd = utils.convert_netns(netns)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     self.commit()
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 649, in commit
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     raise newif
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 588, in commit
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     self.nl.link('add', **request)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1115, in link
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     msg_flags=msg_flags)
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 352, in nlm_request
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     return tuple(self._genlm_request(*argv, **kwarg))
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 836, in nlm_request
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     callback=callback):
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 355, in get
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     return tuple(self._genlm_get(*argv, **kwarg))
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 679, in get
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     raise msg['header']['error']
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service 
2020-01-08 09:03:24.667 37 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020 09:03:24] "POST /addNetwork HTTP/1.1" 500 -
2020-01-08 09:03:24.696 52 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',vlan_id=3386)
2020-01-08 09:03:24.703 52 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020 09:03:24] "POST /delNetwork HTTP/1.1" 204 -
2020-01-08 09:03:24.766 54 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',vlan_id=3386)
2020-01-08 09:03:24.773 54 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020 09:03:24] "POST /delNetwork HTTP/1.1" 204 -
2020-01-08 09:03:27.887 31 INFO kuryr_kubernetes.cni.health [-] CNI driver readiness verified.
2020-01-08 09:03:27.887 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:27] "GET /ready HTTP/1.1" 200 -
2020-01-08 09:03:32.648 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:32] "GET /alive HTTP/1.1" 200 -
2020-01-08 09:03:37.880 31 INFO kuryr_kubernetes.cni.health [-] CNI driver readiness verified.
2020-01-08 09:03:37.880 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:37] "GET /ready HTTP/1.1" 200 -
2020-01-08 09:03:42.649 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:42] "GET /alive HTTP/1.1" 200 -
2020-01-08 09:03:43.604 66 INFO os_vif [-] Successfully plugged vif VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',vlan_id=3386)
2020-01-08 09:03:47.877 31 INFO kuryr_kubernetes.cni.health [-] CNI driver readiness verified.
2020-01-08 09:03:47.878 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:47] "GET /ready HTTP/1.1" 200 -
2020-01-08 09:03:52.649 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020 09:03:52] "GET /alive HTTP/1.1" 200 -

Comment 6 Luis Tomas Bolivar 2020-01-08 09:59:01 UTC
(In reply to Chao Yang from comment #5)
> Hi
> 
> I reproduced this issue on
> 
> 4.3.0-0.nightly-2020-01-06-185654
> 
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service [-]
> Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': 'eth0',
> 'CNI_NETNS': '/proc/1909009/ns/net', 'CNI_PATH':
> '/opt/multus/bin:/var/lib/cni/bin', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID':
> '4de18b6bcd163bea42cae9f619052080aad749474228630827cd3edaa2bf532b',
> 'CNI_ARGS':
> 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-kuryr;K8S_POD_NAME=iscsi-
> target;
> K8S_POD_INFRA_CONTAINER_ID=4de18b6bcd163bea42cae9f619052080aad749474228630827
> cd3edaa2bf532b'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File
> exists')
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service
> Traceback (most recent call last):
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py",
> line 81, in add
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service     vif
> = self.plugin.add(params)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/
> k8s_cni_registry.py", line 51, in add
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> vifs = self._do_work(params, b_base.connect)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/
> k8s_cni_registry.py", line 155, in _do_work
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> container_id=params.CNI_CONTAINERID)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py",
> line 143, in connect
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> driver.connect(vif, ifname, netns, container_id)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py",
> line 70, in connect
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> iface.net_ns_fd = utils.convert_netns(netns)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209,
> in __exit__
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> self.commit()
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 649, in
> commit
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> raise newif
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 588, in
> commit
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> self.nl.link('add', **request)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1115, in
> link
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> msg_flags=msg_flags)
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 352,
> in nlm_request
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> return tuple(self._genlm_request(*argv, **kwarg))
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 836,
> in nlm_request
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> callback=callback):
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 355,
> in get
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> return tuple(self._genlm_get(*argv, **kwarg))
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service   File
> "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 679,
> in get
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service    
> raise msg['header']['error']
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service
> pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
> 2020-01-08 09:03:24.660 37 ERROR kuryr_kubernetes.cni.daemon.service 
> 2020-01-08 09:03:24.667 37 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020
> 09:03:24] "POST /addNetwork HTTP/1.1" 500 -
> 2020-01-08 09:03:24.696 52 INFO os_vif [-] Successfully unplugged vif
> VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,
> has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,
> network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',
> port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',
> vlan_id=3386)
> 2020-01-08 09:03:24.703 52 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020
> 09:03:24] "POST /delNetwork HTTP/1.1" 204 -
> 2020-01-08 09:03:24.766 54 INFO os_vif [-] Successfully unplugged vif
> VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,
> has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,
> network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',
> port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',
> vlan_id=3386)
> 2020-01-08 09:03:24.773 54 INFO werkzeug [-] 127.0.0.1 - - [08/Jan/2020
> 09:03:24] "POST /delNetwork HTTP/1.1" 204 -
> 2020-01-08 09:03:27.887 31 INFO kuryr_kubernetes.cni.health [-] CNI driver
> readiness verified.
> 2020-01-08 09:03:27.887 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:27] "GET /ready HTTP/1.1" 200 -
> 2020-01-08 09:03:32.648 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:32] "GET /alive HTTP/1.1" 200 -
> 2020-01-08 09:03:37.880 31 INFO kuryr_kubernetes.cni.health [-] CNI driver
> readiness verified.
> 2020-01-08 09:03:37.880 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:37] "GET /ready HTTP/1.1" 200 -
> 2020-01-08 09:03:42.649 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:42] "GET /alive HTTP/1.1" 200 -
> 2020-01-08 09:03:43.604 66 INFO os_vif [-] Successfully plugged vif
> VIFVlanNested(active=True,address=fa:16:3e:54:cf:9d,
> has_traffic_filtering=False,id=03464583-05be-4cfd-bbef-ecc485c3658c,
> network=Network(d049a1a6-bb45-411d-b5e1-124a17dd301b),plugin='noop',
> port_profile=<?>,preserve_on_delete=False,vif_name='tap03464583-05',
> vlan_id=3386)
> 2020-01-08 09:03:47.877 31 INFO kuryr_kubernetes.cni.health [-] CNI driver
> readiness verified.
> 2020-01-08 09:03:47.878 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:47] "GET /ready HTTP/1.1" 200 -
> 2020-01-08 09:03:52.649 31 INFO werkzeug [-] 10.196.0.14 - - [08/Jan/2020
> 09:03:52] "GET /alive HTTP/1.1" 200 -


Weird, that interface is supposed to be removed by the remove_ifaces method. Was the kuryr-cni got restarted? In the logs you posted, is there any WARNING message including: "Found hanging interface %(ifname)s inside...."?

Comment 7 Jon Uriarte 2020-01-08 14:49:53 UTC
Moving this BZ back to Assigned after some quick debugging with Luis.

The kuryr-cni pod running in a master node is restarting continuously, as it keeps unhealthy.
The WARNING message including: "Found hanging interface %(ifname)s inside...." is not present in kuryr-cni pod logs.

We've checked the fix [1] is included in the deployment (4.3.0-0.nightly-2020-01-06-185654).

$ oc get pods -n openshift-kuryr -o wide                                                                                                                                        
NAME                                   READY   STATUS              RESTARTS   AGE     IP             NODE                     
iscsi-target                           0/1     ContainerCreating   0          5h44m   <none>         ostest-vqq5d-worker-6jcw4
kuryr-cni-gqpgb                        1/1     Running             0          24h     10.196.0.39    ostest-vqq5d-worker-9tn5z
kuryr-cni-hbtj9                        1/1     Running             0          25h     10.196.0.13    ostest-vqq5d-master-0    
kuryr-cni-m7vrl                        0/1     CrashLoopBackOff    59         5h44m   10.196.0.14    ostest-vqq5d-worker-6jcw4
kuryr-cni-qwlxw                        1/1     Running             0          24h     10.196.0.28    ostest-vqq5d-worker-ltmfm
kuryr-cni-rzp9g                        1/1     Running             0          25h     10.196.0.33    ostest-vqq5d-master-1    
kuryr-cni-t4c7g                        1/1     Running             0          25h     10.196.0.26    ostest-vqq5d-master-2    
kuryr-controller-bdd967688-h9nzx       1/1     Running             8          25h     10.196.0.13    ostest-vqq5d-master-0    
kuryr-dns-admission-controller-6d979   1/1     Running             0          25h     10.128.46.13   ostest-vqq5d-master-1    
kuryr-dns-admission-controller-mqrwv   1/1     Running             0          25h     10.128.46.10   ostest-vqq5d-master-2    
kuryr-dns-admission-controller-rgqss   1/1     Running             0          25h     10.128.46.21   ostest-vqq5d-master-0    

$ oc get nodes
NAME                        STATUS                     ROLES    AGE   VERSION
ostest-vqq5d-master-0       Ready                      master   25h   v1.16.2
ostest-vqq5d-master-1       Ready                      master   25h   v1.16.2
ostest-vqq5d-master-2       Ready                      master   25h   v1.16.2
ostest-vqq5d-worker-6jcw4   Ready                      worker   24h   v1.16.2
ostest-vqq5d-worker-9tn5z   Ready,SchedulingDisabled   worker   24h   v1.16.2
ostest-vqq5d-worker-ltmfm   Ready,SchedulingDisabled   worker   24h   v1.16.2


The pod iscsi-target cannot be created since the network is failing.

$ oc get pods -n openshift-kuryr describe pod iscsi-target                                                                                                              [9/1910]
NAME           READY   STATUS              RESTARTS   AGE
iscsi-target   0/1     ContainerCreating   0          6h3m
Error from server (NotFound): pods "describe" not found
Error from server (NotFound): pods "pod" not found
(shiftstack) [stack@undercloud-0 ~]$ oc -n openshift-kuryr describe pod iscsi-target                                                                                                                               
Name:         iscsi-target
Namespace:    openshift-kuryr
Priority:     0
Node:         ostest-vqq5d-worker-6jcw4/10.196.0.14
Start Time:   Wed, 08 Jan 2020 03:39:24 -0500
Labels:       storage=iscsi-target
Annotations:  k8s.v1.cni.cncf.io/networks-status: 
              openshift.io/description: iscsi target
              openstack.org/kuryr-pod-label: {"storage": "iscsi-target"}
              openstack.org/kuryr-vif:
                {"versioned_object.changes": ["default_vif"], "versioned_object.data": {"additional_vifs": {}, "default_vif": {"versioned_object.changes":...
Status:       Pending
IP:           
IPs:          <none>
Containers:
  iscsi-target:
    Container ID:   
    Image:          aosqe/iscsi
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Readiness:      exec [targetcli ls /iscsi/iqn.2016-04.test.com:storage.target00/tpg1] delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /lib/modules from kernel (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ctxjw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kernel:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  default-token-ctxjw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-ctxjw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                     From                                Message
  ----     ------                  ----                    ----                                -------
  Warning  FailedCreatePodSandBox  3m18s (x885 over 6h1m)  kubelet, ostest-vqq5d-worker-6jcw4  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_iscsi-target_openshift-kuryr_cc05883f-cdc1-47d7-b740-d2a8350c8052_0(0b7b71eb8b381dc70af6d18693386480f9a93dcd414d9be3ab5e7b683a82fe90): Multus: error adding pod to network "kuryr": delegateAdd: error invoking DelegateAdd - "kuryr-cni": error in getting result from AddNetwork: CNI Daemon returned error 500


[1] https://github.com/openshift/kuryr-kubernetes/pull/120

Comment 17 Jon Uriarte 2020-03-24 15:59:21 UTC
Verified in 4.3.0-0.nightly-2020-03-23-130439 on top of OSP 16 RHOS_TRUNK-16.0-RHEL-8-20200226.n.1

After running kubernetes Network Policy tests and conformance tests, I couldn't find the error described in the BZ
in any of the kuryr cni pods.

As https://bugzilla.redhat.com/show_bug.cgi?id=1779163#c4 says, this bug is really hard and random to reproduce so do not
hesitate to reopen this issue if it appears again.

Comment 19 errata-xmlrpc 2020-04-01 19:10:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0930

Comment 20 Jon Uriarte 2020-04-07 13:35:01 UTC
Reopening as it was reproduced in OCP 4.5 and the fix needs to be backported:

Failed QA in OCP 4.5.0-0.nightly-2020-04-07-040639 on OSP 16 RHOS_TRUNK-16.0-RHEL-8-20200403.n.1 (with OVS).

The installer fails waiting for the cluster to initialize, and one kury-cni pod remains in crashloop:

$ oc get pods -n openshift-kuryr
NAME                              READY   STATUS    RESTARTS   AGE
kuryr-cni-7sbvw                   1/1     Running   0          5h11m
kuryr-cni-bg2hj                   1/1     Running   57         5h23m
kuryr-cni-nlxld                   1/1     Running   4          5h43m
kuryr-cni-sc922                   1/1     Running   2          5h43m
kuryr-cni-vfb7n                   1/1     Running   2          5h43m
kuryr-cni-vsvnc                   1/1     Running   0          5h11m
kuryr-controller-9dff6456-7kjw9   1/1     Running   1          5h43m


ERROR kuryr_kubernetes.cni.binding.nested [-] Creation of pod interface failed, most likely due to duplicated VLAN id. This will probably cause kuryr-daemon to crashloop. Trying to gat
 pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested Traceback (most recent call last):
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 86, in connect
ERROR kuryr_kubernetes.cni.binding.nested     iface.net_ns_fd = utils.convert_netns(netns) 
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209, in __exit__
ERROR kuryr_kubernetes.cni.binding.nested     self.commit()
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 650, in commit
ERROR kuryr_kubernetes.cni.binding.nested     raise newif
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 589, in commit
ERROR kuryr_kubernetes.cni.binding.nested     self.nl.link('add', **request)
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1163, in link
ERROR kuryr_kubernetes.cni.binding.nested     msg_flags=msg_flags)
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 373, in nlm_request
ERROR kuryr_kubernetes.cni.binding.nested     return tuple(self._genlm_request(*argv, **kwarg))
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 864, in nlm_request
ERROR kuryr_kubernetes.cni.binding.nested     callback=callback):
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 376, in get
ERROR kuryr_kubernetes.cni.binding.nested     return tuple(self._genlm_get(*argv, **kwarg))
ERROR kuryr_kubernetes.cni.binding.nested   File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 701, in get
ERROR kuryr_kubernetes.cni.binding.nested     raise msg['header']['error']
ERROR kuryr_kubernetes.cni.binding.nested pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested 
ERROR kuryr_kubernetes.cni.binding.nested [-] List of host interfaces: {1: {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, 'qdisc': 'no 
queue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs': 65535, 'g
so_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (('::1', 128)
, ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ('0.0.0.0',)}, 'lo': {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536,
'qdisc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs
': 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (
('::1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ('0.0.0.0',)}, 2: {'address': 'fa:16:3e:ff:5e:0f', 'broadcast': 'ff:ff:ff:ff:ff:ff', 'ifname': 'ens3', 'mt
u': 1450, 'qdisc': 'fq_codel', 'txqlen': 1000, 'operstate': 'UP', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 2, 'proto_down': 0, 'gso_ma
x_segs': 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 1, 'carrier_down_count': 1, 'index': 2, 'flags': 69699, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipad
dr': (('fe80::a777:cb9f:a473:bc9a', 64), ('10.196.1.28', 16)), 'ports': (), 'family': 0, 'ifi_type': 1, 'state': 'up', 'neighbours': ('10.196.0.12', '10.196.3.50', '10.196.0.102', '10.196.0.11', '224.0.0.22', '1
0.196.0.1', '10.196.0.5', '224.0.0.18', '10.196.0.7', '10.196.1.206', '224.0.0.251')}, 'ens3': {'address': 'fa:16:3e:ff:5e:0f', 'broadcast': 'ff:ff:ff:ff:ff:ff', 'ifname': 'ens3', 'mtu': 1450, 'qdisc': 'fq_codel
', 'txqlen': 1000, 'operstate': 'UP', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 2, 'proto_down': 0, 'gso_max_segs': 65535, 'gso_max_siz
e': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 1, 'carrier_down_count': 1, 'index': 2, 'flags': 69699, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': (('fe80::a777:cb9f:a473:bc9a', 64), ('10.196.1.28', 16)), 'ports': (), 'family': 0, 'ifi_type': 1, 'state': 'up', 'neighbours': ('10.196.0.12', '10.196.3.50', '10.196.0.102', '10.196.0.11', '224.0.0.22', '10.196.0.1', '10.196.0.5', '2
24.0.0.18', '10.196.0.7', '10.196.1.206', '224.0.0.251')}}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
ERROR kuryr_kubernetes.cni.binding.nested [-] List of pod namespace interfaces: {1: {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, 'qd
isc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs':
65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': ((':
:1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ()}, 'lo': {'address': '00:00:00:00:00:00', 'broadcast': '00:00:00:00:00:00', 'ifname': 'lo', 'mtu': 65536, '
qdisc': 'noqueue', 'txqlen': 1000, 'operstate': 'UNKNOWN', 'linkmode': 0, 'group': 0, 'promiscuity': 0, 'num_tx_queues': 1, 'num_rx_queues': 1, 'carrier': 1, 'carrier_changes': 0, 'proto_down': 0, 'gso_max_segs'
: 65535, 'gso_max_size': 65536, 'xdp': '05:00:02:00:00:00:00:00', 'carrier_up_count': 0, 'carrier_down_count': 0, 'index': 1, 'flags': 65609, 'ipdb_scope': 'system', 'ipdb_priority': 0, 'vlans': (), 'ipaddr': ((
'::1', 128), ('127.0.0.1', 8)), 'ports': (), 'family': 0, 'ifi_type': 772, 'state': 'up', 'neighbours': ()}}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/proc/134124/ns/net', 'CNI_PATH': '/opt/multus/
bin:/var/lib/cni/bin', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '97359d9fb8a375cb2d0dcb86baa6981366363d04f912b910a3c3b4308baf1396', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_
NAME=thanos-querier-6ccdccb756-znrtz;K8S_POD_INFRA_CONTAINER_ID=97359d9fb8a375cb2d0dcb86baa6981366363d04f912b910a3c3b4308baf1396'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')

Comment 21 Michał Dulko 2020-07-09 12:10:25 UTC
Hm, seems like bot got stuck with this one?


Note You need to log in before you can comment on or make changes to this bug.