Bug 1891559 - kuryr-cni cannot start new thread
Summary: kuryr-cni cannot start new thread
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ---
: 4.7.0
Assignee: Michał Dulko
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks: 1902204
TreeView+ depends on / blocked
 
Reported: 2020-10-26 16:01 UTC by rlobillo
Modified: 2021-02-24 15:29 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Outdated pyroute2 library was leaking processes. Consequence: kuryr-cni pods were exceeding thread number limit and getting restarted when spawning a new one failed. Fix: pyroute2 got upgraded to 0.5.13. Result: kuryr-cni pods are no longer getting restarted.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:28:28 UTC
Target Upstream Version:


Attachments (Terms of Use)
kuryr-cni logs hitting the issue (497.29 KB, text/plain)
2020-10-26 16:01 UTC, rlobillo
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 415 0 None closed Bug 1891559: Force pyroute2 0.5.7 2021-01-08 09:01:34 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:04 UTC

Description rlobillo 2020-10-26 16:01:31 UTC
Created attachment 1724235 [details]
kuryr-cni logs hitting the issue

Description of problem:

After running the cluster for two days, one kuryr-cni is restarted from time to time with below exception:

2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0683b94a-107f-4c25-ae60-564be486d8fa', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'DEL', 'CNI_CONTAINERID': '7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-marketplace;K8S_POD_NAME=certified-operators-6qtw4;K8S_POD_INFRA_CONTAINER_ID=7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983'}.: RuntimeError: can't start new thread
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 122, in delete
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     self.plugin.delete(params)
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 127, in delete
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     self._do_work(params, b_base.disconnect, 5)
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 188, in _do_work
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     container_id=params.CNI_CONTAINERID)
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 169, in disconnect
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     driver.disconnect(vif, ifname, netns, container_id)
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 141, in disconnect
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     with b_base.get_ipdb(netns) as c_ipdb:
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 71, in get_ipdb
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns))
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/main.py", line 880, in __init__
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     self.initdb()
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/main.py", line 984, in initdb
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     tx.start()
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service   File "/usr/lib64/python3.6/threading.py", line 846, in start
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service     _start_new_thread(self._bootstrap, ())
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service RuntimeError: can't start new thread
2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service [00m
2020-10-26 05:47:40.994 24090 INFO werkzeug [-] 127.0.0.1 - - [26/Oct/2020 05:47:40] "[35m[1mPOST /delNetwork HTTP/1.1[0m" 500 -[00m
2020-10-26 05:47:41.347 24100 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0683b94a-107f-4c25-ae60-564be486d8fa', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'DEL', 'CNI_CONTAINERID': '7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-marketplace;K8S_POD_NAME=certified-operators-6qtw4;K8S_POD_INFRA_CONTAINER_ID=7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983'}.: BlockingIOError: [Errno 11] Resource temporarily unavailable

This happened sporadically:

$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-6wbpb                     1/1     Running   2          2d6h
kuryr-cni-8rnvz                     1/1     Running   8          2d6h
kuryr-cni-cw6xq                     1/1     Running   0          2d6h
kuryr-cni-hjnzr                     1/1     Running   0          2d6h
kuryr-cni-rx7tn                     1/1     Running   2          2d6h
kuryr-cni-sfw8h                     1/1     Running   0          2d6h
kuryr-controller-7ccfc75cf4-5qzpt   1/1     Running   0          2d6h

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-10-24-042953 over OSP16.1 (RHOS-16.1-RHEL-8-20201007.n.0)

How reproducible: observed once.


Steps to Reproduce:
1. Install latest OCP4.7 over OSP16.1 with OVN-OCtavia.
2.
3.

Actual results: one CNI is being restart from time to time


Expected results: CNI should not be restarted.


Additional info: kuryr-cni logs attached.

Comment 1 rlobillo 2020-10-26 19:35:50 UTC
[stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
kuryr-cni-6wbpb                     1/1     Running   2          2d10h   10.196.0.123   ostest-64tv9-worker-0-rdmqt   <none>           <none>
kuryr-cni-8rnvz                     1/1     Running   8          2d10h   10.196.2.75    ostest-64tv9-worker-0-qhmz9   <none>           <none>
kuryr-cni-cw6xq                     1/1     Running   0          2d10h   10.196.0.16    ostest-64tv9-master-1         <none>           <none>
kuryr-cni-hjnzr                     1/1     Running   0          2d10h   10.196.1.49    ostest-64tv9-master-2         <none>           <none>
kuryr-cni-rx7tn                     1/1     Running   2          2d10h   10.196.1.70    ostest-64tv9-worker-0-w4tv4   <none>           <none>
kuryr-cni-sfw8h                     1/1     Running   0          2d10h   10.196.0.26    ostest-64tv9-master-0         <none>           <none>
kuryr-controller-7ccfc75cf4-5qzpt   1/1     Running   0          2d10h   10.196.1.49    ostest-64tv9-master-2         <none>           <none>

Comment 2 Michał Dulko 2020-11-23 15:43:15 UTC
I'm moving it to POST as we'll only wait for build now that newer pyroute2 is tagged into AOS.

Comment 4 rlobillo 2020-11-30 09:06:20 UTC
Verified on OCP4.7.0-0.nightly-2020-11-25-015010 over OSP16.1 with OVN-Octavia (RHOS-16.1-RHEL-8-20201110.n.1) and OSP13 with Amphoras (2020-11-13.1).

After 3 days running, no kuryr-cni pods restarts are observed:


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-25-015010   True        False         4d22h   Cluster version is 4.7.0-0.nightly-2020-11-25-015010
[stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-58l79                     1/1     Running   0          3d18h
kuryr-cni-ct2d7                     1/1     Running   0          3d18h
kuryr-cni-hz5kx                     1/1     Running   0          3d18h
kuryr-cni-jr5kv                     1/1     Running   0          3d18h
kuryr-cni-kzm4h                     1/1     Running   0          3d18h
kuryr-cni-l4s5m                     1/1     Running   0          3d18h
kuryr-controller-5bcb56996c-d8hz9   1/1     Running   0          3d18h

**********

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-25-015010   True        False         5d      Cluster version is 4.7.0-0.nightly-2020-11-25-015010
$ oc get pods -n openshift-kuryr
NAME                               READY   STATUS    RESTARTS   AGE
kuryr-cni-4chc5                    1/1     Running   0          4d1h
kuryr-cni-bwbg2                    1/1     Running   0          4d1h
kuryr-cni-cvtnx                    1/1     Running   0          4d1h
kuryr-cni-j6jjb                    1/1     Running   0          4d1h
kuryr-cni-kjc7q                    1/1     Running   0          4d1h
kuryr-cni-s6vw5                    1/1     Running   0          4d1h
kuryr-controller-dccc57798-9mhlk   1/1     Running   0          4d


Furthermore, the pyroute2 version present on CNI pods is 0.5.13:

$ oc rsh -n openshift-kuryr kuryr-cni-58l79
sh-4.4# pip3 show pyroute2
Name: pyroute2
Version: 0.5.13
Summary: Python Netlink library
Home-page: https://github.com/svinota/pyroute2
Author: Peter V. Saveliev
Author-email: peter@svinota.eu
License: dual license GPLv2+ and Apache v2
Location: /usr/lib/python3.6/site-packages
Requires: 

**********

$ oc rsh -n openshift-kuryr kuryr-cni-4chc5                                                                                                                  
sh-4.4# pip3 show pyroute2
Name: pyroute2
Version: 0.5.13
Summary: Python Netlink library
Home-page: https://github.com/svinota/pyroute2
Author: Peter V. Saveliev
Author-email: peter@svinota.eu
License: dual license GPLv2+ and Apache v2
Location: /usr/lib/python3.6/site-packages
Requires:

Comment 7 errata-xmlrpc 2021-02-24 15:28:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.