Bug 1891559
Summary: | kuryr-cni cannot start new thread | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | rlobillo | ||||
Component: | Networking | Assignee: | Michał Dulko <mdulko> | ||||
Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh> | ||||
Status: | CLOSED ERRATA | Docs Contact: | |||||
Severity: | medium | ||||||
Priority: | high | ||||||
Version: | 4.7 | ||||||
Target Milestone: | --- | ||||||
Target Release: | 4.7.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
Cause: Outdated pyroute2 library was leaking processes.
Consequence: kuryr-cni pods were exceeding thread number limit and getting restarted when spawning a new one failed.
Fix: pyroute2 got upgraded to 0.5.13.
Result: kuryr-cni pods are no longer getting restarted.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2021-02-24 15:28:28 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1902204 | ||||||
Attachments: |
|
[stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES kuryr-cni-6wbpb 1/1 Running 2 2d10h 10.196.0.123 ostest-64tv9-worker-0-rdmqt <none> <none> kuryr-cni-8rnvz 1/1 Running 8 2d10h 10.196.2.75 ostest-64tv9-worker-0-qhmz9 <none> <none> kuryr-cni-cw6xq 1/1 Running 0 2d10h 10.196.0.16 ostest-64tv9-master-1 <none> <none> kuryr-cni-hjnzr 1/1 Running 0 2d10h 10.196.1.49 ostest-64tv9-master-2 <none> <none> kuryr-cni-rx7tn 1/1 Running 2 2d10h 10.196.1.70 ostest-64tv9-worker-0-w4tv4 <none> <none> kuryr-cni-sfw8h 1/1 Running 0 2d10h 10.196.0.26 ostest-64tv9-master-0 <none> <none> kuryr-controller-7ccfc75cf4-5qzpt 1/1 Running 0 2d10h 10.196.1.49 ostest-64tv9-master-2 <none> <none> I'm moving it to POST as we'll only wait for build now that newer pyroute2 is tagged into AOS. Verified on OCP4.7.0-0.nightly-2020-11-25-015010 over OSP16.1 with OVN-Octavia (RHOS-16.1-RHEL-8-20201110.n.1) and OSP13 with Amphoras (2020-11-13.1). After 3 days running, no kuryr-cni pods restarts are observed: $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.7.0-0.nightly-2020-11-25-015010 True False 4d22h Cluster version is 4.7.0-0.nightly-2020-11-25-015010 [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr NAME READY STATUS RESTARTS AGE kuryr-cni-58l79 1/1 Running 0 3d18h kuryr-cni-ct2d7 1/1 Running 0 3d18h kuryr-cni-hz5kx 1/1 Running 0 3d18h kuryr-cni-jr5kv 1/1 Running 0 3d18h kuryr-cni-kzm4h 1/1 Running 0 3d18h kuryr-cni-l4s5m 1/1 Running 0 3d18h kuryr-controller-5bcb56996c-d8hz9 1/1 Running 0 3d18h ********** $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.7.0-0.nightly-2020-11-25-015010 True False 5d Cluster version is 4.7.0-0.nightly-2020-11-25-015010 $ oc get pods -n openshift-kuryr NAME READY STATUS RESTARTS AGE kuryr-cni-4chc5 1/1 Running 0 4d1h kuryr-cni-bwbg2 1/1 Running 0 4d1h kuryr-cni-cvtnx 1/1 Running 0 4d1h kuryr-cni-j6jjb 1/1 Running 0 4d1h kuryr-cni-kjc7q 1/1 Running 0 4d1h kuryr-cni-s6vw5 1/1 Running 0 4d1h kuryr-controller-dccc57798-9mhlk 1/1 Running 0 4d Furthermore, the pyroute2 version present on CNI pods is 0.5.13: $ oc rsh -n openshift-kuryr kuryr-cni-58l79 sh-4.4# pip3 show pyroute2 Name: pyroute2 Version: 0.5.13 Summary: Python Netlink library Home-page: https://github.com/svinota/pyroute2 Author: Peter V. Saveliev Author-email: peter License: dual license GPLv2+ and Apache v2 Location: /usr/lib/python3.6/site-packages Requires: ********** $ oc rsh -n openshift-kuryr kuryr-cni-4chc5 sh-4.4# pip3 show pyroute2 Name: pyroute2 Version: 0.5.13 Summary: Python Netlink library Home-page: https://github.com/svinota/pyroute2 Author: Peter V. Saveliev Author-email: peter License: dual license GPLv2+ and Apache v2 Location: /usr/lib/python3.6/site-packages Requires: Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633 |
Created attachment 1724235 [details] kuryr-cni logs hitting the issue Description of problem: After running the cluster for two days, one kuryr-cni is restarted from time to time with below exception: 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0683b94a-107f-4c25-ae60-564be486d8fa', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'DEL', 'CNI_CONTAINERID': '7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-marketplace;K8S_POD_NAME=certified-operators-6qtw4;K8S_POD_INFRA_CONTAINER_ID=7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983'}.: RuntimeError: can't start new thread 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last): 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 122, in delete 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service self.plugin.delete(params) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 127, in delete 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service self._do_work(params, b_base.disconnect, 5) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 188, in _do_work 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service container_id=params.CNI_CONTAINERID) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 169, in disconnect 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service driver.disconnect(vif, ifname, netns, container_id) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 141, in disconnect 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service with b_base.get_ipdb(netns) as c_ipdb: 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 71, in get_ipdb 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service ipdb = pyroute2.IPDB(nl=pyroute2.NetNS(netns)) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/main.py", line 880, in __init__ 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service self.initdb() 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/main.py", line 984, in initdb 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service tx.start() 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib64/python3.6/threading.py", line 846, in start 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service _start_new_thread(self._bootstrap, ()) 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service RuntimeError: can't start new thread 2020-10-26 05:47:40.975 24090 ERROR kuryr_kubernetes.cni.daemon.service [00m 2020-10-26 05:47:40.994 24090 INFO werkzeug [-] 127.0.0.1 - - [26/Oct/2020 05:47:40] "[35m[1mPOST /delNetwork HTTP/1.1[0m" 500 -[00m 2020-10-26 05:47:41.347 24100 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing delNetwork request. CNI Params: {'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0683b94a-107f-4c25-ae60-564be486d8fa', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'DEL', 'CNI_CONTAINERID': '7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-marketplace;K8S_POD_NAME=certified-operators-6qtw4;K8S_POD_INFRA_CONTAINER_ID=7ca11ff7f3b671253a8b3d37fdacf79a5980d01808d4a60994f46e21519b7983'}.: BlockingIOError: [Errno 11] Resource temporarily unavailable This happened sporadically: $ oc get pods -n openshift-kuryr NAME READY STATUS RESTARTS AGE kuryr-cni-6wbpb 1/1 Running 2 2d6h kuryr-cni-8rnvz 1/1 Running 8 2d6h kuryr-cni-cw6xq 1/1 Running 0 2d6h kuryr-cni-hjnzr 1/1 Running 0 2d6h kuryr-cni-rx7tn 1/1 Running 2 2d6h kuryr-cni-sfw8h 1/1 Running 0 2d6h kuryr-controller-7ccfc75cf4-5qzpt 1/1 Running 0 2d6h Version-Release number of selected component (if applicable): 4.7.0-0.nightly-2020-10-24-042953 over OSP16.1 (RHOS-16.1-RHEL-8-20201007.n.0) How reproducible: observed once. Steps to Reproduce: 1. Install latest OCP4.7 over OSP16.1 with OVN-OCtavia. 2. 3. Actual results: one CNI is being restart from time to time Expected results: CNI should not be restarted. Additional info: kuryr-cni logs attached.