Bug 2003193
Summary: | Kubelet/CRI-O leaks netns and veth ports on the host
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Tim Rozet <trozet>
Component: | Node | Assignee: | Peter Hunt <pehunt>
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | urgent | |
Priority: | high | CC: | aos-bugs, cback, cgoncalves, danw, dblack, ealcaniz, eglottma, eminguez, gwest, jpradhan, pehunt, rjamadar, smalleni, sscheink, vpickard, wking
Version: | 4.8 | |
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | perfscale-ovn | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: Pod namespaces (network, IPC, and UTS) managed by CRI-O were only unmounted when the pod was removed.
Consequence: Because the kubelet can take a long time to remove pods, the namespaces appeared to be leaked.
Fix: Unmount and remove the namespaces when the pod stops.
Result: The namespaces no longer appear to be leaked.
|
Story Points: | --- | |
Clone Of: | | Clones: | 2026386, 2028126, 2028127 (view as bug list)
Environment: | | |
Last Closed: | 2022-03-10 16:09:11 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2012836, 2026386, 2026388, 2028126, 2028127, 2078400
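
The Doc Text above describes the fix as unmounting the pinned namespaces on pod stop rather than on pod removal. As a minimal sketch of how one might inspect those pinned namespace mounts on a node: the /var/run/netns, /var/run/ipcns, and /var/run/utsns paths below are an assumption based on CRI-O's default namespaces_dir, not something stated in this bug.

$ findmnt -t nsfs                                   # list all pinned namespace (nsfs) bind mounts on the host
$ ls /var/run/netns /var/run/ipcns /var/run/utsns   # assumed default locations where CRI-O pins pod namespaces

Before the fix, these mounts persisted until the pod sandbox was removed, so a kubelet that was slow to remove stopped pods made the namespaces look leaked; with the fix, they are unmounted and removed as soon as the pod stops.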
Description
Tim Rozet 2021-09-10 14:52:44 UTC

Comment 1:
Can you describe a concise reproducer so we can observe the netns/veths not being cleaned up?

Comment 2:
With 4.9, we can reproduce it by running the node-density-lite scale test repeatedly on a 20-node AWS cluster. In that case OVS has around 200 ports, but the host has over 2,100 leftover netns and veths. I can reproduce it for you if you want. I think Dan Winship is going to add some more information to this BZ with what he has found as well.

Comment 3:
You don't need to do any scale stuff. We only *noticed* it at scale, but it will happen if you just create one pod and then delete it (a minimal sketch of this check is included at the end of this report).

Comment 4:
The node-density-lite test will:
1. create 249 pods per node total, at a pod creation rate of 20/sec, in a test namespace
2. delete the namespace after the test is complete
3. re-run steps 1 and 2 multiple times

Comment 5 (Dan Winship):
Oh, and it appears to have started in 4.8. Earlier releases cleaned everything up properly.

Comment 6:
Filed https://bugzilla.redhat.com/show_bug.cgi?id=2003195 for OVN to ensure the host veths are removed on CNI delete or add failure.

Comment 7 (Dan Winship):
(In reply to Dan Winship from comment #5)
> oh, and it appears to have started in 4.8. Earlier releases cleaned
> everything up properly.

Sorry, I screwed up my testing before: 4.7 has the bug too. So my test results are:
- 4.4 nightly: not buggy
- 4.7 nightly: buggy
- 4.8.1: buggy
- 4.8 nightly: buggy
- master: buggy

Comment 8:
Fixed by the attached PR.

Comment 9:
*** Bug 2025329 has been marked as a duplicate of this bug. ***

Comment 10:
PR merged.

Comment 11:
Verified on 4.10.0-0.nightly-2021-12-06-201335. Created pods and checked veth ports on a node while the pods were running and after they were deleted.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         3h52m   Cluster version is 4.10.0-0.nightly-2021-12-06-201335

Comment 12:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 13:
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.
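
Following up on comments 3, 4, and 11: a minimal sketch for observing the leak (or verifying the fix) without the scale test. The busybox image, the pod name leak-test, and node access via oc debug are all illustrative assumptions; any image and any way of getting a shell on the node work equally well.

# On the node (e.g. via "oc debug node/<node>" and "chroot /host"), record baselines:
$ ip netns list | wc -l               # pinned network namespaces
$ ip -o link show type veth | wc -l   # host-side veth interfaces

# From a workstation, create a single pod, wait for it to finish, then delete it:
$ oc run leak-test --image=busybox --restart=Never --command -- sleep 60
$ oc delete pod leak-test

# Re-check the counts on the node. On an affected build they stay elevated
# after the pod is gone; on a fixed build they return to the baseline:
$ ip netns list | wc -l
$ ip -o link show type veth | wc -l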