Description of problem: With openshift-sdn and OVN, veth ports are used to connect containers to OVS. The CNI will delete the OVS side of the veth port, while kubelet/crio is supposed to garbage collect and clean up leftover netns/veths after the pod is deleted. However, this is not happening, so leaked veths accumulate on the node over time. This drives OVS CPU to 100%, because OVS iterates over all the ports on the host during certain events: https://github.com/openvswitch/ovs/blob/f686957c9667ae962fb8fc003be2a5482e380d75/lib/netdev.c#L2191 As the leaked ports build up on a node, pod latency increases and other performance suffers.
can you describe a concise reproducer so we can observe the netns/veths not being cleaned up?
With 4.9, we can reproduce it by repeatedly running the node-density-lite scale test on a 20-node AWS cluster. In this case OVS has around 200 ports, but the host has over 2100 leftover netns and veths. I can reproduce it for you if you want. I think Dan Winship is going to add some more information to this bz with what he has found as well.
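The comparison above (OVS port count versus host veth count) can be sketched roughly as follows. This is only an illustration, not the procedure used in this report: the helper names `count_lines` and `leak_report` are made up, and the bridge name `br-int` in the usage comment is an assumption (openshift-sdn uses a different bridge). OVS ports also include non-veth ports, so the difference is only a rough indicator.

```shell
#!/bin/sh
# Count non-empty lines on stdin. Intended input is one-line-per-item
# output such as `ip -o link show type veth` or `ovs-vsctl list-ports`.
count_lines() {
    # grep -c prints 0 but exits non-zero when nothing matches
    grep -c . || true
}

# Compare the host veth count ($1) with the OVS port count ($2).
leak_report() {
    host=$1
    ovs=$2
    if [ "$host" -gt "$ovs" ]; then
        echo "possible leak: $((host - ovs)) host veths beyond the OVS port count"
    else
        echo "no excess veths"
    fi
}

# Usage on a node (bridge name br-int is an assumption):
#   host=$(ip -o link show type veth | count_lines)
#   ovs=$(ovs-vsctl list-ports br-int | count_lines)
#   leak_report "$host" "$ovs"
```

With the numbers quoted above, `leak_report 2100 200` would flag roughly 1900 excess veths.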
You don't need to do any scale stuff. We only *noticed* it at scale, but it will happen if you just create one pod and then delete it.
the node-density-lite test will:
1. create 249 pods per node total, at a pod creation rate of 20/sec in a test namespace
2. after the test is complete, delete the namespace
3. re-run steps 1 and 2 multiple times
oh, and it appears to have started in 4.8. Earlier releases cleaned everything up properly.
Filed https://bugzilla.redhat.com/show_bug.cgi?id=2003195 for OVN to ensure the host veths are removed on CNI delete or add failure.
(In reply to Dan Winship from comment #5)
> oh, and it appears to have started in 4.8. Earlier releases cleaned
> everything up properly.
Sorry, I screwed up my testing before. 4.7 has the bug too. So my test results are:
4.4 nightly: not buggy
4.7 nightly: buggy
4.8.1: buggy
4.8 nightly: buggy
master: buggy
fixed by attached PR
*** Bug 2025329 has been marked as a duplicate of this bug. ***
PR merged
Verified on 4.10.0-0.nightly-2021-12-06-201335. Created pods and checked veth ports on a node while pods were running and after pods were deleted.
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         3h52m   Cluster version is 4.10.0-0.nightly-2021-12-06-201335
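A before/after check of the kind described in this verification could be scripted along these lines. This is a hedged sketch, not the commands actually run here: `veth_names` and `new_veths` are hypothetical helpers, and the snapshot file paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Extract sorted interface names from `ip -o link show type veth`
# output on stdin, e.g. "12: veth1234@if3: <...>" becomes "veth1234".
veth_names() {
    awk -F': ' '{ sub(/@.*/, "", $2); print $2 }' | LC_ALL=C sort
}

# Print names present in the "after" snapshot ($2) but not in the
# "before" snapshot ($1). Both files must be sorted by veth_names.
new_veths() {
    LC_ALL=C comm -13 "$1" "$2"
}

# Usage on a node (file paths are placeholders):
#   ip -o link show type veth | veth_names > /tmp/veth.before
#   ... create pods on this node, then delete them ...
#   ip -o link show type veth | veth_names > /tmp/veth.after
#   new_veths /tmp/veth.before /tmp/veth.after
```

If the second snapshot is taken after all test pods are gone, any name printed is a veth that was created during the test and not cleaned up; on a fixed build the output should be empty, assuming no unrelated pods started on the node in between.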
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056