Created attachment 1813495 [details]
Logs from the ovnkube-master

Description of problem:
After cleaning up a namespace with 18 app pods/worker in an ICNI2 environment, the corresponding 501 lr policies take approx. 2 hours to be cleaned up (more than twice the time it took to create the pods).

Version-Release number of selected component (if applicable):
OCP: 4.7.11

How reproducible:

Steps to Reproduce:
1. Create 18 app pods/worker in a namespace
2. Delete the pods and corresponding namespace

Actual results:
501 lr policies take 2 hours to be cleaned up

Expected results:

Additional info:
No apparent errors in the logs (attaching the logs to the bugzilla).
What I can see from the logs is that the served-ns namespace was removed along with the 2017 pods contained within it. The log excerpt shows that around 40 gw routes/policies are removed per pod, and that on each removal a lookup is made on the route table to check whether any routes remain for an outport/gateway with BFD references, so that BFD can be removed as well. Each lookup takes 150ms. If we do the math (2017 pods x 40 removals x 150ms), it would take more than 3 hours; perhaps the lookups get faster as the route table is cleared, reducing that to the 2 hours observed. While 150ms is a lot of time, it cannot be ruled unreasonable without knowing how full the table was, considering that it is un-indexed. I am going to set this as blocker- given that there is no indication of a major issue yet. Could this be reproduced on a clean run with must-gather, which would allow us to see the state of the NB DB?
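For reference, the estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the three constants are the figures quoted in this comment, the function name is made up for illustration, and it assumes every removal pays the full lookup cost (which, as noted, is probably pessimistic once the table starts to empty).

```python
# Back-of-the-envelope estimate using the figures quoted in the comment.
PODS = 2017             # pods removed together with the served-ns namespace
REMOVALS_PER_POD = 40   # approx. gw routes/policies removed per pod
LOOKUP_SECONDS = 0.150  # observed cost of one route table lookup


def estimated_cleanup_hours(pods, removals_per_pod, lookup_seconds):
    """Total cleanup time in hours if every removal pays the full lookup cost."""
    return pods * removals_per_pod * lookup_seconds / 3600


if __name__ == "__main__":
    hours = estimated_cleanup_hours(PODS, REMOVALS_PER_POD, LOOKUP_SECONDS)
    print(f"{hours:.2f} h")  # prints "3.36 h"
```

That upper bound of roughly 3.4 hours is consistent with the "more than 3 hours" figure, and the observed 2 hours would fit if the average lookup cost drops as routes are deleted.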
Tim has a fix where we split this into per pod and new cache for routes instead of using one giant cache for everything: https://github.com/openshift/ovn-kubernetes/pull/770/commits/985c99c42e05f58f5cee19716bc052f02d2e0094
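To illustrate why a dedicated route cache helps, here is a minimal Python sketch. It is not the code from the linked commit (ovn-kubernetes is written in Go), and the class and method names are hypothetical. It only shows the general idea: if routes are indexed by gateway, the question "does this gateway still have routes, or can its BFD entry go too?" becomes a dictionary lookup instead of a scan over an un-indexed table.

```python
class GatewayRouteIndex:
    """Hypothetical index: gateway IP -> set of (pod, prefix) routes using it."""

    def __init__(self):
        self._routes_by_gateway = {}

    def add(self, gateway, pod, prefix):
        """Record that a route for (pod, prefix) goes through this gateway."""
        self._routes_by_gateway.setdefault(gateway, set()).add((pod, prefix))

    def remove(self, gateway, pod, prefix):
        """Remove one route.

        Returns True when the gateway has no routes left, i.e. when its
        BFD entry could be removed as well; False otherwise.
        """
        routes = self._routes_by_gateway.get(gateway)
        if routes is None:
            return False  # unknown gateway, nothing to clean up
        routes.discard((pod, prefix))
        if routes:
            return False  # other routes still reference this gateway
        del self._routes_by_gateway[gateway]
        return True


if __name__ == "__main__":
    index = GatewayRouteIndex()
    index.add("10.0.0.1", "pod-a", "10.128.0.5/32")
    index.add("10.0.0.1", "pod-b", "10.128.0.6/32")
    print(index.remove("10.0.0.1", "pod-a", "10.128.0.5/32"))  # False
    print(index.remove("10.0.0.1", "pod-b", "10.128.0.6/32"))  # True
```

With an index like this, the cost of each removal no longer depends on how many routes are in the table, which is the property the 150ms lookup was missing.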
https://github.com/ovn-org/ovn-kubernetes/pull/2516
https://github.com/openshift/ovn-kubernetes/pull/750 is merged, can we get this verified?
Validated on a 4.10.0-rc.6 environment with 118 workers. Created 2124 app pods in the same app namespace. The 501 lr policies were deleted immediately after the corresponding namespaces were deleted. Thanks,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.4 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0811