Bug 1993204 - Slowness in 501 lr policies cleanup on an ICNI2 env
Summary: Slowness in 501 lr policies cleanup on an ICNI2 env
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Jose Castillo Lema
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-12 14:30 UTC by Jose Castillo Lema
Modified: 2022-03-16 11:12 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 11:12:07 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs from the ovnkube-master (3.08 MB, text/plain)
2021-08-12 14:30 UTC, Jose Castillo Lema


Links
Github openshift ovn-kubernetes pull 750 (Merged): Bug 1959352: phase 2 scale improvements - 2021-10-20 16:04:07 UTC
Github ovn-org ovn-kubernetes pull 2516 (Merged): Reduce nsInfo contention on external gateway ops - 2021-09-30 20:49:04 UTC
Red Hat Product Errata RHBA-2022:0811 - 2022-03-16 11:12:32 UTC

Description Jose Castillo Lema 2021-08-12 14:30:43 UTC
Created attachment 1813495 [details]
Logs from the ovnkube-master

Description of problem:
After cleaning up a namespace with 18 app pods/worker in an ICNI2 environment, the corresponding 501 lr policies take approx. 2 hours to be cleaned up (more than twice the time it took to create the pods).

Version-Release number of selected component (if applicable):
OCP: 4.7.11

How reproducible:

Steps to Reproduce:
1. Create 18 app pods/worker in a namespace
2. Delete the pods and corresponding namespace

Actual results:
The 501 lr policies take approx. 2 hours to be cleaned up

Expected results:


Additional info:
No apparent errors in the logs (attaching the logs to the bugzilla).
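
For anyone reproducing this, a quick (hypothetical) way to watch the cleanup from the cluster side is to count the priority-501 policies left on ovn_cluster_router. The Go snippet below just shells out to ovn-nbctl and must run somewhere that can reach the NB DB (e.g. inside the ovnkube-master pod); it is only a sketch for observation and is not part of this report.

// Hypothetical observer: counts the remaining priority-501 logical router
// policies on ovn_cluster_router by shelling out to ovn-nbctl.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// `ovn-nbctl lr-policy-list ovn_cluster_router` prints one policy per
	// line, with the priority as the first field.
	out, err := exec.Command("ovn-nbctl", "lr-policy-list", "ovn_cluster_router").Output()
	if err != nil {
		panic(err)
	}
	count := 0
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) > 0 && fields[0] == "501" {
			count++
		}
	}
	fmt.Println("remaining 501 lr policies:", count)
}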

Comment 1 Jaime Caamaño Ruiz 2021-08-19 12:07:29 UTC
What I can see from the logs is that the served-ns namespace was removed along with the 2017 pods contained within it. The log excerpt shows that around 40 gw routes/policies are removed per pod, and on each removal a lookup is made on the route table to check whether any routes remain for an outport/gateway with BFD references, so that BFD can be removed as well. Each such lookup takes 150ms. If we do the math it would take more than 3 hours; perhaps as the route table empties the lookups get faster, reducing that to the 2 hours observed. While 150ms is a lot of time, it cannot be ruled unreasonable without knowing how full the table was, considering that it is un-indexed.
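
As a quick sanity check of that estimate, here is the back-of-the-envelope math spelled out; the constants are simply the figures quoted above, not new measurements.

package main

import "fmt"

func main() {
	// Figures from this comment: 2017 pods deleted with the served-ns
	// namespace, ~40 gw routes/policies removed per pod, ~150 ms per
	// un-indexed route-table lookup on each removal.
	const (
		pods          = 2017
		routesPerPod  = 40
		lookupSeconds = 0.15
	)
	total := pods * routesPerPod * lookupSeconds          // total seconds spent in lookups
	fmt.Printf("~%.0f s (~%.1f h)\n", total, total/3600) // ~12102 s (~3.4 h)
}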

I am going to set this as blocker- given that there is no indication of a major issue yet.

Could this be reproduced on a clean run with a must-gather, which would allow us to see the state of the NB DB?

Comment 3 Surya Seetharaman 2021-09-28 08:57:12 UTC
Tim has a fix where we split this into per-pod handling and a new cache for routes, instead of using one giant cache for everything: https://github.com/openshift/ovn-kubernetes/pull/770/commits/985c99c42e05f58f5cee19716bc052f02d2e0094
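
For context, a minimal, hypothetical sketch of the general idea: shard the route cache per pod so the slow per-route cleanup for one pod does not serialize everything behind one big, globally-locked cache. Names and types below are illustrative only; the actual implementation is in the linked commit and differs in the details.

package main

import (
	"fmt"
	"sync"
)

type podRoutes struct {
	mu     sync.Mutex
	routes map[string]struct{} // keyed by gateway/outport (illustrative)
}

type routeCache struct {
	mu   sync.Mutex
	pods map[string]*podRoutes // keyed by namespace/name
}

func newRouteCache() *routeCache {
	return &routeCache{pods: map[string]*podRoutes{}}
}

// entry returns (creating if needed) the per-pod shard; only this short
// lookup holds the top-level lock, the per-pod work holds the pod's lock.
func (c *routeCache) entry(pod string) *podRoutes {
	c.mu.Lock()
	defer c.mu.Unlock()
	p, ok := c.pods[pod]
	if !ok {
		p = &podRoutes{routes: map[string]struct{}{}}
		c.pods[pod] = p
	}
	return p
}

func (c *routeCache) add(pod, route string) {
	p := c.entry(pod)
	p.mu.Lock()
	p.routes[route] = struct{}{}
	p.mu.Unlock()
}

// deletePod runs the (potentially slow) per-route cleanup under the pod's
// own lock, so other pods' gateway operations are not blocked behind it.
func (c *routeCache) deletePod(pod string, cleanup func(route string)) {
	p := c.entry(pod)
	p.mu.Lock()
	for r := range p.routes {
		cleanup(r) // e.g. delete the 501 lr policy / route from the NB DB
	}
	p.routes = map[string]struct{}{}
	p.mu.Unlock()

	c.mu.Lock()
	delete(c.pods, pod)
	c.mu.Unlock()
}

func main() {
	c := newRouteCache()
	c.add("served-ns/app-1", "172.30.0.1")
	c.deletePod("served-ns/app-1", func(r string) { fmt.Println("cleaning up", r) })
}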

Comment 4 Surya Seetharaman 2021-09-28 13:15:21 UTC
https://github.com/ovn-org/ovn-kubernetes/pull/2516

Comment 5 Surya Seetharaman 2021-10-20 16:04:54 UTC
https://github.com/openshift/ovn-kubernetes/pull/750 is merged; can we get this verified?

Comment 12 Anurag saxena 2022-01-31 15:51:05 UTC
@

Comment 22 Jose Castillo Lema 2022-03-08 15:00:31 UTC
Validated on a 4.10.0-rc.6 environment with 118 workers.
Created 2124 app pods in the same app namespace.
The deletion of the 501 lr policies was immediate after the deletion of the corresponding namespace.

Thanks,

Comment 25 errata-xmlrpc 2022-03-16 11:12:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0811

