Bug 1730345 - Memory Leak in neutron-keepalived-state-change when creating HA routers leads to OOM on all three controllers
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Assignee: Rodolfo Alonso
QA Contact: Eran Kuris
Reported: 2019-07-16 13:54 UTC by Sai Sindhur Malleni
Modified: 2019-10-23 08:30 UTC (History)

Last Closed: 2019-08-26 13:50:16 UTC




Links
Red Hat Bugzilla 1728188: [OSP13] Memory leak in Pyroute2 (priority: high, status: CLOSED, last updated: 2021-02-22 00:41:40 UTC)

Description Sai Sindhur Malleni 2019-07-16 13:54:59 UTC
Description of problem:
When running a scenario that spawns several hundred HA routers in OSP with ML2/OVS, we see the OOM killer kick in on the controllers. This takes down the entire overcloud, because RabbitMQ, Galera and Redis are killed for lack of memory.

The scenario created and listed routers 1500 times at a concurrency of 16 in Rally (orchestrated through browbeat).

Each iteration of the workflow:
1. Create a network
2. Create two subnets on the network
3. Create two routers
4. Attach each router to each subnet
5. List routers

This was repeated a total of 1500 times.
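
As a rough sketch of the loop above (not the actual Rally/browbeat scenario), the following Python drives a hypothetical client object. The method names `create_network`, `create_subnet`, `create_router`, `add_router_interface` and `list_routers` are assumptions made for illustration and are not a real SDK's API. `RecordingClient` is an in-memory stand-in, so the resource counts can be checked without a cloud. The sketch assumes one router is attached to each subnet and runs serially, whereas the real run used a concurrency of 16.

```python
import itertools


class RecordingClient:
    """In-memory stand-in for a Neutron client (hypothetical interface)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.networks, self.subnets, self.routers = [], [], []
        self.interfaces = []  # (router_id, subnet_id) pairs

    def create_network(self):
        net_id = next(self._ids)
        self.networks.append(net_id)
        return net_id

    def create_subnet(self, network_id):
        subnet_id = next(self._ids)
        self.subnets.append((subnet_id, network_id))
        return subnet_id

    def create_router(self, ha=True):
        router_id = next(self._ids)
        self.routers.append((router_id, ha))
        return router_id

    def add_router_interface(self, router_id, subnet_id):
        self.interfaces.append((router_id, subnet_id))

    def list_routers(self):
        return [router_id for router_id, _ha in self.routers]


def run_iteration(client):
    """One pass of the workflow: 1 network, 2 subnets, 2 HA routers."""
    network_id = client.create_network()
    subnet_ids = [client.create_subnet(network_id) for _ in range(2)]
    router_ids = [client.create_router(ha=True) for _ in range(2)]
    # Assumption: router i is attached to subnet i (one interface each).
    for router_id, subnet_id in zip(router_ids, subnet_ids):
        client.add_router_interface(router_id, subnet_id)
    return client.list_routers()


def run_scenario(client, times=1500):
    """Repeat the workflow `times` times and return the router count."""
    for _ in range(times):
        run_iteration(client)
    return len(client.routers)
```

With `times=1500` this leaves 1500 networks, 3000 subnets and 3000 HA routers behind, which matches the totals quoted in this report.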

During HA router creation (3000 routers in total, 1500*2), the RSS memory of neutron-keepalived-state-change grows to around 35G, and it keeps growing to about 85G while the routers are left in place, doing nothing.
In the graph below, routers were created from about 02:14 to 02:50 and then left alone. The memory consumed by the neutron-keepalived-state-change processes continues to grow after that.
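
For anyone who wants to check the aggregate figure without the dashboard, here is a minimal sketch that sums `VmRSS` from `/proc/<pid>/status` over matching processes. It assumes the string `neutron-keepalived-state-change` appears on each process's command line, and it is not the collector that produced the graphs in this report. Summed RSS counts shared pages once per process, so it can overstate real usage. The `proc_root` parameter is only there so the function can be pointed at a test directory.

```python
import os


def rss_kib(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or 0."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads and zombies have no VmRSS line


def total_rss_kib(name, proc_root="/proc"):
    """Sum VmRSS over processes whose command line contains `name`.

    Returns a (process_count, total_rss_kib) tuple.
    """
    count = total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name not in cmdline:
                continue
            with open(os.path.join(base, "status")) as f:
                total += rss_kib(f.read())
            count += 1
        except OSError:
            continue  # process exited while we were reading it
    return count, total
```

Calling `total_rss_kib("neutron-keepalived-state-change")` on a controller at intervals would show whether the total keeps climbing after router creation has stopped.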

Controller-0 example
https://snapshot.raintank.io/dashboard/snapshot/MKJ59cXKXwdOdmDup0xcAjhYilAJ1EnC

Overall free memory on controller-0 drops as well (the box has 256G of memory in total):
https://snapshot.raintank.io/dashboard/snapshot/sjhYDSAJYRKkJRxr5r930eCyn0MoaJ2j

This happens on all three controllers. At around 04:45 all three controllers hit OOM, leaving the cloud in an unrecoverable state.

Number of neutron-keepalived-state-change processes per controller (for a total of 3000 HA routers):

Controller-0: 1322
Controller-1: 1287
Controller-2: 1285
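
To put the numbers in this report in per-process terms, here is a back-of-the-envelope calculation. It assumes the ~35G and ~85G RSS figures are aggregates for controller-0 alone (the graph linked above is the controller-0 example) and treats "G" as GiB. Both are assumptions, so the results are rough averages only.

```python
# Process counts per controller, as listed above.
PROCESS_COUNTS = {"controller-0": 1322, "controller-1": 1287, "controller-2": 1285}
HA_ROUTERS = 3000


def mib_per_process(aggregate_gib, processes):
    """Average RSS per process in MiB, given an aggregate in GiB."""
    return aggregate_gib * 1024 / processes


total_processes = sum(PROCESS_COUNTS.values())
processes_per_router = total_processes / HA_ROUTERS

# Assumption: the ~35G and ~85G figures are controller-0 aggregates.
after_creation = mib_per_process(35, PROCESS_COUNTS["controller-0"])
while_idle = mib_per_process(85, PROCESS_COUNTS["controller-0"])

print(f"{total_processes} processes, {processes_per_router:.2f} per router")
print(f"controller-0: {after_creation:.0f} MiB -> {while_idle:.0f} MiB per process")
```

Under those assumptions, each process averages a few tens of MiB, so the total is driven by the number of processes multiplied by steady per-process growth.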

Version-Release number of selected component (if applicable):
OSP 13, z6

How reproducible:
100% when creating a large number of HA routers

Steps to Reproduce:
1. Deploy OSP 13
2. Create close to 3k HA routers
3. Wait for OOM

Actual results:
OOM on all three controllers, and the cloud becomes unrecoverable

Expected results:
Should be able to support 3000 routers without a memory leak

Additional info:

Comment 6 Sai Sindhur Malleni 2019-08-19 16:13:00 UTC
Sure, so this is a WONTFIX because of architectural reasons?

Comment 8 Nate Johnston 2019-08-26 13:50:16 UTC
Marking this closed as the system is working as designed, and improvements will require internal rearchitecting that is outside the scope of a non-RFE bug.

Comment 9 Sai Sindhur Malleni 2019-10-23 08:30:49 UTC
I will raise an RFE separately.

