Bug 1730345

Summary: Memory Leak in neutron-keepalived-state-change when creating HA routers leads to OOM on all three controllers
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-neutron
Assignee: Rodolfo Alonso <ralonsoh>
Status: CLOSED WONTFIX
QA Contact: Eran Kuris <ekuris>
Severity: high
Priority: unspecified
Version: 13.0 (Queens)
CC: amuller, bperkins, chrisw, njohnston, racedo, ralonsoh, scohen
Hardware: x86_64   
OS: Linux   
Last Closed: 2019-08-26 13:50:16 UTC
Type: Bug

Description Sai Sindhur Malleni 2019-07-16 13:54:59 UTC
Description of problem:
When running a scenario that spawns several hundred HA routers on OSP with ML2/OVS, we see the OOM killer kick in on the controllers. This takes down the overcloud completely, because RabbitMQ, Galera and Redis are killed for lack of memory.

The scenario creates and lists routers 1500 times at a concurrency of 16 in Rally (orchestrated through Browbeat).

Each iteration of the workflow does the following:
1. Create a network
2. Create two subnets on the network
3. Create two routers
4. Attach each router to each subnet
5. List routers

This is repeated a total of 1500 times.
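For readers who want to see the workflow outside of Rally, one iteration can be sketched roughly as below. This is not the Rally scenario that was actually run. It assumes `net` behaves like the openstacksdk `conn.network` proxy, that each router is attached to one of the two subnets, and the resource names and 10.x.y.0/24 addressing are hypothetical.

```python
import ipaddress
from concurrent.futures import ThreadPoolExecutor


def run_iteration(net, index):
    """One iteration: 1 network, 2 subnets, 2 HA routers, attach, list.

    `net` is assumed to behave like openstacksdk's `conn.network` proxy.
    """
    network = net.create_network(name=f"leak-net-{index}")

    # Hypothetical addressing: two /28 subnets carved from a per-iteration /24.
    base = ipaddress.ip_network(f"10.{index // 256}.{index % 256}.0/24")
    cidrs = list(base.subnets(new_prefix=28))[:2]

    subnets = [
        net.create_subnet(network_id=network.id, ip_version=4, cidr=str(cidr))
        for cidr in cidrs
    ]
    routers = [
        net.create_router(name=f"leak-router-{index}-{n}", is_ha=True)
        for n in range(2)
    ]
    # Assumption: router N is attached to subnet N (one interface per router).
    for router, subnet in zip(routers, subnets):
        net.add_interface_to_router(router, subnet_id=subnet.id)

    listed = list(net.routers())
    return routers, listed


def run_scenario(net, times=1500, concurrency=16):
    """Repeat the iteration `times` times with `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda i: run_iteration(net, i), range(times)))
```

With a real cloud, `net` would be something like `openstack.connect(cloud="overcloud").network`, where the cloud name is also an assumption. Sharing one connection across 16 threads may not be appropriate in every environment.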

During HA router creation (3000 routers total, as 1500*2), the RSS memory of neutron-keepalived-state-change grows to around 35G, and it keeps growing to about 85G while the routers are left in place, doing nothing.
In the graph below, routers were created from about 02:14 to 02:50 and then left alone. The memory consumed by the neutron-keepalived-state-change processes continues to grow after that.

Controller-0 example
https://snapshot.raintank.io/dashboard/snapshot/MKJ59cXKXwdOdmDup0xcAjhYilAJ1EnC

Overall free memory on controller-0 drops as well (the box has 256G of memory in total):
https://snapshot.raintank.io/dashboard/snapshot/sjhYDSAJYRKkJRxr5r930eCyn0MoaJ2j
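The graphs above come from the monitoring dashboard. A comparable figure could be sampled directly on a controller with a sketch like the one below. It assumes a Linux /proc filesystem and that every state-change process carries `neutron-keepalived-state-change` in its command line. The function names are illustrative. Summing per-process RSS counts shared pages more than once, so the total is an upper bound rather than an exact figure.

```python
import os

# Assumption: this string appears in the cmdline of every state-change process.
PROCESS_NAME = "neutron-keepalived-state-change"


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in KiB) from /proc/<pid>/status text, or 0."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def sample_rss(proc_root="/proc", name=PROCESS_NAME):
    """Return (process_count, total_rss_kib) for processes matching `name`."""
    count, total_kib = 0, 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name not in cmdline:
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                total_kib += parse_vmrss_kib(f.read())
            count += 1
        except OSError:
            continue  # the process exited while we were reading it
    return count, total_kib


if __name__ == "__main__":
    n, kib = sample_rss()
    print(f"{n} processes, {kib / 1024 / 1024:.2f} GiB RSS in total")
```

Running this periodically after router creation has finished would show whether the total keeps growing while the routers sit idle.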

This happens on all three controllers. At around 04:45 all three controllers hit OOM, leaving the cloud in an unrecoverable state.

Number of neutron-keepalived-state-change processes per controller (for total 3000 HA routers)

Controller-0: 1322
Controller-1: 1287
Controller-2: 1285
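To put the numbers in proportion, the average memory per process can be estimated from the figures in this report. The sketch below assumes that the 35G and 85G values are RSS totals for controller-0 alone and that "G" means GiB. Neither is stated explicitly above, so treat the results as rough estimates.

```python
GIB_IN_MIB = 1024


def mib_per_process(total_gib, process_count):
    """Average RSS per process in MiB, given a total in GiB."""
    return total_gib * GIB_IN_MIB / process_count


# Process counts reported per controller.
counts = {"controller-0": 1322, "controller-1": 1287, "controller-2": 1285}

print(sum(counts.values()))                                  # 3894
print(round(mib_per_process(35, counts["controller-0"]), 1)) # 27.1
print(round(mib_per_process(85, counts["controller-0"]), 1)) # 65.8
```

Under those assumptions, each process averages roughly 27 MiB at the end of router creation and roughly 66 MiB by the time the total reaches 85G.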

Version-Release number of selected component (if applicable):
OSP 13, z6

How reproducible:
100% when creating a large number of HA routers

Steps to Reproduce:
1. Deploy OSP 13
2. Create close to 3k HA routers
3. Wait for OOM

Actual results:
OOM on all three controllers, and the cloud becomes unrecoverable

Expected results:
Should be able to support 3000 routers without a memory leak

Additional info:

Comment 6 Sai Sindhur Malleni 2019-08-19 16:13:00 UTC
Sure, so this is a WONTFIX because of architectural reasons?

Comment 8 Nate Johnston 2019-08-26 13:50:16 UTC
Marking this closed as the system is working as designed, and improvements will require internal rearchitecting that is outside the scope of a non-RFE bug.

Comment 9 Sai Sindhur Malleni 2019-10-23 08:30:49 UTC
I will raise an RFE separately.