Description of problem:
When running a scenario that spawns several hundred HA routers in OSP with ML2/OVS, we see the OOM killer kick in on the controllers. This leads to total destruction of the overcloud, as RabbitMQ, Galera and Redis are killed due to lack of memory.

The scenario created and listed routers 1500 times at a concurrency of 16 in Rally (orchestrated through Browbeat). Each iteration of the workflow did the following:
1. Create a network
2. Create two subnets on the network
3. Create two routers
4. Attach each router to each subnet
5. List routers

During HA router creation (3000 routers total, as 1500*2), the RSS memory of the neutron-keepalived-state-change processes reaches around 35G, and then continues to grow to about 85G while the routers are left in place (doing nothing).

In the graph below, routers were created from about 02:14 to 02:50 and then left in place. The memory consumed by the neutron-keepalived-state-change processes continues to grow after that.

Controller-0 example:
https://snapshot.raintank.io/dashboard/snapshot/MKJ59cXKXwdOdmDup0xcAjhYilAJ1EnC

The overall free memory on controller-0 decreases as well (256G total memory on the box):
https://snapshot.raintank.io/dashboard/snapshot/sjhYDSAJYRKkJRxr5r930eCyn0MoaJ2j

This happens on all three controllers. At around 04:45 all three controllers hit OOM, leaving the cloud in an unrecoverable state.

Number of neutron-keepalived-state-change processes per controller (for a total of 3000 HA routers):
Controller-0: 1322
Controller-1: 1287
Controller-2: 1285

Version-Release number of selected component (if applicable):
OSP 13, z6

How reproducible:
100% when creating a large number of HA routers

Steps to Reproduce:
1. Deploy OSP 13
2. Create close to 3k routers
3. Wait for OOM

Actual results:
OOM on all three controllers and the cloud becomes unrecoverable

Expected results:
Should be able to support 3000 routers without a memory leak

Additional info:
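For anyone who wants to track the growth without the dashboards, the following is a minimal sketch of how the process count and combined RSS could be sampled on a controller. It is not the tooling used for the graphs above. The helper names `summarize_rss` and `sample_controller` are hypothetical, and the sketch assumes that the processes are visible from the host in `ps -eo rss,args` output and that their command line contains the string neutron-keepalived-state-change.

```python
import subprocess

# Assumption: the state-change monitor shows up under this name in `ps` args.
PROCESS_NAME = "neutron-keepalived-state-change"


def summarize_rss(ps_output, name=PROCESS_NAME):
    """Return (process_count, total_rss_kib) for command lines containing name.

    ps_output is the text printed by `ps -eo rss,args`, where the first
    column is the resident set size in KiB.
    """
    count = 0
    total_kib = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that is not "<rss> <args>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kib, args = parts
        if name in args:
            count += 1
            total_kib += int(rss_kib)
    return count, total_kib


def sample_controller():
    """Run ps on the local host and summarize the matching processes."""
    result = subprocess.run(
        ["ps", "-eo", "rss,args"],
        capture_output=True, text=True, check=True,
    )
    return summarize_rss(result.stdout)


if __name__ == "__main__":
    count, total_kib = sample_controller()
    total_gib = total_kib / (1024 * 1024)
    average_mib = (total_kib / 1024 / count) if count else 0.0
    print("processes: %d" % count)
    print("total RSS: %.2f GiB" % total_gib)
    print("average RSS per process: %.1f MiB" % average_mib)
```

Running this periodically on each controller (for example from cron) while the routers sit idle would show whether the combined RSS keeps climbing after router creation has finished.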
Sure, so this is a WONTFIX because of architectural reasons?
Marking this closed as the system is working as designed, and improvements will require internal rearchitecting that is outside the scope of a non-RFE bug.
I will raise an RFE separately.