Bug 1320724

Summary: Nova does not clean up deleted instances which severely impacts horizon performance
Product: Red Hat OpenStack Reporter: Jon Jozwiak <jjozwiak>
Component: python-django-horizon Assignee: Itxaka <iserrano>
Status: CLOSED DUPLICATE QA Contact: Ido Ovadia <iovadia>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo) CC: aortega, athomas, berrange, dasmith, david.costakos, eglynn, iserrano, jduncan, jjozwiak, kchamart, mrunge, rdopiera, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: --- Keywords: ZStream
Target Release: 8.0 (Liberty) Flags: mrunge: needinfo? (jjozwiak)
iserrano: needinfo? (jjozwiak)
iserrano: needinfo? (jjozwiak)
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-22 08:21:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Database Cleanup Script none

Description Jon Jozwiak 2016-03-23 20:09:12 UTC
Created attachment 1139744
Database Cleanup Script

Description of problem:
Nova has no process to clean up records for deleted instances from the database, so the number of records keeps growing. A side effect is that Horizon login times grow with it. With 45,000 deleted instance records in the Nova database, logging in takes longer than 1 minute, which results in HAProxy timeouts.

Note it appears this is being addressed in Mitaka:
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/purge-deleted-instances-cmd.html

Version-Release number of selected component (if applicable):
RHEL OSP 7 (Deployed via Director)
openstack-nova-api-2015.1.2-13.el7ost.noarch
openstack-dashboard-2015.1.2-4.el7ost.noarch

How reproducible:
Not tested, but you can likely reproduce this by creating and deleting 45,000 VMs. Rally was used to generate and delete the VMs. If you use Rally for benchmarking for a few weeks, you'll likely run into this problem.

Steps to Reproduce:
1. Time your login to Horizon 
2. Use Rally to gradually create/delete 45,000 VMs
3. Try to log in to Horizon again and note the time (and potentially the Gateway Timeout from HAProxy)
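
The before/after comparison in the steps above can be sketched as a small timing helper. This is only an illustration: fake_login is a hypothetical stand-in for the real login request, and the 60-second proxy timeout is an assumption made for the example, not a value taken from this deployment.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def would_time_out(elapsed_seconds, proxy_timeout_seconds):
    """True if the request took longer than the proxy is willing to wait."""
    return elapsed_seconds > proxy_timeout_seconds


def fake_login():
    # Hypothetical stand-in for the real login request to Horizon.
    time.sleep(0.05)
    return "ok"


if __name__ == "__main__":
    elapsed, result = time_call(fake_login)
    # 60 seconds is an assumed proxy timeout, used only for illustration.
    print(f"login took {elapsed:.2f}s, timed out: {would_time_out(elapsed, 60)}")
```

Replacing fake_login with a function that performs the actual login request would give the two numbers to compare in steps 1 and 3.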

Actual results:
Logging in to Horizon takes longer than 1 minute and the request runs into HAProxy timeouts.

Expected results:
Ideally, the SQL queries used to calculate the data for the Overview screen would be tuned so that they still function with a database this size. However, just removing the deleted instances also works.

Additional info:
I've attached a DB cleanup script used to reduce the instance count.
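
The attachment itself is not reproduced here. As a rough illustration of the general idea only, the sketch below purges soft-deleted rows in batches from an in-memory SQLite table. The one-table schema, the convention that a non-zero deleted value marks a soft-deleted row, and the function names are all simplifying assumptions; a real Nova database has related tables that would need the same treatment.

```python
import sqlite3


def make_demo_db(live, deleted):
    """Build an in-memory table with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL)"
    )
    # Assumption for this sketch: deleted = 0 means live, non-zero means soft-deleted.
    rows = [(0,)] * live + [(1,)] * deleted
    conn.executemany("INSERT INTO instances (deleted) VALUES (?)", rows)
    conn.commit()
    return conn


def purge_deleted(conn, batch_size=1000):
    """Delete soft-deleted rows in batches; return how many were removed."""
    removed = 0
    while True:
        cur = conn.execute(
            "DELETE FROM instances WHERE id IN ("
            "  SELECT id FROM instances WHERE deleted != 0 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return removed
        removed += cur.rowcount
```

Deleting in batches keeps each transaction short, which matters more on a production database than in this toy example.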

Comment 2 Radomir Dopieralski 2016-04-01 11:33:51 UTC
I think this is a Horizon bug. Soft-deleted instances should have no effect on displaying information about existing instances, as the filtering would happen on the database side. It certainly shouldn't take more than a minute to filter 45k entries on a boolean field.
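
To illustrate the point about database-side filtering, here is a minimal sketch that loads 45,000 soft-deleted rows plus a handful of live ones into an in-memory SQLite table and selects only the live rows. The schema, index name, and helper names are hypothetical and much simpler than Nova's real tables, so the timing it prints is indicative only.

```python
import sqlite3
import time


def build_db(live, deleted):
    """Create a hypothetical instances table with an index on the deleted flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances ("
        "id INTEGER PRIMARY KEY, name TEXT, deleted INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX instances_deleted_idx ON instances (deleted)")
    rows = [(f"live-{i}", 0) for i in range(live)]
    rows += [(f"gone-{i}", 1) for i in range(deleted)]
    conn.executemany("INSERT INTO instances (name, deleted) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def list_live(conn):
    """Let the database do the filtering: only non-deleted rows come back."""
    return conn.execute(
        "SELECT id, name FROM instances WHERE deleted = 0"
    ).fetchall()


if __name__ == "__main__":
    conn = build_db(live=10, deleted=45000)
    start = time.perf_counter()
    live = list_live(conn)
    print(len(live), "live rows in", time.perf_counter() - start, "seconds")
```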

Comment 3 Radomir Dopieralski 2016-04-01 14:34:23 UTC
For the soft-deleted instances to negatively affect the performance of Horizon, it would have to specifically ask Nova to provide the list of deleted instances. If Horizon indeed does that, this is a bug that should be solved in Horizon. I'm re-categorizing this bug to Horizon, so that the developers can check for such queries.

Comment 4 Matthias Runge 2016-04-04 07:53:04 UTC
Which page are you visiting in Horizon, where you were spotting the timeout?

Comment 6 Itxaka 2016-04-18 13:51:09 UTC
I can confirm that with around 45,000 deleted instances, accessing the admin->overview page takes around 21 seconds on a local network (so there are no network delays that could affect the timing).

The main issue seems to come from a call to novaclient.usage.list in which we ask for a detailed view, which includes all the deleted instances.

Unfortunately, the overview of the admin needs this data to provide a proper view.

There is an upstream patch for Neutron that allows configuring the overview range to 1 day, which reduces the issues that occur with a large number of deleted instances.
https://review.openstack.org/#/c/238204/
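
As a rough illustration of why a detailed usage listing grows with deleted instances and why a shorter range helps, the sketch below selects every instance that existed at some point in a requested time range. This is not Nova's or Horizon's actual code: the record layout, the function names, and the assumption that the listing reports every instance that overlapped the range are all hypothetical.

```python
from datetime import datetime, timedelta


def instances_in_window(instances, start, end):
    """Return instances that existed at any point in [start, end].

    An instance counts if it was launched no later than `end` and was
    either never deleted or deleted no earlier than `start`.
    """
    return [
        inst for inst in instances
        if inst["launched_at"] <= end
        and (inst["deleted_at"] is None or inst["deleted_at"] >= start)
    ]


def demo_counts(now):
    # Hypothetical data: one running instance, plus 30 short-lived instances,
    # one per day over the last 30 days, each deleted an hour after launch.
    instances = [{"launched_at": now - timedelta(days=40), "deleted_at": None}]
    for day in range(1, 31):
        launched = now - timedelta(days=day) + timedelta(hours=1)
        instances.append(
            {"launched_at": launched, "deleted_at": launched + timedelta(hours=1)}
        )
    month = instances_in_window(instances, now - timedelta(days=30), now)
    one_day = instances_in_window(instances, now - timedelta(days=1), now)
    return len(month), len(one_day)


if __name__ == "__main__":
    print(demo_counts(datetime(2016, 4, 18, 12, 0)))
```

With this made-up data, the 30-day range returns 31 records while the 1-day range returns 2, so the amount of data to process scales with how many deleted instances fall inside the range rather than with how many are currently running.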

Comment 7 Itxaka 2016-04-18 14:09:25 UTC
But in the end, this looks like it's the intended output: a display of all instances, even deleted ones.

Comment 8 Itxaka 2016-04-22 08:21:56 UTC
I'm closing this and following the issue on bz 1329414, which has opened the issue upstream.

*** This bug has been marked as a duplicate of bug 1329414 ***