Created attachment 1139744
Database Cleanup Script

Description of problem:
Nova has no process to clean up records for deleted instances from the database, so the instance tables will just continue to grow. A side effect is that logging in to Horizon gets progressively slower. With 45,000 deleted instance records in the Nova database, it takes longer than 1 minute to log in, which results in HAProxy timeouts.

Note it appears this is being addressed in Mitaka:
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/purge-deleted-instances-cmd.html

Version-Release number of selected component (if applicable):
RHEL OSP 7 (Deployed via Director)
openstack-nova-api-2015.1.2-13.el7ost.noarch
openstack-dashboard-2015.1.2-4.el7ost.noarch

How reproducible:
Not tested, but you can likely reproduce it by creating and deleting 45,000 VMs. Rally was used to generate and delete the VMs. If you use Rally for benchmarking for a few weeks, you'll likely run into this problem.

Steps to Reproduce:
1. Time your login to Horizon
2. Use Rally to gradually create/delete 45,000 VMs
3. Try to log in to Horizon again and note the time (and potentially the Gateway Timeout from HAProxy)

Actual results:

Expected results:
Ideally the SQL queries used to calculate the data for the Overview screen would be tuned so they could still function with a database this size. However, just removing deleted instances also works.

Additional info:
I've attached a DB cleanup script used to reduce the instance count.
I think this is a Horizon bug. Soft-deleted instances should have no effect on displaying information about existing instances, as the filtering would happen on the database side. It certainly shouldn't take more than a minute to filter 45k entries on a boolean field.
For the soft-deleted instances to negatively affect the performance of Horizon, it would have to specifically ask Nova to provide the list of deleted instances. If Horizon indeed does that, this is a bug that should be solved in Horizon. I'm re-categorizing this bug to Horizon, so that the developers can check for such queries.
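To illustrate the point about database-side filtering, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (an `instances` table with a `deleted` flag column) is a simplified assumption for illustration, not the actual Nova schema, and the helper names `build_demo_db` and `list_active` are hypothetical.

```python
import sqlite3


def build_demo_db(active=5, soft_deleted=45000):
    """Create an in-memory table with a few live rows and many soft-deleted ones.

    The schema is a simplified stand-in, not the real Nova schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances ("
        "id INTEGER PRIMARY KEY, "
        "hostname TEXT NOT NULL, "
        "deleted INTEGER NOT NULL DEFAULT 0)"
    )
    rows = [(f"vm-{i}", 0) for i in range(active)]
    rows += [(f"old-{i}", 1) for i in range(soft_deleted)]
    conn.executemany(
        "INSERT INTO instances (hostname, deleted) VALUES (?, ?)", rows
    )
    # An index on the flag lets the database skip soft-deleted rows cheaply.
    conn.execute("CREATE INDEX idx_instances_deleted ON instances (deleted)")
    conn.commit()
    return conn


def list_active(conn):
    """Return hostnames of rows that are not soft-deleted, filtered in SQL."""
    cur = conn.execute(
        "SELECT hostname FROM instances WHERE deleted = 0 ORDER BY id"
    )
    return [row[0] for row in cur]
```

Calling `list_active(build_demo_db())` returns only the live rows; the 45,000 soft-deleted rows never leave the database, which is why they should only matter if the caller explicitly asks for them.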
Which page in Horizon were you visiting when you saw the timeout?
I can confirm that with around 45,000 deleted instances, a request to the admin->overview page takes around 21 seconds on a local network (so no network delays that could affect the timing).

The main issue seems to come from a call we make to novaclient.usage.list in which we ask for a detailed view, which includes all the deleted instances. Unfortunately, the admin overview needs this data to provide a proper view.

There is a patch upstream for Neutron that allows configuring the overview range to 1 day, thus diminishing the issues that occur with a large number of deleted instances.
https://review.openstack.org/#/c/238204/
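As a rough sketch of what limiting the overview range means for the usage call: the query window shrinks to the last day, so only instances that existed within that window are reported. The helper names `overview_window` and `fetch_usage` and the `days` parameter are hypothetical and are not taken from the upstream patch; `usage_manager` is assumed to be an object exposing novaclient's `usage.list(start, end, detailed=...)` call.

```python
from datetime import datetime, timedelta


def overview_window(now, days=1):
    """Return the (start, end) datetimes covering the last `days` days."""
    return now - timedelta(days=days), now


def fetch_usage(usage_manager, now, days=1):
    """Request detailed usage for a bounded window instead of a long range.

    `usage_manager` is assumed to behave like novaclient's usage manager.
    """
    start, end = overview_window(now, days)
    return usage_manager.list(start, end, detailed=True)
```

With a real client this would be called as `fetch_usage(nova.usage, datetime.utcnow())`, where `nova` is an already authenticated novaclient client object (an assumption here).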
But in the end, this looks like it is the intended output: a display of all instances, even deleted ones.
I'm closing this and following the issue in bz 1329414, which has opened the issue upstream.

*** This bug has been marked as a duplicate of bug 1329414 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days