Description of problem:
When compute nodes are removed from a currently running cloud, launching new VMs fails.

Version-Release number of selected component (if applicable):
Folsom

How reproducible:
Always

Steps to Reproduce:
1. Create a cloud with X compute nodes
2. Use the cloud: launch/terminate VMs, run 'nova-manage service list', etc.
3. Disable N compute nodes
4. Try to use the cloud and watch VMs fail to start.

Actual results:
VMs enter error state, fail to start

Expected results:
VMs start!

Additional info:
After I removed 24 compute nodes from our 64-node cloud (turned off and disabled all openstack-* services), I ran 'mysql -B -e "delete from nova.services;"' to clean out the services table. I could only get the cloud to launch VMs again after running 'mysql -B -e "delete from nova.compute_nodes;"' and letting the table re-populate.
Can you share your configuration? The choice of scheduler driver, and in the case of the filter scheduler, which filters are enabled, will affect the expected behavior here.
Created attachment 747763 [details] nova.conf from OS1
nova.conf attached - I haven't twiddled any of the scheduler knobs except for RAM overcommit (2x).
So here's how this is *supposed* to work.

The default set of filters used by the filter scheduler includes the ComputeFilter. This filter is supposed to filter out compute nodes that are not currently up. Being "up" is determined by making a call to the internal servicegroup API, which has several optional backends; the default backend is database driven. If you turn on debug logging, you should see an entry when it checks whether a service is up, from nova/servicegroup/drivers/db.py:

    LOG.debug('DB_Driver.is_up last_heartbeat = %(lhb)s elapsed = %(el)s',
              {'lhb': str(last_heartbeat), 'el': str(elapsed)})

Check the contents of the 'services' table in the nova database. You should notice that the 'updated_at' column is periodically updated; this is the value used to determine whether the service is up. The amount of time to wait before considering a service down is 60 seconds by default, and can be changed via the service_down_time configuration option in nova.conf.

So, with all of that said ... how long are you waiting after shutting down the compute node? It is expected to fail within the first minute, unless you decrease the setting for the option I mentioned.
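To make the heartbeat logic above concrete, here is a minimal Python sketch of the check the database servicegroup driver performs. The function name and constant are illustrative, not nova's actual code; the only facts taken from the discussion are that 'updated_at' acts as the last heartbeat and that the default timeout is 60 seconds.

```python
from datetime import datetime, timedelta

# Illustrative stand-in for nova's service_down_time option (default 60s).
SERVICE_DOWN_TIME = 60

def is_up(last_heartbeat, now=None):
    """A service is considered up if its last heartbeat (the
    services.updated_at column) is no older than SERVICE_DOWN_TIME."""
    now = now or datetime.utcnow()
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= SERVICE_DOWN_TIME

# A node that reported 30 seconds ago is up; one silent for two
# minutes is filtered out by ComputeFilter and gets no new VMs.
now = datetime(2013, 4, 1, 12, 0, 0)
print(is_up(now - timedelta(seconds=30), now))   # True
print(is_up(now - timedelta(seconds=120), now))  # False
```

This is also why cleanly shut-down compute nodes should normally drop out of scheduling on their own after service_down_time elapses, without any database edits.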
I was only able to recreate this issue by deleting all entries from the services table as indicated in the bug description (mysql -B -e "delete from nova.services;"). The reason this causes a problem is that the non-enforced foreign key from compute_nodes.service_id is now pointing to a missing record. Dan, I just want to verify that you deleted all services from the table and not just the services for the deleted compute hosts, correct? May I ask where you read about this procedure? The only documentation that I could find recommended deleting only the records from the services and compute_nodes tables that match the deleted hosts.
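The dangling-reference failure described here can be sketched with sqlite3 and toy schemas (these are illustrative stand-ins, not nova's real table definitions): after a blanket delete from services, every compute_nodes row points at a missing record, so a join finds no usable host.

```python
import sqlite3

# Toy stand-ins for nova's services and compute_nodes tables; the
# foreign key is deliberately not enforced, mirroring the non-enforced
# relationship described in the comment above.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE services (id INTEGER PRIMARY KEY, host TEXT);
    CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, service_id INTEGER);
    INSERT INTO services VALUES (1, 'node-a'), (2, 'node-b');
    INSERT INTO compute_nodes VALUES (10, 1), (20, 2);
""")

# Blanket delete, as in the bug report: compute_nodes rows survive but
# their service_id values now reference nothing.
db.execute("DELETE FROM services")
rows = db.execute("""
    SELECT cn.id FROM compute_nodes cn
    JOIN services s ON s.id = cn.service_id
""").fetchall()
print(rows)  # [] -- no schedulable hosts left
```

Deleting only the rows for the removed hosts, from both tables, would leave the surviving compute_nodes rows with valid service_id references, which matches the documented procedure mentioned above.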
You are correct, I did a complete delete from the nova.services table. The schema should probably be updated to prevent that sort of action by people like me with a little db knowledge. :-D (A little knowledge can be a dangerous thing.) Thanks
Based on the comments here, it sounds like this problem was due to hacking the db directly.