RDO tickets are now tracked in Jira https://issues.redhat.com/projects/RDO/issues/
Bug 1533196 - nova-scheduler reports dead compute nodes but nova-compute is enabled and up
Summary: nova-scheduler reports dead compute nodes but nova-compute is enabled and up
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: RDO
Classification: Community
Component: openstack-nova
Version: Ocata
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: trunk
Assignee: Eoghan Glynn
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-01-10 17:18 UTC by David Manchado
Modified: 2018-01-12 18:04 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-12 00:35:04 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1742827 0 None None None 2018-01-12 00:35:04 UTC

Description David Manchado 2018-01-10 17:18:34 UTC
Description of problem:
We are seeing that nova-scheduler is removing compute nodes because it considers them dead, but 'openstack compute service list' reports nova-compute as up and running.
We can see in nova-scheduler entries with the following pattern:
- Removing dead compute node XXX from scheduler
- Filter ComputeFilter returned 0 hosts
- Filtering removed all hosts for the request with instance ID '11feeba9-f46c-416d-a97e-7c0c9d565b5a'. Filter results: ['AggregateInstanceExtraSpecsFilter: (start: 19, end: 2)', 'AggregateCoreFilter: (start: 2, end: 2)', 'AggregateDiskFilter: (start: 2, end: 2)', 'AggregateRamFilter: (start: 2, end: 2)', 'RetryFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 0)']
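The mismatch above can be cross-checked with standard commands. A minimal sketch, assuming CLI access on a controller and the default RDO log path (/var/log/nova/nova-scheduler.log); paths may differ per deployment:

```shell
# 1) API view: nova-compute should show State "up" and Status "enabled".
openstack compute service list --service nova-compute

# 2) Scheduler view: look for the "dead compute node" pattern around the
#    time of a failed boot request.
grep 'Removing dead compute node' /var/log/nova/nova-scheduler.log
```

If (1) reports the service up while (2) keeps firing for the same host, the scheduler's view of the host is stale rather than the service itself being down.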


Version-Release number of selected component (if applicable):
Ocata


How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:
Instances are not being spawned and report 'No valid host was found' because ComputeFilter removes all remaining hosts.

Expected results:


Additional info:
This has been happening for a week.
We did an upgrade from Newton three weeks ago.
We have also done a minor update and the issue still persists.

Nova related RPMs
openstack-nova-scheduler-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-novncproxy-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-cert-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-console-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-conductor-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-common-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-compute-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-placement-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
puppet-nova-10.4.2-0.20180102233330.f4bc1f0.el7.centos.noarch
openstack-nova-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
python-nova-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch

Comment 1 Alan Pevec 2018-01-10 17:44:54 UTC
@lyarwood @eglynn this is reported in RDO Cloud, how do you want to handle it?
If it's not a packaging issue it should be moved upstream, but we don't want it to get lost there.

Comment 3 melanie witt 2018-01-11 23:43:05 UTC
It's difficult to guess what's going wrong without being able to see the nova-* service logs at the time the failure occurs. I'll try to make a guess based on the info you've given.

You mentioned the cluster was upgraded to Ocata three weeks ago and this problem has been happening for one week. Have there been any changes in the compute hosts? That is, have any been swapped in or out of the cluster?

Starting in Ocata, there's a concept in Nova called Cells v2, and at deployment time there are a few nova-manage commands that are required [1] to set up the mappings needed for API-level services like the scheduler to find the compute hosts in a cell.

If you make any changes to compute hosts, namely adding new ones, you have to run 'nova-manage cell_v2 discover_hosts' in order to make them visible to the API and allow scheduling to them.

The cluster will have 3 databases: nova_cell0, nova_api, and nova. In the nova_api database, you should see all of your compute hosts in the host_mappings table. If any are missing, you need to run the discover_hosts command and then you should see them appear in host_mappings. In the nova_api database, you should see two cells in the cell_mappings table: one called 'cell0' (for instances that failed to schedule) containing its database connection and another probably called 'cell1' containing its database and message queue connections.
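The checks above can be sketched as shell commands. This is a hypothetical example assuming MySQL/MariaDB, the default database names from this comment, and the Ocata nova_api schema; verify column names against your own deployment:

```shell
# List the compute hosts the API layer knows about; a host missing here
# is invisible to the scheduler.
mysql nova_api -e "SELECT cell_id, host FROM host_mappings;"

# There should be two cells: 'cell0' (failed-to-schedule instances) and
# typically 'cell1' with its database and message queue connections.
mysql nova_api -e "SELECT name, transport_url, database_connection FROM cell_mappings;"

# If a compute host is missing from host_mappings, discover and map it.
nova-manage cell_v2 discover_hosts --verbose
```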

If all looks fine there, we'll need to take a look at the nova-* service logs for the failure to dig deeper.

[1] https://docs.openstack.org/nova/latest/user/cells.html#upgrade-minimal

Comment 4 Alan Pevec (Fedora) 2018-01-12 00:35:04 UTC
Since it's not an RDO packaging issue, moved upstream: https://bugs.launchpad.net/nova/+bug/1742827

