Bug 1643147
Summary: | Unable to spawn an instance on a new env due to volume retrieve failure | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz>
Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute>
Status: | CLOSED DUPLICATE | QA Contact: | OSP DFG:Compute <osp-dfg-compute>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 14.0 (Rocky) | CC: | abishop, berrange, cschwede, dasmith, eglynn, fpercoco, jhakimra, kchamart, mbooth, mschuppe, rscarazz, sbauza, sferdjao, sgordon, vromanso
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-11-01 15:40:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Raoul Scarazzini
2018-10-25 15:17:44 UTC
The sosreports don't contain any cinder or nova logs. Not sure what happened, but could you capture them manually?

I examined the logs on the overcloud. Cinder's logs show the volume 0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5 is successfully created, but it is then deleted a few seconds later. I believe this is triggered by nova cleaning things up after the instance could not be spawned.

I then found this in the nova-scheduler.log on controller-0:

    2018-10-25 14:21:59.994 26 ERROR nova.scheduler.client.report [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278713a7e39a - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={MEMORY_MB:2048, VCPU:2}, traits=[], aggregates=[]) Got 500:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>500 Internal Server Error</title>
    </head><body>
    <h1>Internal Server Error</h1>
    <p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p>
    <p>Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error.</p>
    <p>More information about this error may be available in the server error log.</p>
    </body></html>
    2018-10-25 14:21:59.995 26 INFO nova.scheduler.manager [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278

This is during the time period between when cinder created and later deleted the volume. Moving over to Compute as this does not appear to be a cinder issue.
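For reference, the request the scheduler report client fails on can be repeated against the placement API directly to see which backend returns the 500. The following is only a minimal sketch of that query; the endpoint URL and token are placeholders, not values taken from this environment:

    # Minimal sketch: repeat the scheduler's allocation-candidates query against
    # the placement API. PLACEMENT_URL and TOKEN below are placeholders, not
    # values from this bug.
    import requests

    PLACEMENT_URL = "http://<internal-api-vip>:8778"  # placeholder endpoint
    TOKEN = "<keystone-token>"                        # placeholder token

    resp = requests.get(
        PLACEMENT_URL + "/allocation_candidates",
        params={"resources": "MEMORY_MB:2048,VCPU:2"},
        headers={
            "X-Auth-Token": TOKEN,
            # /allocation_candidates requires placement microversion >= 1.10
            "OpenStack-API-Version": "placement 1.10",
        },
    )
    # Expect a 500 whenever haproxy routes the request to a broken backend.
    print(resp.status_code)
    print(resp.text[:200])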
Martin, could this be another manifestation of the placement db_sync bug?

(In reply to Matthew Booth from comment #4)
> Martin, could this be another manifestation of the placement db_sync bug?

Yes, this is another instance of the deployment race, which we fix with BZ1636399 to make sure nova-placement gets started on ctrl1+2 only after the nova api db_sync has finished on ctrl0:

* haproxy nova_placement proxy with no backend servers:

    Oct 25 10:02:33 overcloud-controller-0 haproxy[11]: proxy nova_placement has no server available!

* placement started on controller-1+2:

    Oct 25 10:04:27 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-2.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
    Oct 25 10:04:31 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-1.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

* on controller-0, where we run the api db_sync, placement gets started:

    Oct 25 10:09:33 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-0.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

From /var/log/containers/httpd/nova-placement/placement_wsgi_error.log:

* controller-1:

    [Tue Oct 30 00:01:25.983130 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
    [Tue Oct 30 00:01:25.983164 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
    [Tue Oct 30 00:01:25.983191 2018] [:error] [pid 15] [remote 172.17.0.11:120] Traceback (most recent call last):
    [Tue Oct 30 00:01:25.983204 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
    [Tue Oct 30 00:01:25.983228 2018] [:error] [pid 15] [remote 172.17.0.11:120]     application = init_application()
    [Tue Oct 30 00:01:25.983235 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
    [Tue Oct 30 00:01:25.983254 2018] [:error] [pid 15] [remote 172.17.0.11:120]     db_api.configure(conf.CONF)
    [Tue Oct 30 00:01:25.983259 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
    [Tue Oct 30 00:01:25.983272 2018] [:error] [pid 15] [remote 172.17.0.11:120]     **_get_db_conf(conf.placement_database))
    [Tue Oct 30 00:01:25.983278 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
    [Tue Oct 30 00:01:25.983309 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._factory.configure(**kw)
    [Tue Oct 30 00:01:25.983319 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
    [Tue Oct 30 00:01:25.983333 2018] [:error] [pid 15] [remote 172.17.0.11:120]     return wrapped(*args, **kwargs)
    [Tue Oct 30 00:01:25.983339 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
    [Tue Oct 30 00:01:25.983350 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._configure(False, kw)
    [Tue Oct 30 00:01:25.983355 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
    [Tue Oct 30 00:01:25.983365 2018] [:error] [pid 15] [remote 172.17.0.11:120]     raise TypeError("this TransactionFactory is already started")
    [Tue Oct 30 00:01:25.983378 2018] [:error] [pid 15] [remote 172.17.0.11:120] TypeError: this TransactionFactory is already started
    [Tue Oct 30 00:01:37.259985 2018] [:error] [pid 15] [remote 172.17.0.14:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.

* controller-2:

    [Tue Oct 30 00:00:42.921489 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
    [Tue Oct 30 00:00:42.921517 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
    [Tue Oct 30 00:00:42.921541 2018] [:error] [pid 15] [remote 172.17.0.11:72] Traceback (most recent call last):
    [Tue Oct 30 00:00:42.921559 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
    [Tue Oct 30 00:00:42.921592 2018] [:error] [pid 15] [remote 172.17.0.11:72]     application = init_application()
    [Tue Oct 30 00:00:42.921603 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
    [Tue Oct 30 00:00:42.921632 2018] [:error] [pid 15] [remote 172.17.0.11:72]     db_api.configure(conf.CONF)
    [Tue Oct 30 00:00:42.921642 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
    [Tue Oct 30 00:00:42.921663 2018] [:error] [pid 15] [remote 172.17.0.11:72]     **_get_db_conf(conf.placement_database))
    [Tue Oct 30 00:00:42.921673 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
    [Tue Oct 30 00:00:42.921695 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._factory.configure(**kw)
    [Tue Oct 30 00:00:42.921704 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
    [Tue Oct 30 00:00:42.921725 2018] [:error] [pid 15] [remote 172.17.0.11:72]     return wrapped(*args, **kwargs)
    [Tue Oct 30 00:00:42.921735 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
    [Tue Oct 30 00:00:42.921755 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._configure(False, kw)
    [Tue Oct 30 00:00:42.921764 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
    [Tue Oct 30 00:00:42.921784 2018] [:error] [pid 15] [remote 172.17.0.11:72]     raise TypeError("this TransactionFactory is already started")
    [Tue Oct 30 00:00:42.921799 2018] [:error] [pid 15] [remote 172.17.0.11:72] TypeError: this TransactionFactory is already started
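The TypeError in both tracebacks comes from oslo.db's enginefacade: once a transaction factory has been started, a second configure() call is rejected. Below is a minimal standalone sketch of that behaviour, using only oslo.db and an in-memory SQLite URL rather than the actual placement startup path:

    # Minimal sketch of the oslo.db behaviour behind the tracebacks above:
    # once the enginefacade transaction factory has been started (first session
    # use), calling configure() again raises the same TypeError. This uses an
    # in-memory SQLite URL, not the real placement code path.
    from oslo_db.sqlalchemy import enginefacade


    @enginefacade.transaction_context_provider
    class _Context(object):
        """Bare request-context stand-in so enginefacade can track sessions."""


    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection="sqlite://")   # first configure: OK

    with context_manager.writer.using(_Context()):
        pass                                             # first use starts the factory

    context_manager.configure(connection="sqlite://")
    # -> TypeError: this TransactionFactory is already started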
Only controller-0, where nova-placement started correctly after the api db_sync, shows no errors. After a restart of the nova_placement container on ctrl1+2 we can start instances on the overcloud.

This is fixed with BZ1636399 / openstack-tripleo-heat-templates-9.0.1-0.20181013060866.ffbe879.el7ost, where we make sure nova-placement gets started on ctrl1+2 only after the nova api db_sync has finished on ctrl0.

*** This bug has been marked as a duplicate of bug 1636399 ***