Description of problem:

While testing a successfully deployed environment by launching a heat template that creates an instance (using https://github.com/openstack/tripleo-ha-utils/tree/master/roles/validate-ha), the stack fails with:

Stack CREATE FAILED (stack_test_ha_failed_actions): Resource CREATE failed: BadRequest: resources.instance: Block Device Mapping is Invalid: failed to get volume 0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5. (HTTP 400)

What is strange is that the volume actually does get created (/var/log/containers/cinder/cinder-volume.log):

2018-10-25 14:21:55.129 66 INFO cinder.volume.flows.manager.create_volume [req-8f0ac906-e562-4687-9260-1881082f08a4 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278713a7e39a - default default] Volume volume-0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5 (0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5): created successfully

But then (/var/log/containers/nova/nova-api.log):

2018-10-25 14:22:08.975 16 INFO nova.api.openstack.wsgi [req-90218002-0219-4ba4-84d0-0960242a1106 - - - - -] HTTP exception thrown: Block Device Mapping is Invalid: failed to get volume 0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5.

And everything ends with the instance creation failing (/var/log/containers/nova/nova-conductor.log):

2018-10-25 14:22:00.010 1 ERROR nova.conductor.manager [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278713a7e39a - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 154, in select_destinations
    raise exception.NoValidHost(reason="")
NoValidHost: No valid host was found.

Version-Release number of selected component (if applicable):
Puddle 2018-10-25.3

How reproducible:
The problem is consistent.

Steps to Reproduce:
1. Deploy an OSP14 environment;
2. Create a heat stack from a template (like https://github.com/openstack/tripleo-ha-utils/blob/master/roles/validate-ha/templates/validate-ha-heat-template.yaml.j2) that boots an instance from a previously created volume (a rough CLI equivalent is sketched at the end of this report).

Actual results:
The instance fails to create.

Expected results:
The instance and the heat stack get created successfully.

Additional info:
sosreport to come shortly.
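For reference, a rough CLI equivalent of what the heat template does (boot an instance from a pre-created volume), in case someone wants to reproduce outside of heat. This is only a sketch; the image, flavor and network names below are placeholders, not values from this environment:

source ~/overcloudrc
# create a small bootable volume (image name and size are placeholders)
openstack volume create --image cirros --size 1 test-ha-volume
# boot an instance from that volume; with this bug the boot fails and the
# server goes to ERROR with "No valid host was found"
openstack server create --flavor m1.small --network private \
  --volume test-ha-volume test-ha-instance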
The sosreports don't contain any cinder or nova logs. Not sure what happened, but could you capture them manually?
I examined the logs on the overcloud. Cinder's logs show the volume 0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5 is successfully created, but it is then deleted a few seconds later. I believe this is triggered by nova cleaning things up after the instance could not be spawned.

I then found this in the nova-scheduler.log on controller-0:

2018-10-25 14:21:59.994 26 ERROR nova.scheduler.client.report [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278713a7e39a - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={MEMORY_MB:2048, VCPU:2}, traits=[], aggregates=[]) Got 500:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p>
<p>Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error.</p>
<p>More information about this error may be available in the server error log.</p>
</body></html>
.
2018-10-25 14:21:59.995 26 INFO nova.scheduler.manager [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278

This is during the time period between when cinder created and later deleted the volume.

Moving over to Compute as this does not appear to be a cinder issue.
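In case it helps anyone hitting the same symptoms, a hypothetical way to chase the 500 down to the placement API on each controller. The error log path is the containerized OSP14 location; the VIP placeholder and the default placement port 8778 are assumptions:

sudo tail -n 50 /var/log/containers/httpd/nova-placement/placement_wsgi_error.log
# placement sits behind haproxy on the internal API VIP; replace the placeholder
curl -s -o /dev/null -w '%{http_code}\n' http://<internal_api_vip>:8778/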
Martin, could this be another manifestation of the placement db_sync bug?
(In reply to Matthew Booth from comment #4)
> Martin, could this be another manifestation of the placement db_sync bug?

Yes, this is another issue caused by the deployment race, which we fix with BZ1636399 by making sure nova-placement gets started on controller-1 and controller-2 only after the nova api db_sync has finished on controller-0:

* haproxy nova_placement proxy with no backend servers:

Oct 25 10:02:33 overcloud-controller-0 haproxy[11]: proxy nova_placement has no server available!

* placement started on controller-1 and controller-2:

Oct 25 10:04:27 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-2.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Oct 25 10:04:31 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-1.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

* on controller-0, where we run the api db_sync, placement gets started:

Oct 25 10:09:33 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-0.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

From /var/log/containers/httpd/nova-placement/placement_wsgi_error.log:

* controller-1:

[Tue Oct 30 00:01:25.983130 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
[Tue Oct 30 00:01:25.983164 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Tue Oct 30 00:01:25.983191 2018] [:error] [pid 15] [remote 172.17.0.11:120] Traceback (most recent call last):
[Tue Oct 30 00:01:25.983204 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Tue Oct 30 00:01:25.983228 2018] [:error] [pid 15] [remote 172.17.0.11:120]     application = init_application()
[Tue Oct 30 00:01:25.983235 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
[Tue Oct 30 00:01:25.983254 2018] [:error] [pid 15] [remote 172.17.0.11:120]     db_api.configure(conf.CONF)
[Tue Oct 30 00:01:25.983259 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
[Tue Oct 30 00:01:25.983272 2018] [:error] [pid 15] [remote 172.17.0.11:120]     **_get_db_conf(conf.placement_database))
[Tue Oct 30 00:01:25.983278 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
[Tue Oct 30 00:01:25.983309 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._factory.configure(**kw)
[Tue Oct 30 00:01:25.983319 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
[Tue Oct 30 00:01:25.983333 2018] [:error] [pid 15] [remote 172.17.0.11:120]     return wrapped(*args, **kwargs)
[Tue Oct 30 00:01:25.983339 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
[Tue Oct 30 00:01:25.983350 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._configure(False, kw)
[Tue Oct 30 00:01:25.983355 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
[Tue Oct 30 00:01:25.983365 2018] [:error] [pid 15] [remote 172.17.0.11:120]     raise TypeError("this TransactionFactory is already started")
[Tue Oct 30 00:01:25.983378 2018] [:error] [pid 15] [remote 172.17.0.11:120] TypeError: this TransactionFactory is already started
[Tue Oct 30 00:01:37.259985 2018] [:error] [pid 15] [remote 172.17.0.14:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.

* controller-2:

[Tue Oct 30 00:00:42.921489 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
[Tue Oct 30 00:00:42.921517 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Tue Oct 30 00:00:42.921541 2018] [:error] [pid 15] [remote 172.17.0.11:72] Traceback (most recent call last):
[Tue Oct 30 00:00:42.921559 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Tue Oct 30 00:00:42.921592 2018] [:error] [pid 15] [remote 172.17.0.11:72]     application = init_application()
[Tue Oct 30 00:00:42.921603 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
[Tue Oct 30 00:00:42.921632 2018] [:error] [pid 15] [remote 172.17.0.11:72]     db_api.configure(conf.CONF)
[Tue Oct 30 00:00:42.921642 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
[Tue Oct 30 00:00:42.921663 2018] [:error] [pid 15] [remote 172.17.0.11:72]     **_get_db_conf(conf.placement_database))
[Tue Oct 30 00:00:42.921673 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
[Tue Oct 30 00:00:42.921695 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._factory.configure(**kw)
[Tue Oct 30 00:00:42.921704 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
[Tue Oct 30 00:00:42.921725 2018] [:error] [pid 15] [remote 172.17.0.11:72]     return wrapped(*args, **kwargs)
[Tue Oct 30 00:00:42.921735 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
[Tue Oct 30 00:00:42.921755 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._configure(False, kw)
[Tue Oct 30 00:00:42.921764 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
[Tue Oct 30 00:00:42.921784 2018] [:error] [pid 15] [remote 172.17.0.11:72]     raise TypeError("this TransactionFactory is already started")
[Tue Oct 30 00:00:42.921799 2018] [:error] [pid 15] [remote 172.17.0.11:72] TypeError: this TransactionFactory is already started

Only controller-0, where nova-placement started correctly after the api db_sync, shows no errors.

After restarting the nova_placement container on controller-1 and controller-2 we can start instances on the overcloud.

This is fixed with BZ1636399 / openstack-tripleo-heat-templates-9.0.1-0.20181013060866.ffbe879.el7ost, which makes sure nova-placement gets started on controller-1 and controller-2 only after the nova api db_sync has finished on controller-0.
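For anyone hitting this on an environment deployed before the fixed package, the restart workaround described above looks roughly like the following. The container name nova_placement matches the haproxy backend shown earlier and OSP14 uses docker-managed containers; the ssh user and hostnames are placeholders for whatever your deployment uses:

# restart placement on the controllers that came up before the api db_sync
for host in overcloud-controller-1 overcloud-controller-2; do
    ssh heat-admin@${host} 'sudo docker restart nova_placement'
done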
*** This bug has been marked as a duplicate of bug 1636399 ***