Bug 1643147
Summary: | Unable to spawn an instance on a new env due to volume retrieve failure | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Raoul Scarazzini <rscarazz>
Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute>
Status: | CLOSED DUPLICATE | QA Contact: | OSP DFG:Compute <osp-dfg-compute>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 14.0 (Rocky) | CC: | abishop, berrange, cschwede, dasmith, eglynn, fpercoco, jhakimra, kchamart, mbooth, mschuppe, rscarazz, sbauza, sferdjao, sgordon, vromanso
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-11-01 15:40:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Raoul Scarazzini
2018-10-25 15:17:44 UTC
The sosreports don't contain any cinder or nova logs. Not sure what happened, but could you capture them manually?

I examined the logs on the overcloud. Cinder's logs show the volume 0bec4d7d-4f5b-47a5-8e0e-5bf45d139fc5 is successfully created, but it is then deleted a few seconds later. I believe this is triggered by nova cleaning things up after the instance could not be spawned.

I then found this in the nova-scheduler.log on controller-0:

    2018-10-25 14:21:59.994 26 ERROR nova.scheduler.client.report [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278713a7e39a - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={MEMORY_MB:2048, VCPU:2}, traits=[], aggregates=[]) Got 500:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>500 Internal Server Error</title>
    </head><body>
    <h1>Internal Server Error</h1>
    <p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p>
    <p>Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error.</p>
    <p>More information about this error may be available in the server error log.</p>
    </body></html>
    2018-10-25 14:21:59.995 26 INFO nova.scheduler.manager [req-7d24a7ff-c533-41f3-8756-5efd2742b4b0 f66a66945cc6441c8906f5fcefb080ee 61e8865cb21f4e229cf8278

This is during the time period between when cinder created and later deleted the volume. Moving over to Compute as this does not appear to be a cinder issue.
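For reference, the request the scheduler report client fails on can be repeated against the placement API directly to see which backend returns the 500. The following is only a minimal sketch of that query; the endpoint URL and token are placeholders, not values taken from this environment:

    # Minimal sketch: repeat the scheduler's allocation-candidates query against
    # the placement API. PLACEMENT_URL and TOKEN below are placeholders, not
    # values from this bug.
    import requests

    PLACEMENT_URL = "http://<internal-api-vip>:8778"  # placeholder endpoint
    TOKEN = "<keystone-token>"                        # placeholder token

    resp = requests.get(
        PLACEMENT_URL + "/allocation_candidates",
        params={"resources": "MEMORY_MB:2048,VCPU:2"},
        headers={
            "X-Auth-Token": TOKEN,
            # /allocation_candidates requires placement microversion >= 1.10
            "OpenStack-API-Version": "placement 1.10",
        },
    )
    # Expect a 500 whenever haproxy routes the request to a broken backend.
    print(resp.status_code)
    print(resp.text[:200])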
Martin, could this be another manifestation of the placement db_sync bug?

(In reply to Matthew Booth from comment #4)
> Martin, could this be another manifestation of the placement db_sync bug?

Yes, this is another instance of the deployment race, which we fix with BZ1636399 to make sure nova-placement gets started on ctrl1+2 only after the nova api db_sync has finished on ctrl0:

* haproxy nova_placement proxy with no backend servers:

    Oct 25 10:02:33 overcloud-controller-0 haproxy[11]: proxy nova_placement has no server available!

* placement started on controller-1+2:

    Oct 25 10:04:27 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-2.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
    Oct 25 10:04:31 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-1.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

* on controller-0, where we run the api db_sync, placement gets started:

    Oct 25 10:09:33 overcloud-controller-0 haproxy[11]: Server nova_placement/overcloud-controller-0.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

From /var/log/containers/httpd/nova-placement/placement_wsgi_error.log:

* controller-1:

    [Tue Oct 30 00:01:25.983130 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
    [Tue Oct 30 00:01:25.983164 2018] [:error] [pid 15] [remote 172.17.0.11:120] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
    [Tue Oct 30 00:01:25.983191 2018] [:error] [pid 15] [remote 172.17.0.11:120] Traceback (most recent call last):
    [Tue Oct 30 00:01:25.983204 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
    [Tue Oct 30 00:01:25.983228 2018] [:error] [pid 15] [remote 172.17.0.11:120]     application = init_application()
    [Tue Oct 30 00:01:25.983235 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
    [Tue Oct 30 00:01:25.983254 2018] [:error] [pid 15] [remote 172.17.0.11:120]     db_api.configure(conf.CONF)
    [Tue Oct 30 00:01:25.983259 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
    [Tue Oct 30 00:01:25.983272 2018] [:error] [pid 15] [remote 172.17.0.11:120]     **_get_db_conf(conf.placement_database))
    [Tue Oct 30 00:01:25.983278 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
    [Tue Oct 30 00:01:25.983309 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._factory.configure(**kw)
    [Tue Oct 30 00:01:25.983319 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
    [Tue Oct 30 00:01:25.983333 2018] [:error] [pid 15] [remote 172.17.0.11:120]     return wrapped(*args, **kwargs)
    [Tue Oct 30 00:01:25.983339 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
    [Tue Oct 30 00:01:25.983350 2018] [:error] [pid 15] [remote 172.17.0.11:120]     self._configure(False, kw)
    [Tue Oct 30 00:01:25.983355 2018] [:error] [pid 15] [remote 172.17.0.11:120]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
    [Tue Oct 30 00:01:25.983365 2018] [:error] [pid 15] [remote 172.17.0.11:120]     raise TypeError("this TransactionFactory is already started")
    [Tue Oct 30 00:01:25.983378 2018] [:error] [pid 15] [remote 172.17.0.11:120] TypeError: this TransactionFactory is already started
    [Tue Oct 30 00:01:37.259985 2018] [:error] [pid 15] [remote 172.17.0.14:120] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.

* controller-2:

    [Tue Oct 30 00:00:42.921489 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
    [Tue Oct 30 00:00:42.921517 2018] [:error] [pid 15] [remote 172.17.0.11:72] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
    [Tue Oct 30 00:00:42.921541 2018] [:error] [pid 15] [remote 172.17.0.11:72] Traceback (most recent call last):
    [Tue Oct 30 00:00:42.921559 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
    [Tue Oct 30 00:00:42.921592 2018] [:error] [pid 15] [remote 172.17.0.11:72]     application = init_application()
    [Tue Oct 30 00:00:42.921603 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
    [Tue Oct 30 00:00:42.921632 2018] [:error] [pid 15] [remote 172.17.0.11:72]     db_api.configure(conf.CONF)
    [Tue Oct 30 00:00:42.921642 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
    [Tue Oct 30 00:00:42.921663 2018] [:error] [pid 15] [remote 172.17.0.11:72]     **_get_db_conf(conf.placement_database))
    [Tue Oct 30 00:00:42.921673 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
    [Tue Oct 30 00:00:42.921695 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._factory.configure(**kw)
    [Tue Oct 30 00:00:42.921704 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
    [Tue Oct 30 00:00:42.921725 2018] [:error] [pid 15] [remote 172.17.0.11:72]     return wrapped(*args, **kwargs)
    [Tue Oct 30 00:00:42.921735 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
    [Tue Oct 30 00:00:42.921755 2018] [:error] [pid 15] [remote 172.17.0.11:72]     self._configure(False, kw)
    [Tue Oct 30 00:00:42.921764 2018] [:error] [pid 15] [remote 172.17.0.11:72]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
    [Tue Oct 30 00:00:42.921784 2018] [:error] [pid 15] [remote 172.17.0.11:72]     raise TypeError("this TransactionFactory is already started")
    [Tue Oct 30 00:00:42.921799 2018] [:error] [pid 15] [remote 172.17.0.11:72] TypeError: this TransactionFactory is already started
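The TypeError in both tracebacks comes from oslo.db's enginefacade: once a transaction factory has been started, a second configure() call is rejected. Below is a minimal standalone sketch of that behaviour, using only oslo.db and an in-memory SQLite URL rather than the actual placement startup path:

    # Minimal sketch of the oslo.db behaviour behind the tracebacks above:
    # once the enginefacade transaction factory has been started (first session
    # use), calling configure() again raises the same TypeError. This uses an
    # in-memory SQLite URL, not the real placement code path.
    from oslo_db.sqlalchemy import enginefacade


    @enginefacade.transaction_context_provider
    class _Context(object):
        """Bare request-context stand-in so enginefacade can track sessions."""


    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection="sqlite://")   # first configure: OK

    with context_manager.writer.using(_Context()):
        pass                                             # first use starts the factory

    context_manager.configure(connection="sqlite://")
    # -> TypeError: this TransactionFactory is already started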
Only controller-0, where nova-placement started correctly after the api db_sync, shows no errors. After a restart of the nova_placement container on ctrl1+2 we can start instances on the overcloud.

This is fixed with BZ1636399 / openstack-tripleo-heat-templates-9.0.1-0.20181013060866.ffbe879.el7ost, where we make sure nova-placement gets started on ctrl1+2 only after the nova api db_sync has finished on ctrl0.

*** This bug has been marked as a duplicate of bug 1636399 ***