Bug 1641877 - [UPGRADES][14] Failed to spawn vm after controllers upgrade
Summary: [UPGRADES][14] Failed to spawn vm after controllers upgrade
Keywords:
Status: CLOSED DUPLICATE of bug 1630069
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-23 05:39 UTC by Yurii Prokulevych
Modified: 2023-03-21 19:02 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-25 12:41:12 UTC
Target Upstream Version:
Embargoed:



Description Yurii Prokulevych 2018-10-23 05:39:18 UTC
Description of problem:
-----------------------
After the controllers were upgraded to RHOS-14, spawning an instance failed.
The instance fault reports 'No valid host was found':

fault:
  code: 500
  created: '2018-10-22T13:12:01Z'
  details: "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\"\
    , line 1206, in schedule_and_build_instances\n    instance_uuids, return_alternates=True)\n\
    \  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 723,\
    \ in _schedule_instances\n    return_alternates=return_alternates)\n  File \"\
    /usr/lib/python2.7/site-packages/nova/scheduler/utils.py\", line 907, in wrapped\n\
    \    return func(*args, **kwargs)\n  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\"\
    , line 53, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n\
    \  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\"\
    , line 37, in __run_method\n    return getattr(self.instance, __name)(*args, **kwargs)\n\
    \  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py\", line\
    \ 42, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n\
    \  File \"/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py\", line 158,\
    \ in select_destinations\n    return cctxt.call(ctxt, 'select_destinations', **msg_args)\n\
    \  File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line\
    \ 179, in call\n    retry=self.retry)\n  File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\"\
    , line 133, in _send\n    retry=retry)\n  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\"\
    , line 584, in send\n    call_monitor_timeout, retry=retry)\n  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\"\
    , line 575, in _send\n    raise result\n"
  message: 'No valid host was found. '
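
For reference, a fault block like the one above can be pulled programmatically. A minimal, hedged sketch using python-novaclient; the credential handling via OS_* environment variables is an assumption and not part of this report:

import os
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

# Build an authenticated session from the usual OS_* variables (assumption).
auth = v3.Password(auth_url=os.environ["OS_AUTH_URL"],
                   username=os.environ["OS_USERNAME"],
                   password=os.environ["OS_PASSWORD"],
                   project_name=os.environ["OS_PROJECT_NAME"],
                   user_domain_name="Default",
                   project_domain_name="Default")
nova = nova_client.Client("2.1", session=session.Session(auth=auth))
server = nova.servers.get("f3dfe99d-2918-462b-b5dd-647c005e9f28")
# 'fault' is only present on servers in ERROR state.
print(getattr(server, "fault", None))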

In scheduler.log there is a traceback related to the placement API:
-------------------------------------------------------------
2018-10-22 13:12:00.458 29 DEBUG nova.scheduler.manager [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Starting to schedule for instances: [u'f3dfe99d-2918-462b-b5dd-647c005e9f28'] select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/manager.py:117
2018-10-22 13:12:00.459 29 DEBUG oslo_concurrency.lockutils [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Lock "placement_client" acquired by "nova.scheduler.client.report._create_client" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-10-22 13:12:00.466 29 DEBUG oslo_concurrency.lockutils [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Lock "placement_client" released by "nova.scheduler.client.report._create_client" :: held 0.006s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-10-22 13:12:01.262 29 ERROR nova.scheduler.client.report [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={DISK_GB:1, MEMORY_MB:128, VCPU:1}, traits=[], aggregates=[])
Got 500: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>
.
2018-10-22 13:12:01.263 29 INFO nova.scheduler.manager [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Got no allocation candidates from the Placement API. This could be due to insufficient resources or a temporary occurrence as compute nodes start up.
2018-10-22 13:12:10.947 31 DEBUG oslo_service.periodic_task [req-904b6800-292f-4eca-84df-792cb81bde02 - - - - -] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
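
To confirm that the 500 comes from the placement service itself rather than from the scheduler, the same allocation-candidates query can be issued directly against the placement API. A hedged sketch with keystoneauth1; the interface, credentials and microversion below are assumptions:

import os
from keystoneauth1.identity import v3
from keystoneauth1 import session, adapter

auth = v3.Password(auth_url=os.environ["OS_AUTH_URL"],
                   username=os.environ["OS_USERNAME"],
                   password=os.environ["OS_PASSWORD"],
                   project_name=os.environ["OS_PROJECT_NAME"],
                   user_domain_name="Default",
                   project_domain_name="Default")
placement = adapter.Adapter(session=session.Session(auth=auth),
                            service_type="placement",
                            interface="internal")
# Same resources as the RequestGroup logged above.
resp = placement.get("/allocation_candidates"
                     "?resources=DISK_GB:1,MEMORY_MB:128,VCPU:1",
                     headers={"OpenStack-API-Version": "placement 1.25"},
                     raise_exc=False)
print(resp.status_code)  # 500 on the broken controllers, 200 when healthy

The same adapter can also GET /resource_providers/<uuid>/aggregates, which is the call that fails in the compute-node traceback below.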

And on the compute node:
------------------------
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager [req-4f839bda-36b7-4cf9-ad9b-2beee58ad5ff 8defd999c50c482084879a1ca5edc207 0c6def3a05154b97acaa82cdb523acbd - default default] Error updating resources for node compute-0.localdomain.: ResourceProviderAggregateRetrievalFailed: Failed to get aggregates for resource provider with UUID 4c2f132d-6571-4724-ab7a-251dc7bc5919
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager Traceback (most recent call last):
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7269, in update_available_resource_for_node
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 686, in update_available_resource
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._update_available_resource(context, resources)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return f(*args, **kwargs)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 710, in _update_available_resource
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._init_compute_node(context, resources)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 561, in _init_compute_node
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._update(context, cn)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 884, in _update
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     inv_data,
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     parent_provider_uuid=parent_provider_uuid,
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return getattr(self.instance, __name)(*args, **kwargs)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 1103, in set_inventory_for_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     parent_provider_uuid=parent_provider_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 655, in _ensure_resource_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._refresh_associations(context, u, force=False)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 778, in _refresh_associations
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     aggs = self._get_provider_aggregates(context, rp_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 64, in wrapper
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return f(self, *a, **k)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 407, in _get_provider_aggregates
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     raise exception.ResourceProviderAggregateRetrievalFailed(uuid=rp_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager ResourceProviderAggregateRetrievalFailed: Failed to get aggregates for resource provider with UUID 4c2f132d-6571-4724-ab7a-251dc7bc5919
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager 



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
python-nova-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-console-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
puppet-nova-13.3.1-0.20181001154308.3f8c3ee.el7ost.noarch
python2-novaclient-11.0.0-0.20180809174649.f1005ce.el7ost.noarch
openstack-nova-novncproxy-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-conductor-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-api-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-migration-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-scheduler-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-compute-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-common-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-placement-api-18.0.2-0.20181002032919.c2045ed.el7ost.noarch

openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch
python-tripleoclient-10.6.1-0.20180929200237.1d8dcb6.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud (UC) to RHOS-14
2. Set up repos/containers for the overcloud
3. Run upgrade prepare to set up the upgrade playbooks
4. Upgrade the controller nodes
5. Try to spawn an instance

Actual results:
---------------
Failed to spawn instance


Expected results:
-----------------
Instance is spawned successfully

Additional info:
----------------
Bare-metal setup upgraded from 12 -> 13 -> 14
3 controllers + 2 computes + external Ceph

Comment 3 Yurii Prokulevych 2018-10-23 09:03:00 UTC
Restarting the nova containers on the controllers didn't help.
Was able to spawn a VM after the computes were upgraded to 14.

Comment 4 Dan Smith 2018-10-24 14:07:22 UTC
In the placement log for controllers 1 and 2, there are a ton of database-related errors, which appear to be causing every query to result in a 500 (as reported). Here are two of the types of errors I see, quoted from controller 1's log:

[Mon Oct 22 12:02:54.381200 2018] [:error] [pid 15] [remote 192.168.172.12:92] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Mon Oct 22 12:02:54.381234 2018] [:error] [pid 15] [remote 192.168.172.12:92] Traceback (most recent call last):
[Mon Oct 22 12:02:54.381257 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Mon Oct 22 12:02:54.381298 2018] [:error] [pid 15] [remote 192.168.172.12:92]     application = init_application()
[Mon Oct 22 12:02:54.381313 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 120, in init_application
[Mon Oct 22 12:02:54.381345 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return deploy.loadapp(conf.CONF)
[Mon Oct 22 12:02:54.381358 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 119, in loadapp
[Mon Oct 22 12:02:54.381485 2018] [:error] [pid 15] [remote 192.168.172.12:92]     update_database()
[Mon Oct 22 12:02:54.381502 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 99, in update_database
[Mon Oct 22 12:02:54.381531 2018] [:error] [pid 15] [remote 192.168.172.12:92]     resource_provider.ensure_trait_sync(ctx)
[Mon Oct 22 12:02:54.381546 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 146, in ensure_trait_sync
[Mon Oct 22 12:02:54.382796 2018] [:error] [pid 15] [remote 192.168.172.12:92]     _trait_sync(ctx)
[Mon Oct 22 12:02:54.382827 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
[Mon Oct 22 12:02:54.382987 2018] [:error] [pid 15] [remote 192.168.172.12:92]     ectxt.value = e.inner_exc
[Mon Oct 22 12:02:54.383006 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
[Mon Oct 22 12:02:54.383172 2018] [:error] [pid 15] [remote 192.168.172.12:92]     self.force_reraise()
[Mon Oct 22 12:02:54.383190 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
[Mon Oct 22 12:02:54.383218 2018] [:error] [pid 15] [remote 192.168.172.12:92]     six.reraise(self.type_, self.value, self.tb)
[Mon Oct 22 12:02:54.383231 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
[Mon Oct 22 12:02:54.383254 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return f(*args, **kwargs)
[Mon Oct 22 12:02:54.383267 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
[Mon Oct 22 12:02:54.383674 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return fn(*args, **kwargs)
[Mon Oct 22 12:02:54.383699 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 102, in _trait_sync
[Mon Oct 22 12:02:54.383732 2018] [:error] [pid 15] [remote 192.168.172.12:92]     res = ctx.session.execute(sel).fetchall()
[Mon Oct 22 12:02:54.383749 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1176, in execute
[Mon Oct 22 12:02:54.384573 2018] [:error] [pid 15] [remote 192.168.172.12:92]     bind, close_with_result=True).execute(clause, params or {})
[Mon Oct 22 12:02:54.384593 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
[Mon Oct 22 12:02:54.385211 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return meth(self, multiparams, params)
[Mon Oct 22 12:02:54.385236 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
[Mon Oct 22 12:02:54.386328 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return connection._execute_clauseelement(self, multiparams, params)
[Mon Oct 22 12:02:54.386355 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
[Mon Oct 22 12:02:54.386388 2018] [:error] [pid 15] [remote 192.168.172.12:92]     compiled_sql, distilled_params
[Mon Oct 22 12:02:54.386402 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
[Mon Oct 22 12:02:54.386425 2018] [:error] [pid 15] [remote 192.168.172.12:92]     context)
[Mon Oct 22 12:02:54.386438 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
[Mon Oct 22 12:02:54.386461 2018] [:error] [pid 15] [remote 192.168.172.12:92]     util.raise_from_cause(newraise, exc_info)
[Mon Oct 22 12:02:54.386474 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
[Mon Oct 22 12:02:54.386600 2018] [:error] [pid 15] [remote 192.168.172.12:92]     reraise(type(exception), exception, tb=exc_tb, cause=cause)
[Mon Oct 22 12:02:54.386617 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
[Mon Oct 22 12:02:54.386644 2018] [:error] [pid 15] [remote 192.168.172.12:92]     context)
[Mon Oct 22 12:02:54.386681 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
[Mon Oct 22 12:02:54.387083 2018] [:error] [pid 15] [remote 192.168.172.12:92]     cursor.execute(statement, parameters)
[Mon Oct 22 12:02:54.387103 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 166, in execute
[Mon Oct 22 12:02:54.387307 2018] [:error] [pid 15] [remote 192.168.172.12:92]     result = self._query(query)
[Mon Oct 22 12:02:54.387325 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 322, in _query
[Mon Oct 22 12:02:54.387352 2018] [:error] [pid 15] [remote 192.168.172.12:92]     conn.query(q)
[Mon Oct 22 12:02:54.387366 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 856, in query
[Mon Oct 22 12:02:54.387827 2018] [:error] [pid 15] [remote 192.168.172.12:92]     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
[Mon Oct 22 12:02:54.387850 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1057, in _read_query_result
[Mon Oct 22 12:02:54.387879 2018] [:error] [pid 15] [remote 192.168.172.12:92]     result.read()
[Mon Oct 22 12:02:54.387897 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1340, in read
[Mon Oct 22 12:02:54.387922 2018] [:error] [pid 15] [remote 192.168.172.12:92]     first_packet = self.connection._read_packet()
[Mon Oct 22 12:02:54.387935 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1014, in _read_packet
[Mon Oct 22 12:02:54.387958 2018] [:error] [pid 15] [remote 192.168.172.12:92]     packet.check_error()
[Mon Oct 22 12:02:54.387971 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
[Mon Oct 22 12:02:54.388007 2018] [:error] [pid 15] [remote 192.168.172.12:92]     err.raise_mysql_exception(self._data)
[Mon Oct 22 12:02:54.388022 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
[Mon Oct 22 12:02:54.388122 2018] [:error] [pid 15] [remote 192.168.172.12:92]     raise errorclass(errno, errval)
[Mon Oct 22 12:02:54.388213 2018] [:error] [pid 15] [remote 192.168.172.12:92] ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'nova_placement.traits' doesn't exist") [SQL: u'SELECT traits.name \\nFROM traits'] (Background on this error at: http://sqlalche.me/e/f405)
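
The missing table can be checked for directly on each controller. A hedged sketch; the connection URL is a placeholder and should be taken from the placement_database/connection option in the placement container's nova.conf:

import sqlalchemy as sa

# Placeholder URL (assumption): use the real placement_database/connection.
engine = sa.create_engine(
    "mysql+pymysql://placement:PASSWORD@controller-vip/nova_placement")
tables = sa.inspect(engine).get_table_names()
print("traits" in tables, sorted(tables))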

[Mon Oct 22 14:04:06.902733 2018] [:error] [pid 16] [remote 192.168.172.18:84] mod_wsgi (pid=16): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Mon Oct 22 14:04:06.902761 2018] [:error] [pid 16] [remote 192.168.172.18:84] Traceback (most recent call last):
[Mon Oct 22 14:04:06.902784 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Mon Oct 22 14:04:06.902827 2018] [:error] [pid 16] [remote 192.168.172.18:84]     application = init_application()
[Mon Oct 22 14:04:06.902841 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
[Mon Oct 22 14:04:06.902875 2018] [:error] [pid 16] [remote 192.168.172.18:84]     db_api.configure(conf.CONF)
[Mon Oct 22 14:04:06.902889 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
[Mon Oct 22 14:04:06.902915 2018] [:error] [pid 16] [remote 192.168.172.18:84]     **_get_db_conf(conf.placement_database))
[Mon Oct 22 14:04:06.902928 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
[Mon Oct 22 14:04:06.902955 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._factory.configure(**kw)
[Mon Oct 22 14:04:06.902968 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
[Mon Oct 22 14:04:06.903001 2018] [:error] [pid 16] [remote 192.168.172.18:84]     return wrapped(*args, **kwargs)
[Mon Oct 22 14:04:06.903014 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
[Mon Oct 22 14:04:06.903039 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._configure(False, kw)
[Mon Oct 22 14:04:06.903051 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
[Mon Oct 22 14:04:06.903082 2018] [:error] [pid 16] [remote 192.168.172.18:84]     raise TypeError("this TransactionFactory is already started")
[Mon Oct 22 14:04:06.903105 2018] [:error] [pid 16] [remote 192.168.172.18:84] TypeError: this TransactionFactory is already started

Controller 0 seems to be fine. I see no indication that this would be related to the computes being un-upgraded. Did you try restarting the httpd container on all controllers? Is it possible that you didn't, but that happened as a result of doing the compute upgrade?
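
The second traceback ("TransactionFactory is already started") is the generic oslo.db behaviour when configure() is called again after the transaction factory has already created its engine, which would be consistent with init_application() being re-run inside the same WSGI process after a first attempt failed partway through. A minimal, self-contained sketch of that behaviour (in-memory SQLite, purely illustrative):

from oslo_db.sqlalchemy import enginefacade

ctx = enginefacade.transaction_context()
ctx.configure(connection="sqlite://")   # first configure: accepted
ctx.writer.get_engine()                 # first use starts the factory
ctx.configure(connection="sqlite://")   # configure again after start
# -> TypeError: this TransactionFactory is already started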

Comment 5 Yurii Prokulevych 2018-10-24 15:23:22 UTC
The compute upgrade doesn't touch the controller nodes.
I cannot find an httpd container on the controller nodes; could you please elaborate which one I should try restarting next time? (Tried nova_metadata, nova_api, nova_conductor, nova_vnc_proxy, nova_consoleauth, nova_api_cron, nova_scheduler, nova_placement.)
The DB errors are most likely related to the controller upgrades (nodes are stopped and upgraded one by one).
I'll try to get in touch with you next time I have the env handy.

Thanks!

Comment 6 Dan Smith 2018-10-24 16:39:16 UTC
These logs are from sosreport-failed-placemant-boot-upg-controller-1-20181022135145/var/log/containers/httpd/nova-placement/placement_wsgi_error.log

So I would expect that whatever container you need to restart to bump those services would be relevant.

The errors in those logs are definitely related to why scheduler gets 500s from placement, and seem to indicate that placement on those controllers either can't talk to the database, or is talking to the wrong one. Looking at the placement access log on those controllers, many requests were made, all of which failed with 500. So I don't think these errors were just transient during the upgrade -- there is no indication that those placement services worked at all.

Comment 7 Matthew Booth 2018-10-25 12:41:12 UTC
(In reply to Dan Smith from comment #4)
> "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py",
> line 35, in configure
> [Mon Oct 22 14:04:06.902915 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]     **_get_db_conf(conf.placement_database))
> [Mon Oct 22 14:04:06.902928 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]   File
> "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line
> 788, in configure
> [Mon Oct 22 14:04:06.902955 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]     self._factory.configure(**kw)
> [Mon Oct 22 14:04:06.902968 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]   File
> "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in
> decorator
> [Mon Oct 22 14:04:06.903001 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]     return wrapped(*args, **kwargs)
> [Mon Oct 22 14:04:06.903014 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]   File
> "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line
> 312, in configure
> [Mon Oct 22 14:04:06.903039 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]     self._configure(False, kw)
> [Mon Oct 22 14:04:06.903051 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]   File
> "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line
> 317, in _configure
> [Mon Oct 22 14:04:06.903082 2018] [:error] [pid 16] [remote
> 192.168.172.18:84]     raise TypeError("this TransactionFactory is already
> started")
> [Mon Oct 22 14:04:06.903105 2018] [:error] [pid 16] [remote
> 192.168.172.18:84] TypeError: this TransactionFactory is already started

This looks like a dup of bug 1630069 to me, which is caused by starting placement before db_sync completes. There are 2 different patches in progress for that.
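
For illustration only, the ordering constraint described above (placement must not start before its db_sync has created the schema) could be expressed as a simple wait on the schema. This is a hedged sketch, not the actual patch tracked in bug 1630069:

import time
import sqlalchemy as sa
from sqlalchemy.exc import OperationalError

def wait_for_placement_schema(db_url, timeout=300, interval=5):
    """Poll until a table created by db_sync (e.g. 'traits') exists."""
    engine = sa.create_engine(db_url)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if "traits" in sa.inspect(engine).get_table_names():
                return True
        except OperationalError:
            pass  # database not reachable yet
        time.sleep(interval)
    return False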

*** This bug has been marked as a duplicate of bug 1630069 ***

