Description of problem:
-----------------------
After the controllers were upgraded to RHOS-14, instances fail to spawn. The instance fault reports 'No valid host was found':

fault:
  code: 500
  created: '2018-10-22T13:12:01Z'
  details: |
    File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1206, in schedule_and_build_instances
      instance_uuids, return_alternates=True)
    File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 723, in _schedule_instances
      return_alternates=return_alternates)
    File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 907, in wrapped
      return func(*args, **kwargs)
    File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations
      instance_uuids, return_objects, return_alternates)
    File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
      return getattr(self.instance, __name)(*args, **kwargs)
    File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
      instance_uuids, return_objects, return_alternates)
    File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations
      return cctxt.call(ctxt, 'select_destinations', **msg_args)
    File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
      retry=self.retry)
    File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
      retry=retry)
    File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
      call_monitor_timeout, retry=retry)
    File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
      raise result
  message: 'No valid host was found. '

In scheduler.log there is a traceback related to the placement API:
-------------------------------------------------------------------
2018-10-22 13:12:00.458 29 DEBUG nova.scheduler.manager [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Starting to schedule for instances: [u'f3dfe99d-2918-462b-b5dd-647c005e9f28'] select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/manager.py:117
2018-10-22 13:12:00.459 29 DEBUG oslo_concurrency.lockutils [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Lock "placement_client" acquired by "nova.scheduler.client.report._create_client" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-10-22 13:12:00.466 29 DEBUG oslo_concurrency.lockutils [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Lock "placement_client" released by "nova.scheduler.client.report._create_client" :: held 0.006s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-10-22 13:12:01.262 29 ERROR nova.scheduler.client.report [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={DISK_GB:1, MEMORY_MB:128, VCPU:1}, traits=[], aggregates=[]) Got 500:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p>
<p>Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error.</p>
<p>More information about this error may be available in the server error log.</p>
</body></html>
2018-10-22 13:12:01.263 29 INFO nova.scheduler.manager [req-8d0de298-6c01-40da-bc8d-a75c20795c70 e87c1cb8373340f6b76483259fb3a9ef 6044441e1dcb4e809b4f28c247239feb - default default] Got no allocation candidates from the Placement API. This could be due to insufficient resources or a temporary occurrence as compute nodes start up.
2018-10-22 13:12:10.947 31 DEBUG oslo_service.periodic_task [req-904b6800-292f-4eca-84df-792cb81bde02 - - - - -] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215

And on the compute node:
------------------------
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager [req-4f839bda-36b7-4cf9-ad9b-2beee58ad5ff 8defd999c50c482084879a1ca5edc207 0c6def3a05154b97acaa82cdb523acbd - default default] Error updating resources for node compute-0.localdomain.: ResourceProviderAggregateRetrievalFailed: Failed to get aggregates for resource provider with UUID 4c2f132d-6571-4724-ab7a-251dc7bc5919
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager Traceback (most recent call last):
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7269, in update_available_resource_for_node
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 686, in update_available_resource
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._update_available_resource(context, resources)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return f(*args, **kwargs)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 710, in _update_available_resource
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._init_compute_node(context, resources)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 561, in _init_compute_node
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._update(context, cn)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 884, in _update
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     inv_data,
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     parent_provider_uuid=parent_provider_uuid,
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return getattr(self.instance, __name)(*args, **kwargs)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 1103, in set_inventory_for_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     parent_provider_uuid=parent_provider_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 655, in _ensure_resource_provider
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     self._refresh_associations(context, u, force=False)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 778, in _refresh_associations
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     aggs = self._get_provider_aggregates(context, rp_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 64, in wrapper
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     return f(self, *a, **k)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 407, in _get_provider_aggregates
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager     raise exception.ResourceProviderAggregateRetrievalFailed(uuid=rp_uuid)
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager ResourceProviderAggregateRetrievalFailed: Failed to get aggregates for resource provider with UUID 4c2f132d-6571-4724-ab7a-251dc7bc5919
2018-10-22 13:42:25.905 1 ERROR nova.compute.manager

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
python-nova-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-console-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
puppet-nova-13.3.1-0.20181001154308.3f8c3ee.el7ost.noarch
python2-novaclient-11.0.0-0.20180809174649.f1005ce.el7ost.noarch
openstack-nova-novncproxy-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-conductor-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-api-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-migration-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-scheduler-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-compute-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-common-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-nova-placement-api-18.0.2-0.20181002032919.c2045ed.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch
python-tripleoclient-10.6.1-0.20180929200237.1d8dcb6.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the UC to RHOS-14
2. Set up repos/containers for the OC
3. Run upgrade prepare to set up the upgrade playbooks
4. Upgrade the controller nodes
5. Try to spawn an instance

Actual results:
---------------
Failed to spawn an instance.

Expected results:
-----------------
Instance is spawned successfully.

Additional info:
----------------
BM setup upgraded from 12->13->14.
3 controllers + 2 computes + external ceph.
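For triage it can help to pull the innermost frame out of the fault's `details` blob shown in the description, since that is where the original exception was raised. A minimal sketch; the `last_frame` helper and the truncated sample text are illustrative, not part of nova:

```python
import re

def last_frame(details):
    """Return (path, line, func) of the innermost frame in a fault's
    'details' traceback text, assuming the usual
    'File "...", line N, in name' layout."""
    frames = re.findall(r'File "([^"]+)", line (\d+), in (\S+)', details)
    return frames[-1] if frames else None

# Abridged copy of the fault details from this report.
details = '''  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1206, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
    raise result
'''

# -> ('/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py', '575', '_send')
print(last_frame(details))
```

Here the innermost frame is the oslo.messaging RPC layer re-raising the scheduler's remote exception, which is why the fault surfaces in the conductor rather than in placement itself.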
Restarting the nova containers on the controllers didn't help. I was able to spawn a VM after the computes were upgraded to 14.
In the placement log for controllers 1 and 2, there are a ton of database-related errors, which appear to be causing every query to result in a 500 (as reported). Here are two of the types of errors I see, quoted from controller 1's log:

[Mon Oct 22 12:02:54.381200 2018] [:error] [pid 15] [remote 192.168.172.12:92] mod_wsgi (pid=15): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Mon Oct 22 12:02:54.381234 2018] [:error] [pid 15] [remote 192.168.172.12:92] Traceback (most recent call last):
[Mon Oct 22 12:02:54.381257 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Mon Oct 22 12:02:54.381298 2018] [:error] [pid 15] [remote 192.168.172.12:92]     application = init_application()
[Mon Oct 22 12:02:54.381313 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 120, in init_application
[Mon Oct 22 12:02:54.381345 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return deploy.loadapp(conf.CONF)
[Mon Oct 22 12:02:54.381358 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 119, in loadapp
[Mon Oct 22 12:02:54.381485 2018] [:error] [pid 15] [remote 192.168.172.12:92]     update_database()
[Mon Oct 22 12:02:54.381502 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 99, in update_database
[Mon Oct 22 12:02:54.381531 2018] [:error] [pid 15] [remote 192.168.172.12:92]     resource_provider.ensure_trait_sync(ctx)
[Mon Oct 22 12:02:54.381546 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 146, in ensure_trait_sync
[Mon Oct 22 12:02:54.382796 2018] [:error] [pid 15] [remote 192.168.172.12:92]     _trait_sync(ctx)
[Mon Oct 22 12:02:54.382827 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
[Mon Oct 22 12:02:54.382987 2018] [:error] [pid 15] [remote 192.168.172.12:92]     ectxt.value = e.inner_exc
[Mon Oct 22 12:02:54.383006 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
[Mon Oct 22 12:02:54.383172 2018] [:error] [pid 15] [remote 192.168.172.12:92]     self.force_reraise()
[Mon Oct 22 12:02:54.383190 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
[Mon Oct 22 12:02:54.383218 2018] [:error] [pid 15] [remote 192.168.172.12:92]     six.reraise(self.type_, self.value, self.tb)
[Mon Oct 22 12:02:54.383231 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
[Mon Oct 22 12:02:54.383254 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return f(*args, **kwargs)
[Mon Oct 22 12:02:54.383267 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
[Mon Oct 22 12:02:54.383674 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return fn(*args, **kwargs)
[Mon Oct 22 12:02:54.383699 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 102, in _trait_sync
[Mon Oct 22 12:02:54.383732 2018] [:error] [pid 15] [remote 192.168.172.12:92]     res = ctx.session.execute(sel).fetchall()
[Mon Oct 22 12:02:54.383749 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1176, in execute
[Mon Oct 22 12:02:54.384573 2018] [:error] [pid 15] [remote 192.168.172.12:92]     bind, close_with_result=True).execute(clause, params or {})
[Mon Oct 22 12:02:54.384593 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
[Mon Oct 22 12:02:54.385211 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return meth(self, multiparams, params)
[Mon Oct 22 12:02:54.385236 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
[Mon Oct 22 12:02:54.386328 2018] [:error] [pid 15] [remote 192.168.172.12:92]     return connection._execute_clauseelement(self, multiparams, params)
[Mon Oct 22 12:02:54.386355 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
[Mon Oct 22 12:02:54.386388 2018] [:error] [pid 15] [remote 192.168.172.12:92]     compiled_sql, distilled_params
[Mon Oct 22 12:02:54.386402 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
[Mon Oct 22 12:02:54.386425 2018] [:error] [pid 15] [remote 192.168.172.12:92]     context)
[Mon Oct 22 12:02:54.386438 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
[Mon Oct 22 12:02:54.386461 2018] [:error] [pid 15] [remote 192.168.172.12:92]     util.raise_from_cause(newraise, exc_info)
[Mon Oct 22 12:02:54.386474 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
[Mon Oct 22 12:02:54.386600 2018] [:error] [pid 15] [remote 192.168.172.12:92]     reraise(type(exception), exception, tb=exc_tb, cause=cause)
[Mon Oct 22 12:02:54.386617 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
[Mon Oct 22 12:02:54.386644 2018] [:error] [pid 15] [remote 192.168.172.12:92]     context)
[Mon Oct 22 12:02:54.386681 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
[Mon Oct 22 12:02:54.387083 2018] [:error] [pid 15] [remote 192.168.172.12:92]     cursor.execute(statement, parameters)
[Mon Oct 22 12:02:54.387103 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 166, in execute
[Mon Oct 22 12:02:54.387307 2018] [:error] [pid 15] [remote 192.168.172.12:92]     result = self._query(query)
[Mon Oct 22 12:02:54.387325 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 322, in _query
[Mon Oct 22 12:02:54.387352 2018] [:error] [pid 15] [remote 192.168.172.12:92]     conn.query(q)
[Mon Oct 22 12:02:54.387366 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 856, in query
[Mon Oct 22 12:02:54.387827 2018] [:error] [pid 15] [remote 192.168.172.12:92]     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
[Mon Oct 22 12:02:54.387850 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1057, in _read_query_result
[Mon Oct 22 12:02:54.387879 2018] [:error] [pid 15] [remote 192.168.172.12:92]     result.read()
[Mon Oct 22 12:02:54.387897 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1340, in read
[Mon Oct 22 12:02:54.387922 2018] [:error] [pid 15] [remote 192.168.172.12:92]     first_packet = self.connection._read_packet()
[Mon Oct 22 12:02:54.387935 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1014, in _read_packet
[Mon Oct 22 12:02:54.387958 2018] [:error] [pid 15] [remote 192.168.172.12:92]     packet.check_error()
[Mon Oct 22 12:02:54.387971 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
[Mon Oct 22 12:02:54.388007 2018] [:error] [pid 15] [remote 192.168.172.12:92]     err.raise_mysql_exception(self._data)
[Mon Oct 22 12:02:54.388022 2018] [:error] [pid 15] [remote 192.168.172.12:92]   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
[Mon Oct 22 12:02:54.388122 2018] [:error] [pid 15] [remote 192.168.172.12:92]     raise errorclass(errno, errval)
[Mon Oct 22 12:02:54.388213 2018] [:error] [pid 15] [remote 192.168.172.12:92] ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'nova_placement.traits' doesn't exist") [SQL: u'SELECT traits.name \nFROM traits'] (Background on this error at: http://sqlalche.me/e/f405)

[Mon Oct 22 14:04:06.902733 2018] [:error] [pid 16] [remote 192.168.172.18:84] mod_wsgi (pid=16): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Mon Oct 22 14:04:06.902761 2018] [:error] [pid 16] [remote 192.168.172.18:84] Traceback (most recent call last):
[Mon Oct 22 14:04:06.902784 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Mon Oct 22 14:04:06.902827 2018] [:error] [pid 16] [remote 192.168.172.18:84]     application = init_application()
[Mon Oct 22 14:04:06.902841 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 108, in init_application
[Mon Oct 22 14:04:06.902875 2018] [:error] [pid 16] [remote 192.168.172.18:84]     db_api.configure(conf.CONF)
[Mon Oct 22 14:04:06.902889 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
[Mon Oct 22 14:04:06.902915 2018] [:error] [pid 16] [remote 192.168.172.18:84]     **_get_db_conf(conf.placement_database))
[Mon Oct 22 14:04:06.902928 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
[Mon Oct 22 14:04:06.902955 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._factory.configure(**kw)
[Mon Oct 22 14:04:06.902968 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
[Mon Oct 22 14:04:06.903001 2018] [:error] [pid 16] [remote 192.168.172.18:84]     return wrapped(*args, **kwargs)
[Mon Oct 22 14:04:06.903014 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
[Mon Oct 22 14:04:06.903039 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._configure(False, kw)
[Mon Oct 22 14:04:06.903051 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
[Mon Oct 22 14:04:06.903082 2018] [:error] [pid 16] [remote 192.168.172.18:84]     raise TypeError("this TransactionFactory is already started")
[Mon Oct 22 14:04:06.903105 2018] [:error] [pid 16] [remote 192.168.172.18:84] TypeError: this TransactionFactory is already started

Controller 0 seems to be fine. I see no indication that this would be related to the computes being un-upgraded.

Did you try restarting the httpd container on all controllers? Is it possible that you didn't, but that happened as a result of doing the compute upgrade?
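The second TypeError comes from oslo.db's enginefacade refusing to be reconfigured after its first use. A toy model of that guard, with simplified names that are illustrative rather than oslo.db's real internals:

```python
class TransactionFactory:
    """Toy model of an enginefacade-style factory: configure() may only
    be called before the factory is 'started' (first real use),
    mirroring the 'this TransactionFactory is already started'
    TypeError quoted in the log above."""

    def __init__(self):
        self._started = False
        self._conf = {}

    def configure(self, **kw):
        # Configuration is only legal before the factory has been used.
        if self._started:
            raise TypeError("this TransactionFactory is already started")
        self._conf.update(kw)

    def get_engine(self):
        # First use freezes the configuration.
        self._started = True
        return dict(self._conf)  # stand-in for a real engine object

factory = TransactionFactory()
factory.configure(connection="mysql+pymysql://placement@db/nova_placement")
factory.get_engine()                 # first use starts the factory
try:
    factory.configure(connection="other")   # a second configure, as in the log
except TypeError as e:
    print(e)   # prints "this TransactionFactory is already started"
```

This is consistent with the reported symptom: once one request has configured the database facade, a later code path that tries to configure it again (for example, a retried application init after an earlier failure) blows up instead of being a no-op.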
The compute upgrade doesn't touch the controller nodes. I cannot find an httpd container on the controller nodes; could you please elaborate on which one I should try restarting next time? (I tried nova_metadata, nova_api, nova_conductor, nova_vnc_proxy, nova_consoleauth, nova_api_cron, nova_scheduler, and nova_placement.)

The DB errors are most likely related to the controller upgrades (nodes are stopped and upgraded one by one). I'll try to get in touch with you next time I have the env handy. Thanks!
These logs are from sosreport-failed-placemant-boot-upg-controller-1-20181022135145/var/log/containers/httpd/nova-placement/placement_wsgi_error.log, so I would expect that whatever container you need to restart to bump those services would be the relevant one.

The errors in those logs are definitely related to why the scheduler gets 500s from placement, and they seem to indicate that placement on those controllers either can't talk to the database, or is talking to the wrong one.

Looking at the placement access log on those controllers, many requests were made, all of which failed with 500. So I don't think these errors were just transient during the upgrade -- there is no indication that those placement services worked at all.
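A quick way to confirm "all requests failed with 500" is to tally status codes from the access log. A minimal sketch; the sample lines below are made up for illustration (Apache combined log format assumed), not copied from the sosreport:

```python
import re
from collections import Counter

def status_counts(lines):
    """Count HTTP status codes in combined-log-format lines.
    The status is the three-digit field right after the quoted request."""
    counts = Counter()
    for line in lines:
        m = re.search(r'" (\d{3}) ', line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical access-log lines shaped like placement requests from this bug.
sample = [
    '192.168.172.12 - - [22/Oct/2018:13:12:01 +0000] "GET /allocation_candidates?resources=DISK_GB:1,MEMORY_MB:128,VCPU:1 HTTP/1.1" 500 527',
    '192.168.172.12 - - [22/Oct/2018:13:42:25 +0000] "GET /resource_providers/4c2f132d-6571-4724-ab7a-251dc7bc5919/aggregates HTTP/1.1" 500 527',
]

print(status_counts(sample))   # -> Counter({'500': 2})
```

If every entry in the real log counts as 500 and none as 200, that rules out a transient blip and points at a service that never worked after the upgrade.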
(In reply to Dan Smith from comment #4)
> File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/db_api.py", line 35, in configure
> [Mon Oct 22 14:04:06.902915 2018] [:error] [pid 16] [remote 192.168.172.18:84]     **_get_db_conf(conf.placement_database))
> [Mon Oct 22 14:04:06.902928 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 788, in configure
> [Mon Oct 22 14:04:06.902955 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._factory.configure(**kw)
> [Mon Oct 22 14:04:06.902968 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/debtcollector/renames.py", line 43, in decorator
> [Mon Oct 22 14:04:06.903001 2018] [:error] [pid 16] [remote 192.168.172.18:84]     return wrapped(*args, **kwargs)
> [Mon Oct 22 14:04:06.903014 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 312, in configure
> [Mon Oct 22 14:04:06.903039 2018] [:error] [pid 16] [remote 192.168.172.18:84]     self._configure(False, kw)
> [Mon Oct 22 14:04:06.903051 2018] [:error] [pid 16] [remote 192.168.172.18:84]   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 317, in _configure
> [Mon Oct 22 14:04:06.903082 2018] [:error] [pid 16] [remote 192.168.172.18:84]     raise TypeError("this TransactionFactory is already started")
> [Mon Oct 22 14:04:06.903105 2018] [:error] [pid 16] [remote 192.168.172.18:84] TypeError: this TransactionFactory is already started

This looks like a dup of bug 1630069 to me, which is caused by starting placement before db_sync completes. There are 2 different patches in progress for that.

*** This bug has been marked as a duplicate of bug 1630069 ***
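The "starting placement before db_sync completes" race can be modeled in miniature: query the traits table before the schema migration has created it. SQLite stands in for MySQL here (error 1146 "Table ... doesn't exist" becomes OperationalError "no such table"); this is an illustration of the failure mode, not placement's actual code:

```python
import sqlite3

# Fresh, unmigrated database: placement starts and immediately tries to
# sync traits, but db_sync has not created the traits table yet.
conn = sqlite3.connect(":memory:")

try:
    conn.execute("SELECT name FROM traits").fetchall()
except sqlite3.OperationalError as e:
    print(e)   # "no such table: traits" -- SQLite's analogue of MySQL 1146

# Once the migration runs, the very same query succeeds.
conn.execute("CREATE TABLE traits (name TEXT PRIMARY KEY)")
print(conn.execute("SELECT name FROM traits").fetchall())   # -> []
```

This matches the first placement traceback in comment #3, where `_trait_sync` issues `SELECT traits.name FROM traits` during WSGI init and gets error 1146 because the migration had not yet run on that controller.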