Description of problem:
-----------------------
An attempt to re-run a failed upgrade of the controller nodes timed out:

openstack overcloud upgrade run \
  --stack overcloud \
  --roles ControllerDeployedServer --playbook all 2>&1
...
2018-11-07 09:15:47.735 160400 INFO osc_lib.shell [-] command: overcloud upgrade run -> tripleoclient.v1.overcloud_upgrade.UpgradeRun (auth=True)
2018-11-07 09:15:47.737 160400 INFO osc_lib.clientmanager [-] Using auth plugin: password
2018-11-07 09:15:47.737 160400 DEBUG osc_lib.clientmanager [-] Using parameters {'username': 'admin', 'project_name': 'admin', 'user_domain_name': 'Default', 'auth_url': 'https://192.168.24.2:13000/', 'password': '***', 'project_domain_name': 'Default'} setup_auth /usr/lib/python2.7/site-packages/osc_lib/clientmanager.py:157
2018-11-07 09:15:47.739 160400 DEBUG osc_lib.clientmanager [-] Get auth_ref auth_ref /usr/lib/python2.7/site-packages/osc_lib/clientmanager.py:201
2018-11-07 09:15:57.552 160400 WARNING tripleoclient.plugin [-] Waiting for messages on queue 'upgrade' with no timeout.
2018-11-07 10:15:56.601 160400 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-07 10:15:56.603 160400 INFO osc_lib.shell [-] END return value: 1

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
python-tripleoclient-10.6.1-0.20181010222404.8c8f259.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud to RHOS-14 (2018-10-30.6)
2. Prepare playbooks for the overcloud upgrade
3. Prepare containers for the overcloud upgrade
4. Make the upgrade fail on any controller
5. Re-run the upgrade

Actual results:
---------------
Failed to re-run the upgrade.

Expected results:
-----------------
Able to re-run the upgrade.

Additional info:
----------------
Virtual setup of pre-provisioned servers - 3 controllers + 2 computes + 3 ceph nodes.
From /var/log/containers/mistral/engine.log:
...
2018-11-07 10:27:28.794 1 INFO workflow_trace [req-e0566c49-1913-41cf-b482-e3cee408f5d3 - - - - -] Workflow 'tripleo.package_update.v1.update_nodes' [RUNNING -> ERROR, msg=Failed to run task [error=Can not evaluate YAQL expression [expression=execution().name, error=WorkflowExecution not found [id=bb5827dc-7643-49e4-85ec-f89a956c2a47], data={}], wf=tripleo.package_update.v1.update_nodes, task=send_message]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/engine/task_handler.py", line 63, in run_task
    task.run()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 390, in run
    self._run_new()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 419, in _run_new
    self._schedule_actions()
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 483, in _schedule_actions
    input_dict = self._get_action_input()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 514, in _get_action_input
    input_dict = self._evaluate_expression(self.task_spec.get_input(), ctx)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 540, in _evaluate_expression
    ctx_view
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 100, in evaluate_recursively
    data[key] = _evaluate_item(data[key], context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 79, in _evaluate_item
    return evaluate(item, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 71, in evaluate
    return evaluator.evaluate(expression, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 159, in evaluate
    cls).evaluate(trim_expr, data_context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 113, in evaluate
    ", data=%s]" % (expression, str(e), data_context)
YaqlEvaluationException: Can not evaluate YAQL expression [expression=execution().name, error=WorkflowExecution not found [id=bb5827dc-7643-49e4-85ec-f89a956c2a47], data={}]
] (execution_id=bb5827dc-7643-49e4-85ec-f89a956c2a47)
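For anyone following along: the mistral engine log only shows the workflow-level failure. The actual Ansible output of the upgrade ends up in package_update.log; on this kind of setup it normally lives under /var/log/mistral/ on the undercloud (the exact path is an assumption, adjust if your deployment logs elsewhere), so it can be tailed while the workflow runs:

  # Follow the Ansible output of the running upgrade workflow
  sudo tail -f /var/log/mistral/package_update.log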
Most of the errors reported there are unfortunately unrelated to the problem. The websocket connection error should probably be caught and ignored. The "WorkflowExecution not found" error is due to the mistral executor being restarted while the ansible action is still running.

The core issue is that the ansible action is not reporting any error. Looking at package_update.log, which seems to be its output, the last lines say:

TASK [Ensure all online data migrations for Cinder have been applied]

It's supposed to run "cinder-manage db online_data_migrations" on controller-0, but nothing happens.
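To narrow down whether the migration itself is stuck or the task never reaches the node, the same step can be tried by hand on controller-0. A minimal sketch, assuming the API container is named cinder_api as in a default OSP 14 deployment:

  # Run the same online data migrations the upgrade task would trigger
  sudo docker exec -u root cinder_api cinder-manage db online_data_migrations

If this also hangs, the problem is on the cinder/database side rather than in the upgrade workflow itself.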
We managed to reproduce the same issue running an upgrade from OSP 13 to OSP 14:

u'TASK [Ensure all online data migrations for Cinder have been applied] **********',
u'Friday 16 November 2018  11:18:12 +0000 (0:00:00.393)       0:00:38.246 ******* ',
u'skipping: [overcloud-controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}',
u'skipping: [overcloud-controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}']
[u'Failed to update nodes - Controller, please see the logs.']
2018-11-16 13:22:15.373 338774 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-16 13:22:15.376 338774 INFO osc_lib.shell [-] END return value: 1

When investigating some of the containers on the overcloud controller nodes, we observed that they had lost connectivity with mysql. The cluster was also down. This caused the following ansible task to hang on the bootstrap controller node:

https://github.com/openstack/tripleo-heat-templates/blob/89d13583f7c706d5c7fbd5052fbedec53474f248/docker/services/cinder-api.yaml#L263

As the ansible-playbook process was blocked, mistral eventually timed out and we got the WebSocketConnectionClosedException. So we should ensure the cluster status is healthy before running the upgrade.
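As a pre-flight check before (re-)running the upgrade, the pacemaker cluster and the bundle resources can be verified on any controller; something along these lines (plain pcs commands, nothing upgrade-specific):

  # On one of the controllers, as root
  pcs status
  pcs status --full | grep -i -E 'galera|rabbitmq|redis'

If any of the bundles show as Stopped or FAILED, fix the cluster first instead of re-running the upgrade.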
When we tried to bring the cluster back via a start, we observed that some of the resources didn't recover:

Failed Actions:
* rabbitmq-bundle-docker-1_monitor_60000 on overcloud-controller-1 'not running' (7): call=84, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* galera-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=80, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:30 2018', queued=0ms, exec=367ms
* galera-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=94, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:31 2018', queued=0ms, exec=249ms
* redis-bundle-docker-1_monitor_60000 on overcloud-controller-1 'not running' (7): call=89, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* rabbitmq-bundle-1_monitor_60000 on overcloud-controller-1 'unknown error' (1): call=10, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:02 2018', queued=0ms, exec=0ms
* redis-bundle-1_monitor_60000 on overcloud-controller-1 'unknown error' (1): call=13, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:17 2018', queued=0ms, exec=0ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=15, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:52:50 2018', queued=0ms, exec=25797ms
* rabbitmq-bundle-docker-2_monitor_60000 on overcloud-controller-2 'not running' (7): call=85, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* galera-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=94, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:31 2018', queued=0ms, exec=277ms
* galera-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=80, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:30 2018', queued=0ms, exec=333ms
* redis-bundle-docker-2_monitor_60000 on overcloud-controller-2 'not running' (7): call=90, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* rabbitmq-bundle-2_monitor_60000 on overcloud-controller-2 'unknown error' (1): call=10, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:02 2018', queued=0ms, exec=0ms
* redis-bundle-2_monitor_60000 on overcloud-controller-2 'unknown error' (1): call=13, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:17 2018', queued=0ms, exec=0ms

The galera-bundle resources fail to start the mariadb container even though the pcmklatest-tagged image is present locally on both controllers:

[root@overcloud-controller-2 ~]# docker image list | grep pcmklatest
172.31.0.1:8787/rhosp14/openstack-cinder-volume   pcmklatest   55d9bcaff11d   9 days ago    1.02 GB
172.31.0.1:8787/rhosp14/openstack-redis           pcmklatest   ec1e56a5d997   9 days ago    606 MB
172.31.0.1:8787/rhosp14/openstack-rabbitmq        pcmklatest   a0971285d430   10 days ago   641 MB
172.31.0.1:8787/rhosp14/openstack-haproxy         pcmklatest   05e996dd8d0e   10 days ago   604 MB
172.31.0.1:8787/rhosp14/openstack-mariadb         pcmklatest   1b590b262ae3   10 days ago   762 MB

[root@overcloud-controller-1 ~]# docker image list | grep pcmklatest
172.31.0.1:8787/rhosp14/openstack-cinder-volume   pcmklatest   55d9bcaff11d   9 days ago    1.02 GB
172.31.0.1:8787/rhosp14/openstack-redis           pcmklatest   ec1e56a5d997   9 days ago    606 MB
172.31.0.1:8787/rhosp14/openstack-rabbitmq        pcmklatest   a0971285d430   10 days ago   641 MB
172.31.0.1:8787/rhosp14/openstack-haproxy         pcmklatest   05e996dd8d0e   10 days ago   604 MB
172.31.0.1:8787/rhosp14/openstack-mariadb         pcmklatest   1b590b262ae3   10 days ago   762 MB

Instead of using the image that is present locally, the bundle start tries to pull it from the registry on the undercloud, where it is not present:

(undercloud) [stack@c01-h01-r620 ~]$ docker image list | grep pcmklatest
(undercloud) [stack@c01-h01-r620 ~]$
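It may be worth cross-checking how the bundle resource references the image versus what is present locally; a rough check (pcs 0.9 syntax as shipped on RHEL 7, adjust for newer pcs versions):

  # Show the image the bundle is configured to run
  pcs resource show galera-bundle | grep image
  # Confirm the tag really exists locally on the controller
  docker images 172.31.0.1:8787/rhosp14/openstack-mariadb
  # Once the image question is sorted out, retry the failed actions
  pcs resource cleanup galera-bundle

If the configured image matches a locally available tag and docker still attempts a pull, the failure is on the docker/pull path rather than in the pacemaker configuration.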
The first upgrade attempt fails on this task:

TASK [take new os-net-config parameters into account now]

This probably created a networking issue on the controllers, because right after it the "Update all packages" task failed with the machines no longer reachable:

2018-11-15 15:00:14,525 p=24270 u=mistral |  TASK [Update all packages] *****************************************************
2018-11-15 15:00:14,525 p=24270 u=mistral |  Thursday 15 November 2018  15:00:14 +0000 (0:00:02.148)       0:29:09.822 *****
2018-11-15 15:06:59,449 p=24270 u=mistral |  fatal: [overcloud-controller-2]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.9\". Make sure this host can be reached over ssh", "unreachable": true}
2018-11-15 15:06:59,557 p=24270 u=mistral |  fatal: [overcloud-controller-1]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.15\". Make sure this host can be reached over ssh", "unreachable": true}
2018-11-15 15:07:17,498 p=24270 u=mistral |  fatal: [overcloud-controller-0]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.14\". Make sure this host can be reached over ssh", "unreachable": true}

This left the overcloud with stopped containers, and we proceeded to retry with the same command. When the services on the overcloud are already stopped and we want to recover, we must skip the validations and pre-upgrade tasks. So we should re-run like this:

openstack overcloud upgrade run --roles Controller --skip-tags validation,pre-upgrade
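Before re-running, it is also worth confirming the controllers are reachable again over ssh from the undercloud; a quick loop over the ctlplane IPs from the log above (tripleo-admin is the user the upgrade playbooks use):

  for h in 172.31.0.14 172.31.0.15 172.31.0.9; do
      ssh -o BatchMode=yes tripleo-admin@$h true && echo "$h reachable"
  done

Otherwise the retry will hit the same UNREACHABLE failures regardless of which tags are skipped.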
I hit this error once more during the Controller node upgrade run. The log from the command shows:

u'TASK [Update all packages] *****************************************************',
u'Thursday 22 November 2018  10:00:01 +0000 (0:00:05.012)       2:38:45.134 ***** ',
<LOOONG LOG FOR THE TASK>
2018-11-22 11:07:10.319 839625 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-22 11:07:10.324 839625 INFO osc_lib.shell [-] END return value: 1

But when checking package_update.log, Ansible kept running tasks well past that point (I can't paste the log, as it got overwritten by a second upgrade retry).

Right after 11:07:10 we can see the following in the mistral executor log:

2018-11-22 11:21:13.898 1 WARNING mistral.executors.default_executor [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'tags': u'', u'become_user': u'root', u'inventory': u'Undercloud:\n hosts:\n undercloud: {}\n vars:\n ansible_connection: local\n ansible_host: localhost\n ansible_remote_tmp: /tmp/ansible-${USER}\n auth_url: https://172.31.0.2:13000/\n cacert: null\n os_auth_token: gAAAAABb9ljM6pq3ld-AsQCJUohZwkd8WzJtS_V4fXNZF0GM_PjsmyfOdoAOi0lHdpZlA9hLxCOvQZLSw1Ytd85q_HoFW0zifPPBsbsc3vyR0ovOQUriS_7cb-z2jWJ-U2YCpn2ppQhOzfvH4GFZUGgcJQYJdYuL5b23ONI3rQoo5EXNYxPWdxE\n overcloud_admin_password: nt4PAG86Wj2XFtyWbhY2attVZ\n overcloud_horizon_url: http://172.21.0.10:80/dashboard\n overcloud_keystone_url: http://172.21.0.10:5000\n plan: overcloud\n project_name: admin\n undercloud_service_list: [openstack-nova-compute, openstack-heat-engine, openstack-ironic-conductor,\n openstack-swift-container, openstack-swift-object, openstack-mistral-engine]\n undercloud_swift_url: https://172.31.0.2:13808/v1/AUTH_fce1111634fc4bd4b4e91236abc9838d\n username: admin\nR620Compute:\n hosts:\n overcloud-620rcompute-0:\n ansible_host: 172.31.0.28\n ctlplane_ip: 172.31.0.28\n deploy_server_id: 32bb82a1-0511-40c5-af67-c279876a2293\n enabled_networks: [management, storage, ctlplane, external, internal_api, storage_mgmt,\n tenant]\n external_ip: 172.31.0.28\n internal_api_ip: 172.26.0.24\n management_ip: 172.31.0.28\n storage_ip: 172.28.0.8\n storage_mgmt_ip: 172.31.0.28\n tenant_ip: 172.27.0.8\n ....more log...
Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.: UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 562, in run
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     self.post_message(queue, ''.join(lines))
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 431, in post_message
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     queue.post(self.format_message(message_part))
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     self._name, messages)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     resp = transport.send(request)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     raise self.http_to_zaqar[resp.status_code](**kwargs)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor
2018-11-22 11:21:14.084 1 INFO mistral.executors.executor_server [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'run_action'[action_ex_id=9597ee02-2f78-42d2-a088-2d04e96c498b, action_cls_str=mistral.actions.std_actions.NoOpAction, action_cls_attrs={}, params={}, timeout=None]
---few more logs---
2018-11-22 11:21:15.011 1 INFO mistral.executors.executor_server [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'run_action'[action_ex_id=c5c07141-f33a-49fb-ba72-7514c29cf59d, action_cls_str=mistral.actions.openstack.actions.ZaqarAction, action_cls_attrs={u'client_method_name': u'queue_post'}, params={queue_name: upgrade, messages: {u'body': {u'type': u'tripleo.package_update.v1.update_nodes', u'..., timeout=None]
2018-11-22 11:21:15.014 1 INFO mistral.actions.std_actions [req-aa8744bf-730c-4cc7-a46b-970c93e282e1 b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - - -] Running no-op action
2018-11-22 11:21:15.989 1 WARNING mistral.actions.openstack.base [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 117, in run
    result = method(**self._kwargs_for_run)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 472, in wrap
    return method(client, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 518, in queue_post
    return queue.post(messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
    self._name, messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
    resp = transport.send(request)
  File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
    raise self.http_to_zaqar[resp.status_code](**kwargs)
UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.

And a similar log in the mistral engine log:

2018-11-22 11:21:13.991 1 INFO mistral.engine.engine_server [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'on_action_complete'[action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, result=Result [data=None, error=The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<clas..., cancel=False]]
2018-11-22 11:21:14.011 1 INFO workflow_trace [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Action 'tripleo.ansible-playbook' (4aeac79c-0830-4efc-9cde-2492166b1b87)(task=node_update) [RUNNING -> ERROR, error = The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<clas...]
2018-11-22 11:21:14.024 1 INFO workflow_trace [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Task 'node_update' (36c11201-6f04-401e-8b10-e375a855421b) [RUNNING -> ERROR, msg=The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'tags': u'', u'become_user': u'root', u'inventory': u'Undercloud:\n hosts:\n undercloud: {}\n vars:\n ansible_connection: local\n ansible_host: localhost\n ansible_remote_tmp: /tmp/ansible-${USER}\n auth_url: https://172.31.0.2:13000/\n cacert: null\n os_auth_token: gAAAAABb9ljM6pq3ld-AsQCJUohZwkd8WzJtS_V4fXNZF0GM_PjsmyfOdoAOi0lHdpZlA9hLxCOvQZLSw1Ytd85q_HoFW0zifPPBsbsc3vyR0ovOQUriS_7cb-z2jWJ-U2YCpn2ppQhOzfvH4GFZUGgcJQYJdYuL5b23ONI3rQoo5EXNYxPWdxE\n overcloud_admin_password: nt4PAG86Wj2XFtyWbhY2attVZ\n overcloud_horizon_url: http://172.21.0.10:80/dashboard\n overcloud_keystone_url: http://172.21.0.10:5000\n plan: overcloud\n project_name: admin\n undercloud_service_list: [openstack-nova- ---more of this log--- Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.] (execution_id=af0e9156-1481-4f45-844e-bc9b7303c666)
2018-11-22 11:21:14.095 1 INFO mistral.engine.engine_server [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'report_running_actions'[action_ex_ids=[u'9597ee02-2f78-42d2-a088-2d04e96c498b']]

If checking the zaqar-server logs at that very same time:

2018-11-22 07:20:53.124 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 200. Request: action "authenticate", body {}.
2018-11-22 07:20:54.494 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 204. Request: action "queue_create", body {"queue_name": "upgrade"}.
2018-11-22 07:20:54.668 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 201. Request: action "subscription_create", body {"queue_name": "upgrade", "ttl": 10000}.
2018-11-22 11:07:10.868 1 INFO zaqar.transport.websocket.protocol [-] WebSocket connection closed: connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)
2018-11-22 16:27:48.374 1 INFO zaqar.transport.websocket.protocol [-] Client connecting: tcp:172.31.0.1:53348
2018-11-22 16:27:48.374 1 INFO zaqar.transport.websocket.protocol [-] WebSocket connection open.
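The Zaqar 401 shows up hours after the initial "authenticate" at 07:20, which suggests the token used by the executor to post messages may simply have expired while the long "Update all packages" task was still running. One way to check the configured token lifetime on the undercloud (the path assumes a containerized OSP 14 undercloud; if the option is unset, the keystone default applies):

  # Token lifetime configured for keystone on the undercloud (seconds)
  sudo grep -E '^\s*expiration' /var/lib/config-data/puppet-generated/keystone/etc/keystone/keystone.conf

If the play runs longer than that lifetime, the executor's attempts to post to the 'upgrade' queue would start failing with 401 exactly like above.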
I also got this log in a second retry I ran:

Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..: ServiceUnavailableError: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 562, in run
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     self.post_message(queue, ''.join(lines))
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 431, in post_message
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     queue.post(self.format_message(message_part))
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     self._name, messages)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     resp = transport.send(request)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     raise self.http_to_zaqar[resp.status_code](**kwargs)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor ServiceUnavailableError: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor
(In reply to Jose Luis Franco from comment #8)
> I also got this log in a second retry I ran:

Anything in the zaqar logs?
The system was running out of space and swift was having trouble, which caused zaqar to fail. Some space was freed up and the upgrade is being retried.
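For future reference, a quick way to spot this condition on the undercloud before retrying (the swift data path below is an assumption, commonly /srv/node on a TripleO undercloud; adjust to your layout):

  # Check free space on the filesystems backing swift and the logs
  df -h /srv/node /var

A filesystem close to 100% use under swift's data directory translates into Zaqar 503 "Messages could not be enqueued" errors like the one above.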
The system was running out of space, which caused unexpected behavior in swift that in turn impacted zaqar. As this was a corner case, let's close the bug and reopen it if it's hit again.