Description of problem:
-----------------------
An attempt to re-run a failed upgrade of the controller nodes timed out:

openstack overcloud upgrade run \
  --stack overcloud \
  --roles ControllerDeployedServer --playbook all 2>&1
...
2018-11-07 09:15:47.735 160400 INFO osc_lib.shell [-] command: overcloud upgrade run -> tripleoclient.v1.overcloud_upgrade.UpgradeRun (auth=True)
2018-11-07 09:15:47.737 160400 INFO osc_lib.clientmanager [-] Using auth plugin: password
2018-11-07 09:15:47.737 160400 DEBUG osc_lib.clientmanager [-] Using parameters {'username': 'admin', 'project_name': 'admin', 'user_domain_name': 'Default', 'auth_url': 'https://192.168.24.2:13000/', 'password': '***', 'project_domain_name': 'Default'} setup_auth /usr/lib/python2.7/site-packages/osc_lib/clientmanager.py:157
2018-11-07 09:15:47.739 160400 DEBUG osc_lib.clientmanager [-] Get auth_ref auth_ref /usr/lib/python2.7/site-packages/osc_lib/clientmanager.py:201
2018-11-07 09:15:57.552 160400 WARNING tripleoclient.plugin [-] Waiting for messages on queue 'upgrade' with no timeout.
2018-11-07 10:15:56.601 160400 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-07 10:15:56.603 160400 INFO osc_lib.shell [-] END return value: 1

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
python-tripleoclient-10.6.1-0.20181010222404.8c8f259.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud to RHOS-14 (2018-10-30.6)
2. Prepare playbooks for the overcloud upgrade
3. Prepare containers for the overcloud upgrade
4. Make the upgrade fail on any controller
5. Re-run the upgrade

Actual results:
---------------
Failed to re-run the upgrade.

Expected results:
-----------------
Able to re-run the upgrade.

Additional info:
----------------
Virtual setup of pre-provisioned servers - 3 controllers + 2 computes + 3 ceph nodes.
From /var/log/containers/mistral/engine.log:
...
2018-11-07 10:27:28.794 1 INFO workflow_trace [req-e0566c49-1913-41cf-b482-e3cee408f5d3 - - - - -] Workflow 'tripleo.package_update.v1.update_nodes' [RUNNING -> ERROR, msg=Failed to run task [error=Can not evaluate YAQL expression [expression=execution().name, error=WorkflowExecution not found [id=bb5827dc-7643-49e4-85ec-f89a956c2a47], data={}], wf=tripleo.package_update.v1.update_nodes, task=send_message]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/engine/task_handler.py", line 63, in run_task
    task.run()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 390, in run
    self._run_new()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 419, in _run_new
    self._schedule_actions()
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 483, in _schedule_actions
    input_dict = self._get_action_input()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 514, in _get_action_input
    input_dict = self._evaluate_expression(self.task_spec.get_input(), ctx)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 540, in _evaluate_expression
    ctx_view
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 100, in evaluate_recursively
    data[key] = _evaluate_item(data[key], context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 79, in _evaluate_item
    return evaluate(item, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 71, in evaluate
    return evaluator.evaluate(expression, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 159, in evaluate
    cls).evaluate(trim_expr, data_context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 113, in evaluate
    ", data=%s]" % (expression, str(e), data_context)
YaqlEvaluationException: Can not evaluate YAQL expression [expression=execution().name, error=WorkflowExecution not found [id=bb5827dc-7643-49e4-85ec-f89a956c2a47], data={}]
] (execution_id=bb5827dc-7643-49e4-85ec-f89a956c2a47)
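For anyone following along: the mistral engine log only shows the workflow-level failure. The actual Ansible output of the upgrade ends up in package_update.log; on this kind of setup it normally lives under /var/log/mistral/ on the undercloud (the exact path is an assumption, adjust if your deployment logs elsewhere), so it can be tailed while the workflow runs:

  # Follow the Ansible output of the running upgrade workflow
  sudo tail -f /var/log/mistral/package_update.log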
Most of the errors reported there are unfortunately unrelated to the problem. The websocket connection error should probably be caught and ignored. The "WorkflowExecution not found" error is due to the mistral executor being restarted while the ansible action is still running.

The core issue is that the ansible action is not reporting any error. Looking at package_update.log, which seems to be its output, the last lines say:

TASK [Ensure all online data migrations for Cinder have been applied]

It's supposed to run "cinder-manage db online_data_migrations" on controller-0, but nothing happens.
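To narrow down whether the migration itself is stuck or the task never reaches the node, the same step can be tried by hand on controller-0. A minimal sketch, assuming the API container is named cinder_api as in a default OSP 14 deployment:

  # Run the same online data migrations the upgrade task would trigger
  sudo docker exec -u root cinder_api cinder-manage db online_data_migrations

If this also hangs, the problem is on the cinder/database side rather than in the upgrade workflow itself.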
We managed to reproduce the same issue running an upgrade from OSP 13 to OSP 14:

u'TASK [Ensure all online data migrations for Cinder have been applied] **********',
u'Friday 16 November 2018  11:18:12 +0000 (0:00:00.393)       0:00:38.246 ******* ',
u'skipping: [overcloud-controller-2] => {"changed": false, "skip_reason": "Conditional result was False"}',
u'skipping: [overcloud-controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}']
[u'Failed to update nodes - Controller, please see the logs.']
2018-11-16 13:22:15.373 338774 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-16 13:22:15.376 338774 INFO osc_lib.shell [-] END return value: 1

When investigating some of the containers on the overcloud controller nodes, we observed that they had lost connectivity with mysql. The cluster was also down. This caused the following ansible task to hang on the bootstrap controller node:

https://github.com/openstack/tripleo-heat-templates/blob/89d13583f7c706d5c7fbd5052fbedec53474f248/docker/services/cinder-api.yaml#L263

As the ansible-playbook process was blocked, mistral eventually timed out and we got the WebSocketConnectionClosedException. So we should ensure the cluster status is healthy before running the upgrade.
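As a pre-flight check before (re-)running the upgrade, the pacemaker cluster and the bundle resources can be verified on any controller; something along these lines (plain pcs commands, nothing upgrade-specific):

  # On one of the controllers, as root
  pcs status
  pcs status --full | grep -i -E 'galera|rabbitmq|redis'

If any of the bundles show as Stopped or FAILED, fix the cluster first instead of re-running the upgrade.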
When we tried to bring the cluster back via a start, we observed that some of the resources didn't recover:

Failed Actions:
* rabbitmq-bundle-docker-1_monitor_60000 on overcloud-controller-1 'not running' (7): call=84, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* galera-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=80, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:30 2018', queued=0ms, exec=367ms
* galera-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=94, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:31 2018', queued=0ms, exec=249ms
* redis-bundle-docker-1_monitor_60000 on overcloud-controller-1 'not running' (7): call=89, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* rabbitmq-bundle-1_monitor_60000 on overcloud-controller-1 'unknown error' (1): call=10, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:02 2018', queued=0ms, exec=0ms
* redis-bundle-1_monitor_60000 on overcloud-controller-1 'unknown error' (1): call=13, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:17 2018', queued=0ms, exec=0ms
* rabbitmq_start_0 on rabbitmq-bundle-0 'unknown error' (1): call=15, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:52:50 2018', queued=0ms, exec=25797ms
* rabbitmq-bundle-docker-2_monitor_60000 on overcloud-controller-2 'not running' (7): call=85, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* galera-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=94, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:31 2018', queued=0ms, exec=277ms
* galera-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=80, status=complete, exitreason='failed to pull image 172.31.0.1:8787/rhosp14/openstack-mariadb:pcmklatest',
    last-rc-change='Fri Nov 16 14:52:30 2018', queued=0ms, exec=333ms
* redis-bundle-docker-2_monitor_60000 on overcloud-controller-2 'not running' (7): call=90, status=complete, exitreason='',
    last-rc-change='Fri Nov 16 14:53:31 2018', queued=0ms, exec=0ms
* rabbitmq-bundle-2_monitor_60000 on overcloud-controller-2 'unknown error' (1): call=10, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:02 2018', queued=0ms, exec=0ms
* redis-bundle-2_monitor_60000 on overcloud-controller-2 'unknown error' (1): call=13, status=Error, exitreason='',
    last-rc-change='Fri Nov 16 14:53:17 2018', queued=0ms, exec=0ms

The galera-bundle resources fail to start the mariadb container even though the pcmklatest-tagged image is present locally on both controllers:

[root@overcloud-controller-2 ~]# docker image list | grep pcmklatest
172.31.0.1:8787/rhosp14/openstack-cinder-volume   pcmklatest   55d9bcaff11d   9 days ago    1.02 GB
172.31.0.1:8787/rhosp14/openstack-redis           pcmklatest   ec1e56a5d997   9 days ago    606 MB
172.31.0.1:8787/rhosp14/openstack-rabbitmq        pcmklatest   a0971285d430   10 days ago   641 MB
172.31.0.1:8787/rhosp14/openstack-haproxy         pcmklatest   05e996dd8d0e   10 days ago   604 MB
172.31.0.1:8787/rhosp14/openstack-mariadb         pcmklatest   1b590b262ae3   10 days ago   762 MB

[root@overcloud-controller-1 ~]# docker image list | grep pcmklatest
172.31.0.1:8787/rhosp14/openstack-cinder-volume   pcmklatest   55d9bcaff11d   9 days ago    1.02 GB
172.31.0.1:8787/rhosp14/openstack-redis           pcmklatest   ec1e56a5d997   9 days ago    606 MB
172.31.0.1:8787/rhosp14/openstack-rabbitmq        pcmklatest   a0971285d430   10 days ago   641 MB
172.31.0.1:8787/rhosp14/openstack-haproxy         pcmklatest   05e996dd8d0e   10 days ago   604 MB
172.31.0.1:8787/rhosp14/openstack-mariadb         pcmklatest   1b590b262ae3   10 days ago   762 MB

Instead of using the image that is present locally, the bundle start tries to pull it from the registry on the undercloud, where it is not present:

(undercloud) [stack@c01-h01-r620 ~]$ docker image list | grep pcmklatest
(undercloud) [stack@c01-h01-r620 ~]$
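It may be worth cross-checking how the bundle resource references the image versus what is present locally; a rough check (pcs 0.9 syntax as shipped on RHEL 7, adjust for newer pcs versions):

  # Show the image the bundle is configured to run
  pcs resource show galera-bundle | grep image
  # Confirm the tag really exists locally on the controller
  docker images 172.31.0.1:8787/rhosp14/openstack-mariadb
  # Once the image question is sorted out, retry the failed actions
  pcs resource cleanup galera-bundle

If the configured image matches a locally available tag and docker still attempts a pull, the failure is on the docker/pull path rather than in the pacemaker configuration.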
The first upgrade attempt fails on this task:

TASK [take new os-net-config parameters into account now]

This probably created a networking issue on the controllers, because right after it the "Update all packages" task failed with the machines no longer reachable:

2018-11-15 15:00:14,525 p=24270 u=mistral |  TASK [Update all packages] *****************************************************
2018-11-15 15:00:14,525 p=24270 u=mistral |  Thursday 15 November 2018  15:00:14 +0000 (0:00:02.148)       0:29:09.822 *****
2018-11-15 15:06:59,449 p=24270 u=mistral |  fatal: [overcloud-controller-2]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.9\". Make sure this host can be reached over ssh", "unreachable": true}
2018-11-15 15:06:59,557 p=24270 u=mistral |  fatal: [overcloud-controller-1]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.15\". Make sure this host can be reached over ssh", "unreachable": true}
2018-11-15 15:07:17,498 p=24270 u=mistral |  fatal: [overcloud-controller-0]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"172.31.0.14\". Make sure this host can be reached over ssh", "unreachable": true}

This left the overcloud with stopped containers, and we proceeded to retry with the same command. When the services on the overcloud are already stopped and we want to recover, we must skip the validations and pre-upgrade tasks. So we should re-run like this:

openstack overcloud upgrade run --roles Controller --skip-tags validation,pre-upgrade
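Before re-running, it is also worth confirming the controllers are reachable again over ssh from the undercloud; a quick loop over the ctlplane IPs from the log above (tripleo-admin is the user the upgrade playbooks use):

  for h in 172.31.0.14 172.31.0.15 172.31.0.9; do
      ssh -o BatchMode=yes tripleo-admin@$h true && echo "$h reachable"
  done

Otherwise the retry will hit the same UNREACHABLE failures regardless of which tags are skipped.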
I hit this error once more during the Controller node upgrade run. The log from the command shows:

u'TASK [Update all packages] *****************************************************',
u'Thursday 22 November 2018  10:00:01 +0000 (0:00:05.012)       2:38:45.134 ***** ',
<LOOONG LOG FOR THE TASK>
2018-11-22 11:07:10.319 839625 ERROR openstack [-] Connection is already closed.: WebSocketConnectionClosedException: Connection is already closed.
2018-11-22 11:07:10.324 839625 INFO osc_lib.shell [-] END return value: 1

But when checking package_update.log, Ansible kept running tasks well past that point (I can't paste the log, as it got overwritten by a second upgrade retry).

Right after 11:07:10 we can see the following in the mistral executor log:

2018-11-22 11:21:13.898 1 WARNING mistral.executors.default_executor [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'tags': u'', u'become_user': u'root', u'inventory': u'Undercloud:\n hosts:\n undercloud: {}\n vars:\n ansible_connection: local\n ansible_host: localhost\n ansible_remote_tmp: /tmp/ansible-${USER}\n auth_url: https://172.31.0.2:13000/\n cacert: null\n os_auth_token: gAAAAABb9ljM6pq3ld-AsQCJUohZwkd8WzJtS_V4fXNZF0GM_PjsmyfOdoAOi0lHdpZlA9hLxCOvQZLSw1Ytd85q_HoFW0zifPPBsbsc3vyR0ovOQUriS_7cb-z2jWJ-U2YCpn2ppQhOzfvH4GFZUGgcJQYJdYuL5b23ONI3rQoo5EXNYxPWdxE\n overcloud_admin_password: nt4PAG86Wj2XFtyWbhY2attVZ\n overcloud_horizon_url: http://172.21.0.10:80/dashboard\n overcloud_keystone_url: http://172.21.0.10:5000\n plan: overcloud\n project_name: admin\n undercloud_service_list: [openstack-nova-compute, openstack-heat-engine, openstack-ironic-conductor,\n openstack-swift-container, openstack-swift-object, openstack-mistral-engine]\n undercloud_swift_url: https://172.31.0.2:13808/v1/AUTH_fce1111634fc4bd4b4e91236abc9838d\n username: admin\nR620Compute:\n hosts:\n overcloud-620rcompute-0:\n ansible_host: 172.31.0.28\n ctlplane_ip: 172.31.0.28\n deploy_server_id: 32bb82a1-0511-40c5-af67-c279876a2293\n enabled_networks: [management, storage, ctlplane, external, internal_api, storage_mgmt,\n tenant]\n external_ip: 172.31.0.28\n internal_api_ip: 172.26.0.24\n management_ip: 172.31.0.28\n storage_ip: 172.28.0.8\n storage_mgmt_ip: 172.31.0.28\n tenant_ip: 172.27.0.8\n ....more log...
Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.: UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 562, in run
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     self.post_message(queue, ''.join(lines))
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 431, in post_message
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     queue.post(self.format_message(message_part))
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     self._name, messages)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     resp = transport.send(request)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor     raise self.http_to_zaqar[resp.status_code](**kwargs)
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.
2018-11-22 11:21:13.898 1 ERROR mistral.executors.default_executor
2018-11-22 11:21:14.084 1 INFO mistral.executors.executor_server [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'run_action'[action_ex_id=9597ee02-2f78-42d2-a088-2d04e96c498b, action_cls_str=mistral.actions.std_actions.NoOpAction, action_cls_attrs={}, params={}, timeout=None]
---few more logs---
2018-11-22 11:21:15.011 1 INFO mistral.executors.executor_server [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'run_action'[action_ex_id=c5c07141-f33a-49fb-ba72-7514c29cf59d, action_cls_str=mistral.actions.openstack.actions.ZaqarAction, action_cls_attrs={u'client_method_name': u'queue_post'}, params={queue_name: upgrade, messages: {u'body': {u'type': u'tripleo.package_update.v1.update_nodes', u'..., timeout=None]
2018-11-22 11:21:15.014 1 INFO mistral.actions.std_actions [req-aa8744bf-730c-4cc7-a46b-970c93e282e1 b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - - -] Running no-op action
2018-11-22 11:21:15.989 1 WARNING mistral.actions.openstack.base [req-a1249437-143f-4cd3-b5cf-3b65a8141a3c b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 117, in run
    result = method(**self._kwargs_for_run)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 472, in wrap
    return method(client, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 518, in queue_post
    return queue.post(messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
    self._name, messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
    resp = transport.send(request)
  File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
    raise self.http_to_zaqar[resp.status_code](**kwargs)
UnauthorizedError: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.

And a similar log in the mistral engine log:

2018-11-22 11:21:13.991 1 INFO mistral.engine.engine_server [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'on_action_complete'[action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, result=Result [data=None, error=The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<clas..., cancel=False]]
2018-11-22 11:21:14.011 1 INFO workflow_trace [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Action 'tripleo.ansible-playbook' (4aeac79c-0830-4efc-9cde-2492166b1b87)(task=node_update) [RUNNING -> ERROR, error = The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<clas...]
2018-11-22 11:21:14.024 1 INFO workflow_trace [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Task 'node_update' (36c11201-6f04-401e-8b10-e375a855421b) [RUNNING -> ERROR, msg=The action raised an exception [action_ex_id=4aeac79c-0830-4efc-9cde-2492166b1b87, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'tags': u'', u'become_user': u'root', u'inventory': u'Undercloud:\n hosts:\n undercloud: {}\n vars:\n ansible_connection: local\n ansible_host: localhost\n ansible_remote_tmp: /tmp/ansible-${USER}\n auth_url: https://172.31.0.2:13000/\n cacert: null\n os_auth_token: gAAAAABb9ljM6pq3ld-AsQCJUohZwkd8WzJtS_V4fXNZF0GM_PjsmyfOdoAOi0lHdpZlA9hLxCOvQZLSw1Ytd85q_HoFW0zifPPBsbsc3vyR0ovOQUriS_7cb-z2jWJ-U2YCpn2ppQhOzfvH4GFZUGgcJQYJdYuL5b23ONI3rQoo5EXNYxPWdxE\n overcloud_admin_password: nt4PAG86Wj2XFtyWbhY2attVZ\n overcloud_horizon_url: http://172.21.0.10:80/dashboard\n overcloud_keystone_url: http://172.21.0.10:5000\n plan: overcloud\n project_name: admin\n undercloud_service_list: [openstack-nova- ---more of this log--- Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}.] (execution_id=af0e9156-1481-4f45-844e-bc9b7303c666)
2018-11-22 11:21:14.095 1 INFO mistral.engine.engine_server [req-e9c6ee09-7468-4b4d-ad3b-f339e0360b7a b6dab204393e4546b90cfa7de1abbae0 fce1111634fc4bd4b4e91236abc9838d - default default] Received RPC request 'report_running_actions'[action_ex_ids=[u'9597ee02-2f78-42d2-a088-2d04e96c498b']]

If checking the zaqar-server logs at that very same time:

2018-11-22 07:20:53.124 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 200. Request: action "authenticate", body {}.
2018-11-22 07:20:54.494 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 204. Request: action "queue_create", body {"queue_name": "upgrade"}.
2018-11-22 07:20:54.668 1 INFO zaqar.transport.websocket.protocol [-] Response: API v2 txt, 201. Request: action "subscription_create", body {"queue_name": "upgrade", "ttl": 10000}.
2018-11-22 11:07:10.868 1 INFO zaqar.transport.websocket.protocol [-] WebSocket connection closed: connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)
2018-11-22 16:27:48.374 1 INFO zaqar.transport.websocket.protocol [-] Client connecting: tcp:172.31.0.1:53348
2018-11-22 16:27:48.374 1 INFO zaqar.transport.websocket.protocol [-] WebSocket connection open.
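The Zaqar 401 shows up hours after the initial "authenticate" at 07:20, which suggests the token used by the executor to post messages may simply have expired while the long "Update all packages" task was still running. One way to check the configured token lifetime on the undercloud (the path assumes a containerized OSP 14 undercloud; if the option is unset, the keystone default applies):

  # Token lifetime configured for keystone on the undercloud (seconds)
  sudo grep -E '^\s*expiration' /var/lib/config-data/puppet-generated/keystone/etc/keystone/keystone.conf

If the play runs longer than that lifetime, the executor's attempts to post to the 'upgrade' queue would start failing with 401 exactly like above.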
I also got this log in a second retry I ran:

Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..: ServiceUnavailableError: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 562, in run
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     self.post_message(queue, ''.join(lines))
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 431, in post_message
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     queue.post(self.format_message(message_part))
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     self._name, messages)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     resp = transport.send(request)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor     raise self.http_to_zaqar[resp.status_code](**kwargs)
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor ServiceUnavailableError: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Messages could not be enqueued. Please try again in a few seconds..
2018-11-23 14:06:10.200 1 ERROR mistral.executors.default_executor
(In reply to Jose Luis Franco from comment #8)
> I also got this log in a second retry I ran:

Anything in the zaqar logs?
The system was running out of space and swift was having trouble, which caused zaqar to fail. Some space was freed up and the upgrade is being retried.
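For future reference, a quick way to spot this condition on the undercloud before retrying (the swift data path below is an assumption, commonly /srv/node on a TripleO undercloud; adjust to your layout):

  # Check free space on the filesystems backing swift and the logs
  df -h /srv/node /var

A filesystem close to 100% use under swift's data directory translates into Zaqar 503 "Messages could not be enqueued" errors like the one above.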
The system was running out of space, which caused unexpected behavior in swift that in turn impacted zaqar. As this was a corner case, let's close the bug and reopen it if it's hit again.