Occasionally (at the moment it seems quite rare; it did not reproduce again in a few runs) the `openstack overcloud deploy` command fails after 3 minutes with this output:

> 2018-03-06 03:05:41.660 8512 WARNING tripleoclient.plugin [ admin] Waiting for messages on queue 'tripleo' with no timeout.
> 2018-03-06 03:08:43.103 8512 ERROR openstack [ admin] {u'execution': {u'created_at': u'2018-03-06 08:08:36',
> u'id': u'7d1b05de-2c5b-4bbe-a484-1d09887ad03f',
> u'input': {u'logging_container': u'tripleo-ui-logs',
> u'logging_queue_name': u'tripleo-ui-logging',
> u'queue_name': u'tripleo'},
> u'name': u'tripleo.plan_management.v1.publish_ui_logs_to_swift',
> u'params': {u'namespace': u''},
> u'spec': {u'description': u'This workflow drains a zaqar queue, and publish its messages into a log file in swift. This workflow is called by cron trigger.\n',
> u'input': [{u'queue_name': u'tripleo'},
> {u'logging_queue_name': u'tripleo-ui-logging'},
> {u'logging_container': u'tripleo-ui-logs'}],
> u'name': u'publish_ui_logs_to_swift',
> u'tags': [u'tripleo-common-managed'],
> u'tasks': {u'delete_messages': {u'action': u'zaqar.delete_messages',
> u'input': {u'messages': u'<% $.message_ids %>',
> u'queue_name': u'<% $.logging_queue_name %>'},
> u'name': u'delete_messages',
> u'on-error': u'notify_zaqar',
> u'on-success': u'get_messages',
> u'publish': {u'status': u'SUCCESS'},
> u'publish-on-error': {u'message': u'<% task().result %>',
> u'status': u'FAILED'},
> u'type': u'direct',
> u'version': u'2.0'},
> u'format_messages': {u'action': u'tripleo.logging_to_swift.format_messages',
> u'input': {u'messages': u'<% $.messages %>'},
> u'name': u'format_messages',
> u'on-success': u'upload_to_swift',
> u'publish': {u'formatted_messages': u'<% task().result %>',
> u'status': u'SUCCESS'},
> u'type': u'direct',
> u'version': u'2.0'},
> u'get_messages': {u'action': u'zaqar.claim_messages',
> u'input': {u'grace': 60,
> u'queue_name': u'<% $.logging_queue_name %>',
> u'ttl': 60},
> u'name': u'get_messages',
> u'on-error': u'notify_zaqar',
> u'on-success': [{u'format_messages': u'<% task().result.len() > 0 %>'},
> {u'notify_zaqar': u'<% task().result.len() = 0 %>'}],
> u'publish': {u'message_ids': u'<% task().result.select($._id) %>',
> u'messages': u'<% task().result %>',
> u'status': u'SUCCESS'},
> u'publish-on-error': {u'message': u'<% task().result %>',
> u'status': u'FAILED'},
> u'type': u'direct',
> u'version': u'2.0'},
> u'notify_zaqar': {u'action': u'zaqar.queue_post',
> u'input': {u'messages': {u'body': {u'payload': {u'execution': u'<% execution() %>',
> u'message': u"<% $.get('message', '') %>",
> u'status': u'<% $.status %>'},
> u'type': u'tripleo.plan_management.v1.publish_ui_logs_to_swift'}},
> u'queue_name': u'<% $.queue_name %>'},
> u'name': u'notify_zaqar',
> u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}],
> u'type': u'direct',
> u'version': u'2.0'},
> u'start': {u'name': u'start',
> u'on-success': u'get_messages',
> u'type': u'direct',
> u'version': u'2.0'},
> u'upload_to_swift': {u'action': u'tripleo.logging_to_swift.publish_ui_log_to_swift',
> u'input': {u'logging_container': u'<% $.logging_container %>',
> u'logging_data': u'<% $.formatted_messages %>'},
> u'name': u'upload_to_swift',
> u'on-error': u'notify_zaqar',
> u'on-success': u'delete_messages',
> u'publish': {u'status': u'SUCCESS'},
> u'publish-on-error': {u'message': u'<% task().result %>',
> u'status': u'FAILED'},
> u'type': u'direct',
> u'version': u'2.0'}},
> u'version': u'2.0'},
> u'updated_at': u'2018-03-06 08:08:36'},
> u'message': u"Failed to run action [action_ex_id=e7186a9c-07c4-4344-948c-4543e63727be, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']\n ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..",
> u'status': u'FAILED'}

No error is visible in the zaqar log, though (aside from one occurrence of bug #1538644). This could be similar/related to bug #1518221.

> python-tripleoclient.noarch 9.1.1-0.20180220030014.59fe75b.el7ost @rhelosp-13.0-puddle
> openstack-zaqar.noarch 1:6.0.0-0.20180220112259.c52e8e4.el7ost
> puppet-zaqar.noarch 12.3.1-0.20180221093329.8b554f3.el7ost
> python2-zaqarclient.noarch 1.9.0-0.20180211232746.c1689bf.el7ost
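For context, ZaqarAction.claim_messages is essentially a claim request against Zaqar's v2 REST API. A rough equivalent of the failing call, as a sketch only ($ZAQAR_URL and $TOKEN are placeholders for the undercloud Zaqar endpoint and a valid Keystone token), would be:

$ curl -s -X POST "$ZAQAR_URL/v2/queues/tripleo-ui-logging/claims" \
    -H "Content-Type: application/json" \
    -H "X-Auth-Token: $TOKEN" \
    -H "Client-ID: $(uuidgen)" \
    -d '{"ttl": 60, "grace": 60}'

When Zaqar is healthy this should come back with 201 (messages claimed) or 204 (queue empty); the failure above corresponds to Zaqar answering this step with a 503.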
Reproduced on my BM setup with 3 controllers, 1 compute, and 3 ceph nodes (core_puddle: 2018-03-02.2):

Environment:
-------------
openstack-mistral-engine-6.0.0-0.20180222191310.0cea32a.0rc1.el7ost.noarch

Steps:
Attempt to deploy the overcloud with the latest OSP 13.

Results:
Deployment started and failed after 4 hours:

| 011273a2-71e6-4d3a-a419-2b6641435a8a | overcloud | CREATE_FAILED | 2018-02-20T21:17:37Z | None | f89475a3df8544faaababdec38604bbf |
| AllNodesDeploySteps | d23842c4-9ea9-44dc-8fe2-ea63a610f2bb | OS::TripleO::PostDeploySteps | CREATE_FAILED | 2018-02-20T21:17:41Z | overcloud |
| WorkflowTasks_Step2_Execution | 76d3cf50-578e-49e0-8f87-837006495852 | OS::Mistral::ExternalResource | CREATE_FAILED | 2018-02-20T21:42:28Z | overcloud-AllNodesDeploySteps-iwgm2ix4v6g4 |

[root@undercloud75 ~]# grep ERROR /var/log/mistral/executor.log
----------------------------------------------------------------
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor [req-6019b2e3-7c27-4c77-8570-bc5caa24ea79 495e807e120342698105ebce42b2a2c7 f89475a3df8544faaababdec38604bbf - default default] Failed to run action [action_ex_id=11bfe43e-7072-4211-8216-3a0a4cbea64b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor ActionException: ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor
*** Bug 1552101 has been marked as a duplicate of this bug. ***
Just happened upstream in a check job as well. Added launchpad issue.
In addition to the linked patch, we should also consider disabling Mistral's cron triggers in CI.
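If we do that, one possible (untested) approach on the undercloud would be the Mistral CLI's cron-trigger commands; the trigger name below is a placeholder that would have to be taken from the list output:

$ mistral cron-trigger-list
$ mistral cron-trigger-delete <trigger-name>    # the trigger that runs publish_ui_logs_to_swift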
How can we reproduce this? How should we test?
On the undercloud:

$ openstack messaging queue list

Make sure that tripleo-ui-logging isn't on the list. If it is, delete it. Close all GUI browser tabs.

$ mistral execution-create tripleo.plan_management.v1.publish_ui_logs_to_swift

Make sure that the execution results in SUCCESS.
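For convenience, the same check as a single sequence (a sketch; <execution-id> is a placeholder for the ID printed by execution-create, and the stackrc path assumes the usual undercloud layout):

$ source ~/stackrc
$ openstack messaging queue list                          # tripleo-ui-logging must not be listed
$ openstack messaging queue delete tripleo-ui-logging     # only if it was listed
$ mistral execution-create tripleo.plan_management.v1.publish_ui_logs_to_swift
$ mistral execution-get <execution-id>                    # state should end up as SUCCESS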
*** Bug 1538644 has been marked as a duplicate of this bug. ***
I have seen this error a couple of times on fresh deployments of OSP 13. It seems to happen when we do introspection for the first time; I don't usually see it again afterwards.

(undercloud) [stack@undercloud ~]$ source /home/stack/stackrc && openstack overcloud node import /home/stack/instackenv.json --provide
Started Mistral Workflow tripleo.baremetal.v1.register_or_update. Execution ID: 6fe931bb-ba1e-4ea2-82f7-442d87d1aad0
Waiting for messages on queue 'tripleo' with no timeout.
Failed to run action [action_ex_id=9fd7115f-e9fc-4c3c-b97e-05f8116de97b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}'] ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..
Nodes set to managed.
Successfully registered node UUID 542dc1ca-278a-4c21-8a62-1a270a383a3d
Successfully registered node UUID d9d27b21-2522-45c1-b8d6-60c3353b16b0
Successfully registered node UUID 14f3427b-0d9c-43ea-ad16-8e8e3b6930f5
Successfully registered node UUID 86464e98-d51d-4462-8075-56259c6d549f
Started Mistral Workflow tripleo.baremetal.v1.provide. Execution ID: 0534b1c9-5897-4656-ad01-e5418fc6e03f

(undercloud) [stack@undercloud ~]$ openstack messaging queue list
+--------------------------------------+
| Name                                 |
+--------------------------------------+
| 83f24d2e-c8ed-4bc3-a1fd-0a32f8b2b810 |
| tripleo                              |
+--------------------------------------+
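Note that tripleo-ui-logging is absent from that queue list. Whether that is actually related to the 503 is not established here, but purely as a diagnostic sketch one could pre-create the queue before re-running the import and see if the error still appears:

$ openstack messaging queue list | grep -q tripleo-ui-logging || \
      openstack messaging queue create tripleo-ui-logging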
These messages/tracebacks appear during a more or less regular deployment workflow, in my case as driven by InfraRed: undercloud install, node import, images, container images, overcloud deploy. Observing undercloud-0:/var/log/zaqar/zaqar.log or the Mistral executor logs afterwards shows the Zaqar tracebacks. Based on mlammon's comment 13 it seems even less than the whole flow is enough; I did not try to confirm that, but could if wanted.
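For reference, the checks described here boil down to grepping the two logs on the undercloud; the second pattern is only a guess at a useful filter for the Zaqar side:

$ grep ERROR /var/log/mistral/executor.log
$ grep -i 'claim could not be created' /var/log/zaqar/zaqar.log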
part of a build now
Unable to reproduce with: openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost.noarch
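To check whether a given undercloud already carries that build (or a newer one), a quick package query is enough; compare the output against the NVR above:

$ rpm -q openstack-tripleo-common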
As there is a lot of confusion around this bug, I'll try to clarify it. The error, for now (until we sync Zaqar Queens), will continue to happen in Mistral; namely, tripleo.plan_management.v1.publish_ui_logs_to_swift will continue to fail. You should *NOT* check its status as mentioned in #11. The bug here is that the *deploy* failed because of this error. It shouldn't fail with the linked tripleo-common build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086