Bug 1552008 - OC deploy fails with ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable.
Summary: OC deploy fails with ZaqarAction.claim_messages failed: Error response from Z...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: beta
: 13.0 (Queens)
Assignee: Honza Pokorny
QA Contact: Yogev Rabl
URL:
Whiteboard:
: 1538644 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-03-06 10:28 UTC by Pavel Sedlák
Modified: 2018-08-03 18:32 UTC (History)
26 users (show)

Fixed In Version: openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:46:51 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1754061 0 None None None 2018-03-07 15:04:30 UTC
OpenStack gerrit 552452 0 None MERGED Don't notify zaqar in publish_ui_logs_to_swift 2020-11-05 06:30:51 UTC
Red Hat Product Errata RHEA-2018:2086 0 None None None 2018-06-27 13:48:04 UTC

Description Pavel Sedlák 2018-03-06 10:28:44 UTC
Occasionally (atm seems quite rare, did not reproduced again in few runs)  `openstack overcloud deploy` command fails after 3 minutes
with output:

> 2018-03-06 03:05:41.660 8512 WARNING tripleoclient.plugin [  admin] Waiting for messages on queue 'tripleo' with no timeout.
> 2018-03-06 03:08:43.103 8512 ERROR openstack [  admin] {u'execution': {u'created_at': u'2018-03-06 08:08:36',
>                 u'id': u'7d1b05de-2c5b-4bbe-a484-1d09887ad03f',
>                 u'input': {u'logging_container': u'tripleo-ui-logs',
>                            u'logging_queue_name': u'tripleo-ui-logging',
>                            u'queue_name': u'tripleo'},
>                 u'name': u'tripleo.plan_management.v1.publish_ui_logs_to_swift',
>                 u'params': {u'namespace': u''},
>                 u'spec': {u'description': u'This workflow drains a zaqar queue, and publish its messages into a log file in swift.  This workflow is called by cron trigger.\n',
>                           u'input': [{u'queue_name': u'tripleo'},
>                                      {u'logging_queue_name': u'tripleo-ui-logging'},
>                                      {u'logging_container': u'tripleo-ui-logs'}],
>                           u'name': u'publish_ui_logs_to_swift',
>                           u'tags': [u'tripleo-common-managed'],
>                           u'tasks': {u'delete_messages': {u'action': u'zaqar.delete_messages',
>                                                           u'input': {u'messages': u'<% $.message_ids %>',
>                                                                      u'queue_name': u'<% $.logging_queue_name %>'},
>                                                           u'name': u'delete_messages',
>                                                           u'on-error': u'notify_zaqar',
>                                                           u'on-success': u'get_messages',
>                                                           u'publish': {u'status': u'SUCCESS'},
>                                                           u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                                 u'status': u'FAILED'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'},
>                                      u'format_messages': {u'action': u'tripleo.logging_to_swift.format_messages',
>                                                           u'input': {u'messages': u'<% $.messages %>'},
>                                                           u'name': u'format_messages',
>                                                           u'on-success': u'upload_to_swift',
>                                                           u'publish': {u'formatted_messages': u'<% task().result %>',
>                                                                        u'status': u'SUCCESS'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'},
>                                      u'get_messages': {u'action': u'zaqar.claim_messages',
>                                                        u'input': {u'grace': 60,
>                                                                   u'queue_name': u'<% $.logging_queue_name %>',
>                                                                   u'ttl': 60},
>                                                        u'name': u'get_messages',
>                                                        u'on-error': u'notify_zaqar',
>                                                        u'on-success': [{u'format_messages': u'<% task().result.len() > 0 %>'},
>                                                                        {u'notify_zaqar': u'<% task().result.len() = 0 %>'}],
>                                                        u'publish': {u'message_ids': u'<% task().result.select($._id) %>',
>                                                                     u'messages': u'<% task().result %>',
>                                                                     u'status': u'SUCCESS'},
>                                                        u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                              u'status': u'FAILED'},
>                                                        u'type': u'direct',
>                                                        u'version': u'2.0'},
>                                      u'notify_zaqar': {u'action': u'zaqar.queue_post',
>                                                        u'input': {u'messages': {u'body': {u'payload': {u'execution': u'<% execution() %>',
>                                                                                                        u'message': u"<% $.get('message', '') %>",
>                                                                                                        u'status': u'<% $.status %>'},
>                                                                                           u'type': u'tripleo.plan_management.v1.publish_ui_logs_to_swift'}},
>                                                                   u'queue_name': u'<% $.queue_name %>'},
>                                                        u'name': u'notify_zaqar',
>                                                        u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}],
>                                                        u'type': u'direct',
>                                                        u'version': u'2.0'},
>                                      u'start': {u'name': u'start',
>                                                 u'on-success': u'get_messages',
>                                                 u'type': u'direct',
>                                                 u'version': u'2.0'},
>                                      u'upload_to_swift': {u'action': u'tripleo.logging_to_swift.publish_ui_log_to_swift',
>                                                           u'input': {u'logging_container': u'<% $.logging_container %>',
>                                                                      u'logging_data': u'<% $.formatted_messages %>'},
>                                                           u'name': u'upload_to_swift',
>                                                           u'on-error': u'notify_zaqar',
>                                                           u'on-success': u'delete_messages',
>                                                           u'publish': {u'status': u'SUCCESS'},
>                                                           u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                                 u'status': u'FAILED'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'}},
>                           u'version': u'2.0'},
>                 u'updated_at': u'2018-03-06 08:08:36'},
>  u'message': u"Failed to run action [action_ex_id=e7186a9c-07c4-4344-948c-4543e63727be, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']\n ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..",
>  u'status': u'FAILED'}

Although no error visible in zaqar log (aside one occasion of bug #1538644 ).
Could be similar/related to bug #1518221.

> python-tripleoclient.noarch       9.1.1-0.20180220030014.59fe75b.el7ost @rhelosp-13.0-puddle    
> openstack-zaqar.noarch            1:6.0.0-0.20180220112259.c52e8e4.el7ost
> puppet-zaqar.noarch               12.3.1-0.20180221093329.8b554f3.el7ost
> python2-zaqarclient.noarch        1.9.0-0.20180211232746.c1689bf.el7ost

Comment 2 Omri Hochman 2018-03-06 13:18:30 UTC
reproduced on my BM 3 controller 1 compute 3 ceph (core_puddle: 2018-03-02.2): 


Environment:
-------------
openstack-mistral-engine-6.0.0-0.20180222191310.0cea32a.0rc1.el7ost.noarch


Steps:
Attempt to deploy Overcloud with latest osp13

Results
Deployment start and failed after 4 hours: 


| 011273a2-71e6-4d3a-a419-2b6641435a8a | overcloud  | CREATE_FAILED | 2018-02-20T21:17:37Z | None         | f89475a3df8544faaababdec38604bbf |

| AllNodesDeploySteps                       | d23842c4-9ea9-44dc-8fe2-ea63a610f2bb
                                                  | OS::TripleO::PostDeploySteps
   | CREATE_FAILED   | 2018-02-20T21:17:41Z | overcloud
                      |
| WorkflowTasks_Step2_Execution             | 76d3cf50-578e-49e0-8f87-837006495852
                                                  | OS::Mistral::ExternalResource
   | CREATE_FAILED   | 2018-02-20T21:42:28Z | overcloud-AllNodesDeploySteps-iwgm2ix4v6g4
                      |



[root@undercloud75 ~]# grep ERROR /var/log/mistral/executor.log
----------------------------------------------------------------
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor [req-6019b2e3-7c27-4c77-8570-bc5caa24ea79 495e807e120342698105ebce42b2a2c7 f89475a3df8544faaababdec38604bbf - default default] Failed to run action [action_ex_id=11bfe43e-7072-4211-8216-3a0a4cbea64b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor ActionException: ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor

Comment 5 Giulio Fidente 2018-03-07 13:24:48 UTC
*** Bug 1552101 has been marked as a duplicate of this bug. ***

Comment 6 Alex Schultz 2018-03-07 15:04:31 UTC
Just happened upstream in a check job as well.  Added launchpad issue.

Comment 8 Honza Pokorny 2018-03-09 13:30:23 UTC
In addition to the linked patch, we should also consider disabling mistral's cron tasks in ci.

Comment 9 Udi Kalifon 2018-03-15 15:27:39 UTC
How can we reproduce this? How should we test?

Comment 11 Honza Pokorny 2018-03-15 17:15:12 UTC
On the undercloud:

$ openstack messaging queue list

Make sure that tripleo-ui-logging isn't on the list.  If it is, delete it.  Close all GUI browser tabs.

$ mistral execution-create tripleo.plan_management.v1.publish_ui_logs_to_swift

Make sure that the execution results in SUCCESS.

Comment 12 Honza Pokorny 2018-03-22 09:33:37 UTC
*** Bug 1538644 has been marked as a duplicate of this bug. ***

Comment 13 mlammon 2018-03-22 15:08:33 UTC
I have seen this error on a couple of times on fresh deployment of osp13 
It seems to happen when we do introspection for the first time.
I don't usually see it again afterwards

(undercloud) [stack@undercloud ~]$ source /home/stack/stackrc && openstack overcloud node import /home/stack/instackenv.json --provide
Started Mistral Workflow tripleo.baremetal.v1.register_or_update. Execution ID: 6fe931bb-ba1e-4ea2-82f7-442d87d1aad0
Waiting for messages on queue 'tripleo' with no timeout.

Failed to run action [action_ex_id=9fd7115f-e9fc-4c3c-b97e-05f8116de97b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']
 ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..


Nodes set to managed.
Successfully registered node UUID 542dc1ca-278a-4c21-8a62-1a270a383a3d
Successfully registered node UUID d9d27b21-2522-45c1-b8d6-60c3353b16b0
Successfully registered node UUID 14f3427b-0d9c-43ea-ad16-8e8e3b6930f5
Successfully registered node UUID 86464e98-d51d-4462-8075-56259c6d549f
Started Mistral Workflow tripleo.baremetal.v1.provide. Execution ID: 0534b1c9-5897-4656-ad01-e5418fc6e03f


(undercloud) [stack@undercloud ~]$ openstack messaging queue list
+--------------------------------------+
| Name                                 |
+--------------------------------------+
| 83f24d2e-c8ed-4bc3-a1fd-0a32f8b2b810 |
| tripleo                              |
+--------------------------------------+

Comment 14 Pavel Sedlák 2018-03-26 11:05:36 UTC
These messages/tracebacks appear in more or less regular deployment workflow ... in my case as how infrared drives the deployment ... uc install, import nodes, images, container images, oc deploy ... observing then undercloud-0/var/log/zaqar/zaqar.log or mistral executor logs shows zaqar tracebacks.

Based on mlammon's comment 13 it seems even less then whole flow is enough, did not tried to confirm, but could if wanted.

Comment 16 Jon Schlueter 2018-03-29 06:34:12 UTC
part of a build now

Comment 18 Omri Hochman 2018-04-03 19:29:33 UTC
unable to reproduce with :
openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost.noarch

Comment 20 Thomas Hervé 2018-04-09 13:55:31 UTC
As there is a lot of confusion on that bug, I'll try to clarify it.

The error, for now (until we sync Zaqar queens), will continue to happen in mistral. Namely tripleo.plan_management.v1.publish_ui_logs_to_swift will continue to fail. You should *NOT* check its status as mentioned in #11.

The bug here is that the *deploy* failed because of this error. It shouldn't with the linked tripleo-common build.

Comment 26 errata-xmlrpc 2018-06-27 13:46:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086


Note You need to log in before you can comment on or make changes to this bug.