Bug 1552008

Summary: OC deploy fails with ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable.
Product: Red Hat OpenStack Reporter: Pavel Sedlák <psedlak>
Component: openstack-tripleo-commonAssignee: Honza Pokorny <hpokorny>
Status: CLOSED ERRATA QA Contact: Yogev Rabl <yrabl>
Severity: urgent Docs Contact:
Priority: high    
Version: 13.0 (Queens)CC: agurenko, apannu, aschultz, beth.white, cgoncalves, gfidente, hbrock, hpokorny, johfulto, jomurphy, jrist, jschluet, jslagle, mburns, mcornea, mlammon, ohochman, psedlak, rhel-osp-director-maint, sasha, skatlapa, slinaber, therve, ukalifon, wznoinsk, yprokule
Target Milestone: betaKeywords: Automation, Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:46:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Pavel Sedlák 2018-03-06 10:28:44 UTC
Occasionally (atm seems quite rare, did not reproduced again in few runs)  `openstack overcloud deploy` command fails after 3 minutes
with output:

> 2018-03-06 03:05:41.660 8512 WARNING tripleoclient.plugin [  admin] Waiting for messages on queue 'tripleo' with no timeout.
> 2018-03-06 03:08:43.103 8512 ERROR openstack [  admin] {u'execution': {u'created_at': u'2018-03-06 08:08:36',
>                 u'id': u'7d1b05de-2c5b-4bbe-a484-1d09887ad03f',
>                 u'input': {u'logging_container': u'tripleo-ui-logs',
>                            u'logging_queue_name': u'tripleo-ui-logging',
>                            u'queue_name': u'tripleo'},
>                 u'name': u'tripleo.plan_management.v1.publish_ui_logs_to_swift',
>                 u'params': {u'namespace': u''},
>                 u'spec': {u'description': u'This workflow drains a zaqar queue, and publish its messages into a log file in swift.  This workflow is called by cron trigger.\n',
>                           u'input': [{u'queue_name': u'tripleo'},
>                                      {u'logging_queue_name': u'tripleo-ui-logging'},
>                                      {u'logging_container': u'tripleo-ui-logs'}],
>                           u'name': u'publish_ui_logs_to_swift',
>                           u'tags': [u'tripleo-common-managed'],
>                           u'tasks': {u'delete_messages': {u'action': u'zaqar.delete_messages',
>                                                           u'input': {u'messages': u'<% $.message_ids %>',
>                                                                      u'queue_name': u'<% $.logging_queue_name %>'},
>                                                           u'name': u'delete_messages',
>                                                           u'on-error': u'notify_zaqar',
>                                                           u'on-success': u'get_messages',
>                                                           u'publish': {u'status': u'SUCCESS'},
>                                                           u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                                 u'status': u'FAILED'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'},
>                                      u'format_messages': {u'action': u'tripleo.logging_to_swift.format_messages',
>                                                           u'input': {u'messages': u'<% $.messages %>'},
>                                                           u'name': u'format_messages',
>                                                           u'on-success': u'upload_to_swift',
>                                                           u'publish': {u'formatted_messages': u'<% task().result %>',
>                                                                        u'status': u'SUCCESS'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'},
>                                      u'get_messages': {u'action': u'zaqar.claim_messages',
>                                                        u'input': {u'grace': 60,
>                                                                   u'queue_name': u'<% $.logging_queue_name %>',
>                                                                   u'ttl': 60},
>                                                        u'name': u'get_messages',
>                                                        u'on-error': u'notify_zaqar',
>                                                        u'on-success': [{u'format_messages': u'<% task().result.len() > 0 %>'},
>                                                                        {u'notify_zaqar': u'<% task().result.len() = 0 %>'}],
>                                                        u'publish': {u'message_ids': u'<% task().result.select($._id) %>',
>                                                                     u'messages': u'<% task().result %>',
>                                                                     u'status': u'SUCCESS'},
>                                                        u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                              u'status': u'FAILED'},
>                                                        u'type': u'direct',
>                                                        u'version': u'2.0'},
>                                      u'notify_zaqar': {u'action': u'zaqar.queue_post',
>                                                        u'input': {u'messages': {u'body': {u'payload': {u'execution': u'<% execution() %>',
>                                                                                                        u'message': u"<% $.get('message', '') %>",
>                                                                                                        u'status': u'<% $.status %>'},
>                                                                                           u'type': u'tripleo.plan_management.v1.publish_ui_logs_to_swift'}},
>                                                                   u'queue_name': u'<% $.queue_name %>'},
>                                                        u'name': u'notify_zaqar',
>                                                        u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}],
>                                                        u'type': u'direct',
>                                                        u'version': u'2.0'},
>                                      u'start': {u'name': u'start',
>                                                 u'on-success': u'get_messages',
>                                                 u'type': u'direct',
>                                                 u'version': u'2.0'},
>                                      u'upload_to_swift': {u'action': u'tripleo.logging_to_swift.publish_ui_log_to_swift',
>                                                           u'input': {u'logging_container': u'<% $.logging_container %>',
>                                                                      u'logging_data': u'<% $.formatted_messages %>'},
>                                                           u'name': u'upload_to_swift',
>                                                           u'on-error': u'notify_zaqar',
>                                                           u'on-success': u'delete_messages',
>                                                           u'publish': {u'status': u'SUCCESS'},
>                                                           u'publish-on-error': {u'message': u'<% task().result %>',
>                                                                                 u'status': u'FAILED'},
>                                                           u'type': u'direct',
>                                                           u'version': u'2.0'}},
>                           u'version': u'2.0'},
>                 u'updated_at': u'2018-03-06 08:08:36'},
>  u'message': u"Failed to run action [action_ex_id=e7186a9c-07c4-4344-948c-4543e63727be, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']\n ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..",
>  u'status': u'FAILED'}

Although no error visible in zaqar log (aside one occasion of bug #1538644 ).
Could be similar/related to bug #1518221.

> python-tripleoclient.noarch       9.1.1-0.20180220030014.59fe75b.el7ost @rhelosp-13.0-puddle    
> openstack-zaqar.noarch            1:6.0.0-0.20180220112259.c52e8e4.el7ost
> puppet-zaqar.noarch               12.3.1-0.20180221093329.8b554f3.el7ost
> python2-zaqarclient.noarch        1.9.0-0.20180211232746.c1689bf.el7ost

Comment 2 Omri Hochman 2018-03-06 13:18:30 UTC
reproduced on my BM 3 controller 1 compute 3 ceph (core_puddle: 2018-03-02.2): 


Environment:
-------------
openstack-mistral-engine-6.0.0-0.20180222191310.0cea32a.0rc1.el7ost.noarch


Steps:
Attempt to deploy Overcloud with latest osp13

Results
Deployment start and failed after 4 hours: 


| 011273a2-71e6-4d3a-a419-2b6641435a8a | overcloud  | CREATE_FAILED | 2018-02-20T21:17:37Z | None         | f89475a3df8544faaababdec38604bbf |

| AllNodesDeploySteps                       | d23842c4-9ea9-44dc-8fe2-ea63a610f2bb
                                                  | OS::TripleO::PostDeploySteps
   | CREATE_FAILED   | 2018-02-20T21:17:41Z | overcloud
                      |
| WorkflowTasks_Step2_Execution             | 76d3cf50-578e-49e0-8f87-837006495852
                                                  | OS::Mistral::ExternalResource
   | CREATE_FAILED   | 2018-02-20T21:42:28Z | overcloud-AllNodesDeploySteps-iwgm2ix4v6g4
                      |



[root@undercloud75 ~]# grep ERROR /var/log/mistral/executor.log
----------------------------------------------------------------
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor [req-6019b2e3-7c27-4c77-8570-bc5caa24ea79 495e807e120342698105ebce42b2a2c7 f89475a3df8544faaababdec38604bbf - default default] Failed to run action [action_ex_id=11bfe43e-7072-4211-8216-3a0a4cbea64b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor Traceback (most recent call last):
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor ActionException: ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..
2018-02-21 04:08:52.366 12346 ERROR mistral.executors.default_executor

Comment 5 Giulio Fidente 2018-03-07 13:24:48 UTC
*** Bug 1552101 has been marked as a duplicate of this bug. ***

Comment 6 Alex Schultz 2018-03-07 15:04:31 UTC
Just happened upstream in a check job as well.  Added launchpad issue.

Comment 8 Honza Pokorny 2018-03-09 13:30:23 UTC
In addition to the linked patch, we should also consider disabling mistral's cron tasks in ci.

Comment 9 Udi Kalifon 2018-03-15 15:27:39 UTC
How can we reproduce this? How should we test?

Comment 11 Honza Pokorny 2018-03-15 17:15:12 UTC
On the undercloud:

$ openstack messaging queue list

Make sure that tripleo-ui-logging isn't on the list.  If it is, delete it.  Close all GUI browser tabs.

$ mistral execution-create tripleo.plan_management.v1.publish_ui_logs_to_swift

Make sure that the execution results in SUCCESS.

Comment 12 Honza Pokorny 2018-03-22 09:33:37 UTC
*** Bug 1538644 has been marked as a duplicate of this bug. ***

Comment 13 mlammon 2018-03-22 15:08:33 UTC
I have seen this error on a couple of times on fresh deployment of osp13 
It seems to happen when we do introspection for the first time.
I don't usually see it again afterwards

(undercloud) [stack@undercloud ~]$ source /home/stack/stackrc && openstack overcloud node import /home/stack/instackenv.json --provide
Started Mistral Workflow tripleo.baremetal.v1.register_or_update. Execution ID: 6fe931bb-ba1e-4ea2-82f7-442d87d1aad0
Waiting for messages on queue 'tripleo' with no timeout.

Failed to run action [action_ex_id=9fd7115f-e9fc-4c3c-b97e-05f8116de97b, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']
 ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..


Nodes set to managed.
Successfully registered node UUID 542dc1ca-278a-4c21-8a62-1a270a383a3d
Successfully registered node UUID d9d27b21-2522-45c1-b8d6-60c3353b16b0
Successfully registered node UUID 14f3427b-0d9c-43ea-ad16-8e8e3b6930f5
Successfully registered node UUID 86464e98-d51d-4462-8075-56259c6d549f
Started Mistral Workflow tripleo.baremetal.v1.provide. Execution ID: 0534b1c9-5897-4656-ad01-e5418fc6e03f


(undercloud) [stack@undercloud ~]$ openstack messaging queue list
+--------------------------------------+
| Name                                 |
+--------------------------------------+
| 83f24d2e-c8ed-4bc3-a1fd-0a32f8b2b810 |
| tripleo                              |
+--------------------------------------+

Comment 14 Pavel Sedlák 2018-03-26 11:05:36 UTC
These messages/tracebacks appear in more or less regular deployment workflow ... in my case as how infrared drives the deployment ... uc install, import nodes, images, container images, oc deploy ... observing then undercloud-0/var/log/zaqar/zaqar.log or mistral executor logs shows zaqar tracebacks.

Based on mlammon's comment 13 it seems even less then whole flow is enough, did not tried to confirm, but could if wanted.

Comment 16 Jon Schlueter 2018-03-29 06:34:12 UTC
part of a build now

Comment 18 Omri Hochman 2018-04-03 19:29:33 UTC
unable to reproduce with :
openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost.noarch

Comment 20 Thomas Hervé 2018-04-09 13:55:31 UTC
As there is a lot of confusion on that bug, I'll try to clarify it.

The error, for now (until we sync Zaqar queens), will continue to happen in mistral. Namely tripleo.plan_management.v1.publish_ui_logs_to_swift will continue to fail. You should *NOT* check its status as mentioned in #11.

The bug here is that the *deploy* failed because of this error. It shouldn't with the linked tripleo-common build.

Comment 26 errata-xmlrpc 2018-06-27 13:46:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086