Bug 1381644 - Broken Pipe error for Zaqar, which causes Mistral failures
Summary: Broken Pipe error for Zaqar, which causes Mistral failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-zaqar
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z4
: 10.0 (Newton)
Assignee: Thomas Hervé
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-04 15:49 UTC by Dan Macpherson
Modified: 2017-09-06 17:06 UTC (History)
11 users

Fixed In Version: openstack-zaqar-3.0.0-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-06 17:06:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Output from "mistral execution-get" (15.01 KB, text/plain)
2016-10-04 15:50 UTC, Dan Macpherson


Links
System ID Private Priority Status Summary Last Updated
RDO 4941 0 None None None 2017-02-16 15:16:40 UTC
RDO 7184 0 None None None 2017-06-22 14:10:34 UTC
Red Hat Product Errata RHBA-2017:2653 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory 2017-09-06 20:54:38 UTC

Description Dan Macpherson 2016-10-04 15:49:05 UTC
Description of problem:

I have an existing OSP10 overcloud stack and ran "openstack overcloud deploy" to modify it (testing Controller replacement). However, when it reaches the part where it removes the existing plan and uploads the new plan via the tripleo.plan_management.v1.update_deployment_plan execution, the execution fails and "mistral execution-get" reports a traceback in the "State info" field. The traceback ends with the following:

YaqlEvaluationException: Can not evaluate YAQL expression:  $.status , data = {}; error: u'status'


I'll attach a copy of the "mistral execution-get" output to this BZ.
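For context, the failing YAQL expression is effectively a key lookup on the workflow's published context. A rough analogue (not Mistral's actual evaluation code, just an assumption-labelled sketch): if the notify_zaqar task runs before any publish step has populated the context, evaluating $.status is like indexing an empty dict.

```python
# Hypothetical sketch: "$.status" in YAQL is roughly a key lookup on the
# workflow context. With data = {} (as in the traceback), the lookup fails
# the same way a plain KeyError does in Python.
context = {}  # data = {} from the "mistral execution-get" output

try:
    status = context['status']  # analogue of evaluating <% $.status %>
except KeyError as exc:
    # mirrors: Can not evaluate YAQL expression: $.status ... error: u'status'
    print("Can not evaluate: error: %s" % exc)
```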

Version-Release number of selected component (if applicable):
OSP 10

openstack-mistral-api.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-common.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-engine.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-executor.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
puppet-mistral.noarch 9.4.0-0.20160928210707.3738e95.el7ost
python-mistral-tests.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
python-mistralclient.noarch 2.1.1-0.20160915221844.a37b5fe.el7ost
python-openstack-mistral.noarch 3.0.0-0.20160929083341.c0a4501.el7ost

How reproducible:
Always

Steps to Reproduce:
1. Run "openstack overcloud deploy" and wait for a successful deployment
2. Update the Heat templates
3. Rerun "openstack overcloud deploy" and wait for the tripleo.plan_management.v1.update_deployment_plan to fail

Actual results:


Expected results:


Additional info:

Comment 1 Dan Macpherson 2016-10-04 15:50:43 UTC
Created attachment 1207255 [details]
Output from "mistral execution-get"

Comment 2 Dan Macpherson 2016-10-04 16:08:05 UTC
Had a look at mistral's executor.log and found this around the same time as the execution attempt:

2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor [-] Failed to run action [action_ex_id=277e5da7-701e-4b15-add0-690e45d99e2a, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'queue_post'}', params='{u'queue_name': u'8d8c79ec-198e-4425-baf3-ceeccd5e7dfa', u'messages': {u'body': {u'type': u'tripleo.plan_management.v1.update_deployment_plan', u'payload': {u'status': u'SUCCESS', u'message': u'', u'execution': {u'input': {u'queue_name': u'8d8c79ec-198e-4425-baf3-ceeccd5e7dfa', u'container': u'overcloud'}, u'spec': {u'input': [u'container', {u'queue_name': u'tripleo'}], u'tasks': {u'update_plan': {u'name': u'update_plan', u'on-error': u'update_plan_set_status_failed', u'on-success': u'ensure_passwords_exist', u'version': u'2.0', u'action': u'tripleo.plan.update container=<% $.container %>', u'type': u'direct'}, u'ensure_passwords_exist': {u'name': u'ensure_passwords_exist', u'on-error': u'ensure_passwords_exist_set_status_failed', u'on-success': u'process_templates', u'version': u'2.0', u'action': u'tripleo.parameters.generate_passwords container=<% $.container %>', u'type': u'direct'}, u'set_status_success': {u'version': u'2.0', u'type': u'direct', u'name': u'set_status_success', u'publish': {u'status': u'SUCCESS', u'message': u'<% task(update_plan).result %>'}, u'on-success': u'notify_zaqar'}, u'update_plan_set_status_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'update_plan_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(update_plan).result %>'}, u'on-success': u'notify_zaqar'}, u'process_templates_set_status_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'process_templates_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(process_templates).result %>'}, u'on-success': u'notify_zaqar'}, u'notify_zaqar': {u'retry': u'count=5 delay=1', u'name': u'notify_zaqar', u'version': u'2.0', u'action': u'zaqar.queue_post', u'input': {u'queue_name': u'<% $.queue_name %>', u'messages': {u'body': {u'type': u'tripleo.plan_management.v1.update_deployment_plan', u'payload': {u'status': u'<% $.status %>', u'message': u"<% $.message or '' %>", u'execution': u'<% execution() %>'}}}}, u'type': u'direct'}, u'process_templates': {u'name': u'process_templates', u'on-error': u'process_templates_set_status_failed', u'on-success': u'set_status_success', u'version': u'2.0', u'action': u'tripleo.templates.process container=<% $.container %>', u'type': u'direct'}, u'ensure_passwords_exist_set_status_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'ensure_passwords_exist_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(ensure_passwords_exist).result %>'}, u'on-success': u'notify_zaqar'}}, u'name': u'update_deployment_plan', u'version': u'2.0'}, u'params': {}, u'id': u'94912446-aabf-43b4-a7ed-8622e68721db', u'name': u'tripleo.plan_management.v1.update_deployment_plan'}}}}}']
ZaqarAction.queue_post failed: <class 'requests.exceptions.ConnectionError'>: HTTPConnectionPool(host='192.168.201.5', port=8888): Max retries exceeded with url: /v2/queues/8d8c79ec-198e-4425-baf3-ceeccd5e7dfa/messages (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x6d90dd0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor Traceback (most recent call last):
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor   File "/usr/lib/python2.7/site-packages/mistral/engine/default_executor.py", line 90, in run_action
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor     result = action.run()
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 142, in run
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor     (self.__class__.__name__, self.client_method_name, e_str)
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor ActionException: ZaqarAction.queue_post failed: <class 'requests.exceptions.ConnectionError'>: HTTPConnectionPool(host='192.168.201.5', port=8888): Max retries exceeded with url: /v2/queues/8d8c79ec-198e-4425-baf3-ceeccd5e7dfa/messages (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x6d90dd0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor 

It seems to be an issue with connecting to Zaqar. I restarted both the Mistral and Zaqar services and the deployment went ahead with no problems.
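The "[Errno 111] ECONNREFUSED" in the traceback just means nothing was accepting connections on the Zaqar endpoint (192.168.201.5:8888) at that moment. A small stdlib sketch of the same failure mode, using a deliberately closed local port as a stand-in for the Zaqar endpoint (the probe helper is hypothetical, not part of Mistral or Zaqar):

```python
import socket

# Hypothetical probe: connecting to a port with no listener raises
# ConnectionRefusedError (errno 111 on Linux), which is what requests
# wraps into the NewConnectionError seen in the Mistral executor log.
def probe(host, port, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError as exc:
        return "refused (errno %d)" % exc.errno

# Find a port guaranteed to have no listener: bind, note it, close it.
s = socket.socket()
s.bind(("127.0.0.1", 0))
closed_port = s.getsockname()[1]
s.close()

print(probe("127.0.0.1", closed_port))
```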

Comment 3 Dan Macpherson 2016-10-04 16:12:48 UTC
Okay, this appears to be a problem with zaqar. I experienced a Broken Pipe error. I've noticed several Broken Pipe errors at certain points, which explains why I've been having other deployment issues with OSP 10.

All the broken pipe errors follow the same format: 

1) Warning: Failed to increment the message counter for queue
2) Broken Pipe

For example:

2016-10-04 15:20:11.532 1404 WARNING zaqar.storage.mongodb.queues [(None,) fa56eb6a60fb4291a0242e482b8d7c0d e1b8d2ae29b947cc946383a45f5a680c - - -] [project_id:e1b8d2ae29b947cc946383a45f5a680c] Failed to increment the message counter for queue a585b0fc-8a93-441f-8e16-7c8218f34190 and project e1b8d2ae29b947cc946383a45f5a680c
2016-10-04 15:20:15.405 1404 CRITICAL zaqar [(None,) fa56eb6a60fb4291a0242e482b8d7c0d e1b8d2ae29b947cc946383a45f5a680c - - -] [project_id:e1b8d2ae29b947cc946383a45f5a680c] IOError: [Errno 32] Broken pipe
2016-10-04 15:20:15.405 1404 ERROR zaqar Traceback (most recent call last):
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/bin/zaqar-server", line 10, in <module>
2016-10-04 15:20:15.405 1404 ERROR zaqar     sys.exit(run())
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 58, in _wrapper
2016-10-04 15:20:15.405 1404 ERROR zaqar     _fail(1, ex)
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 36, in _fail
2016-10-04 15:20:15.405 1404 ERROR zaqar     print(ex, file=sys.stderr)
2016-10-04 15:20:15.405 1404 ERROR zaqar IOError: [Errno 32] Broken pipe
2016-10-04 15:20:15.405 1404 ERROR zaqar


I'll attach the full zaqar.log file to this BZ.
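For anyone triaging: "IOError: [Errno 32] Broken pipe" (EPIPE) is raised when a process writes to a pipe or socket whose other end has already gone away. A minimal stdlib sketch of that failure mode (an assumption about the general mechanism, not a reproduction of zaqar-server's internals):

```python
import os

# Sketch of errno 32: writing into a pipe whose read end is closed raises
# BrokenPipeError (EPIPE) -- analogous to zaqar-server writing to a peer
# that has already disconnected.
r, w = os.pipe()
os.close(r)  # the reader disappears, as when the client drops mid-write

try:
    os.write(w, b"message")
except BrokenPipeError as exc:
    print("[Errno %d] Broken pipe" % exc.errno)
finally:
    os.close(w)
```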

Comment 8 Ronnie Rasouli 2017-09-06 10:30:42 UTC
Followed:
https://github.com/systemd/systemd/issues/2236
performed
systemctl restart systemd-journald

No Mistral errors in the logs, nor from the overcloud deploy command.

Comment 9 errata-xmlrpc 2017-09-06 17:06:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2653

