Bug 1381644

Summary: Broken Pipe error for Zaqar, which causes Mistral failures
Product: Red Hat OpenStack
Reporter: Dan Macpherson <dmacpher>
Component: openstack-zaqar
Assignee: Thomas Hervé <therve>
Status: CLOSED ERRATA
QA Contact: Amit Ugol <augol>
Severity: high
Docs Contact:
Priority: medium
Version: 10.0 (Newton)
CC: apevec, augol, jjoyce, jschluet, lhh, mcornea, opavlenk, rrasouli, slinaber, tvignaud, zbitter
Target Milestone: z4
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-zaqar-3.0.0-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-06 17:06:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: Output from "mistral execution-get" (flags: none)

Description Dan Macpherson 2016-10-04 15:49:05 UTC
Description of problem:

I have an existing OSP 10 overcloud stack and ran "openstack overcloud deploy" to modify it (testing Controller replacement). However, when the deployment reaches the point where it removes the existing plan and uploads the new one via the tripleo.plan_management.v1.update_deployment_plan execution, that execution fails, and "mistral execution-get" reports a traceback in the "State info" field. The traceback ends with the following:

YaqlEvaluationException: Can not evaluate YAQL expression:  $.status , data = {}; error: u'status'


I'll attach a copy of the "mistral execution-get" output to this BZ.
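
For reference, this is roughly how the failed execution can be located and inspected on the undercloud with the standard Mistral CLI (the execution ID below is a placeholder):

  # list recent workflow executions and find the failed plan update
  mistral execution-list | grep update_deployment_plan
  # dump the full execution record, including the "State info" traceback quoted above
  mistral execution-get <execution-id>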

Version-Release number of selected component (if applicable):
OSP 10

openstack-mistral-api.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-common.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-engine.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
openstack-mistral-executor.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
puppet-mistral.noarch 9.4.0-0.20160928210707.3738e95.el7ost
python-mistral-tests.noarch 3.0.0-0.20160929083341.c0a4501.el7ost
python-mistralclient.noarch 2.1.1-0.20160915221844.a37b5fe.el7ost
python-openstack-mistral.noarch 3.0.0-0.20160929083341.c0a4501.el7ost

How reproducible:
Always

Steps to Reproduce:
1. Run "openstack overcloud deploy" and wait for a successful deployment
2. Update the Heat templates
3. Rerun "openstack overcloud deploy" and wait for the tripleo.plan_management.v1.update_deployment_plan execution to fail (example commands below)
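
A rough sketch of those commands (any template or environment-file arguments are assumptions; substitute whatever the original deployment used):

  # 1. initial deployment
  openstack overcloud deploy --templates
  # 2. edit the Heat templates / environment files the plan uses
  # 3. rerun the same deploy command and watch the plan update execution fail
  openstack overcloud deploy --templates
  mistral execution-list | grep update_deployment_plan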

Actual results:


Expected results:


Additional info:

Comment 1 Dan Macpherson 2016-10-04 15:50:43 UTC
Created attachment 1207255 [details]
Output from "mistral execution-get"

Comment 2 Dan Macpherson 2016-10-04 16:08:05 UTC
Had a look at mistral's executor.log and found this around the same time as the execution attempt:

2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor [-] Failed to run action [action_ex_id=277e5da7-701e-4b15-add0-690e45d99e2a, 
action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'queue_post'}', params='{u'queue_name': 
u'8d8c79ec-198e-4425-baf3-ceeccd5e7dfa', u'messages': {u'body': {u'type': u'tripleo.plan_management.v1.update_deployment_plan', u'payload': {u's
tatus': u'SUCCESS', u'message': u'', u'execution': {u'input': {u'queue_name': u'8d8c79ec-198e-4425-baf3-ceeccd5e7dfa', u'container': u'overcloud
'}, u'spec': {u'input': [u'container', {u'queue_name': u'tripleo'}], u'tasks': {u'update_plan': {u'name': u'update_plan', u'on-error': u'update_
plan_set_status_failed', u'on-success': u'ensure_passwords_exist', u'version': u'2.0', u'action': u'tripleo.plan.update container=<% $.container
 %>', u'type': u'direct'}, u'ensure_passwords_exist': {u'name': u'ensure_passwords_exist', u'on-error': u'ensure_passwords_exist_set_status_fail
ed', u'on-success': u'process_templates', u'version': u'2.0', u'action': u'tripleo.parameters.generate_passwords container=<% $.container %>', u
'type': u'direct'}, u'set_status_success': {u'version': u'2.0', u'type': u'direct', u'name': u'set_status_success', u'publish': {u'status': u'SU
CCESS', u'message': u'<% task(update_plan).result %>'}, u'on-success': u'notify_zaqar'}, u'update_plan_set_status_failed': {u'version': u'2.0', 
u'type': u'direct', u'name': u'update_plan_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(update_plan).result %>'}
, u'on-success': u'notify_zaqar'}, u'process_templates_set_status_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'process_templates
_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(process_templates).result %>'}, u'on-success': u'notify_zaqar'}, u
'notify_zaqar': {u'retry': u'count=5 delay=1', u'name': u'notify_zaqar', u'version': u'2.0', u'action': u'zaqar.queue_post', u'input': {u'queue_
name': u'<% $.queue_name %>', u'messages': {u'body': {u'type': u'tripleo.plan_management.v1.update_deployment_plan', u'payload': {u'status': u'<
% $.status %>', u'message': u"<% $.message or '' %>", u'execution': u'<% execution() %>'}}}}, u'type': u'direct'}, u'process_templates': {u'name
': u'process_templates', u'on-error': u'process_templates_set_status_failed', u'on-success': u'set_status_success', u'version': u'2.0', u'action
': u'tripleo.templates.process container=<% $.container %>', u'type': u'direct'}, u'ensure_passwords_exist_set_status_failed': {u'version': u'2.
0', u'type': u'direct', u'name': u'ensure_passwords_exist_set_status_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(ensure_pa
sswords_exist).result %>'}, u'on-success': u'notify_zaqar'}}, u'name': u'update_deployment_plan', u'version': u'2.0'}, u'params': {}, u'id': u'94912446-aabf-43b4-a7ed-8622e68721db', u'name': u'tripleo.plan_management.v1.update_deployment_plan'}}}}}']
 ZaqarAction.queue_post failed: <class 'requests.exceptions.ConnectionError'>: HTTPConnectionPool(host='192.168.201.5', port=8888): Max retries exceeded with url: /v2/queues/8d8c79ec-198e-4425-baf3-ceeccd5e7dfa/messages (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x6d90dd0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor Traceback (most recent call last):
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor   File "/usr/lib/python2.7/site-packages/mistral/engine/default_executor.py", line 90, in run_action
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor     result = action.run()
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 142, in run
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor     (self.__class__.__name__, self.client_method_name, e_str)
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor ActionException: ZaqarAction.queue_post failed: <class 'requests.exceptions.ConnectionError'>: HTTPConnectionPool(host='192.168.201.5', port=8888): Max retries exceeded with url: /v2/queues/8d8c79ec-198e-4425-baf3-ceeccd5e7dfa/messages (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x6d90dd0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2016-10-05 01:33:47.596 1388 ERROR mistral.engine.default_executor 

It seems to be an issue with connecting to Zaqar. I restarted both the Mistral and Zaqar services, and the deployment then went ahead with no problems.
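
A sketch of that workaround, assuming the default OSP 10 undercloud unit names (verify with "systemctl list-units openstack-*" first, as the names can differ between installs):

  sudo systemctl restart openstack-zaqar.service
  sudo systemctl restart openstack-mistral-api.service openstack-mistral-engine.service openstack-mistral-executor.service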

Comment 3 Dan Macpherson 2016-10-04 16:12:48 UTC
Okay, this appears to be a problem with Zaqar itself: I'm seeing a Broken Pipe error. In fact, there are several Broken Pipe errors at various points in the log, which would explain the other deployment issues I've been having with OSP 10.

All the broken pipe errors follow the same format: 

1) Warning: Failed to increment the message counter for queue
2) Broken Pipe

For example:

2016-10-04 15:20:11.532 1404 WARNING zaqar.storage.mongodb.queues [(None,) fa56eb6a60fb4291a0242e482b8d7c0d e1b8d2ae29b947cc946383a45f5a680c - - -] [project_id:e1b8d2ae29b947cc946383a45f5a680c] Failed to increment the message counter for queue a585b0fc-8a93-441f-8e16-7c8218f34190 and project e1b8d2ae29b947cc946383a45f5a680c
2016-10-04 15:20:15.405 1404 CRITICAL zaqar [(None,) fa56eb6a60fb4291a0242e482b8d7c0d e1b8d2ae29b947cc946383a45f5a680c - - -] [project_id:e1b8d2ae29b947cc946383a45f5a680c] IOError: [Errno 32] Broken pipe
2016-10-04 15:20:15.405 1404 ERROR zaqar Traceback (most recent call last):
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/bin/zaqar-server", line 10, in <module>
2016-10-04 15:20:15.405 1404 ERROR zaqar     sys.exit(run())
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 58, in _wrapper
2016-10-04 15:20:15.405 1404 ERROR zaqar     _fail(1, ex)
2016-10-04 15:20:15.405 1404 ERROR zaqar   File "/usr/lib/python2.7/site-packages/zaqar/common/cli.py", line 36, in _fail
2016-10-04 15:20:15.405 1404 ERROR zaqar     print(ex, file=sys.stderr)
2016-10-04 15:20:15.405 1404 ERROR zaqar IOError: [Errno 32] Broken pipe
2016-10-04 15:20:15.405 1404 ERROR zaqar


I'll attach the full zaqar.log file to this BZ.
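
For anyone hitting the same pattern, a quick check along these lines should show the warning/crash pairs (the log path is assumed to be the undercloud default):

  sudo grep -E "Failed to increment the message counter|Broken pipe" /var/log/zaqar/zaqar.log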

Comment 8 Ronnie Rasouli 2017-09-06 10:30:42 UTC
Followed:
https://github.com/systemd/systemd/issues/2236

and performed:
systemctl restart systemd-journald

No Mistral errors in the logs or during the overcloud deploy command.
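
A minimal sketch of that verification, assuming the default Mistral log locations on the undercloud:

  # restart journald as described in the linked systemd issue
  sudo systemctl restart systemd-journald
  # rerun the deployment, then confirm nothing new was logged by Mistral
  sudo grep -iE "error|traceback" /var/log/mistral/engine.log /var/log/mistral/executor.log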

Comment 9 errata-xmlrpc 2017-09-06 17:06:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2653