Bug 1498916

Summary: [UPDATES] update on all nodes finishes but mistral fails to receive notification
Product: Red Hat OpenStack Reporter: Lukas Bezdicka <lbezdick>
Component: openstack-tripleo-common Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA QA Contact: Yurii Prokulevych <yprokule>
Severity: high Docs Contact:
Priority: high    
Version: 12.0 (Pike) CC: augol, ccamacho, dmatthew, emacchi, jpichon, jschluet, lbezdick, mandreou, mbultel, mburns, sclewis, slinaber, therve, yprokule
Target Milestone: ga Keywords: Triaged
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-7.6.3-4.el7ost.noarch
Last Closed: 2017-12-13 22:13:08 UTC Type: Bug

Description Lukas Bezdicka 2017-10-05 14:34:44 UTC
Run update on all nodes:
(undercloud) [stack@undercloud-0 ~ (undercloud-12-TLV)]$ openstack overcloud update stack 
Started Mistral Workflow tripleo.package_update.v1.update_nodes. Execution ID: 0b455e17-611d-4cb3-91a2-714f13a3a30e
Waiting for messages on queue '04b0b808-da54-4d55-b01a-6bb13194ad71' with no timeout.
The update finished, but the client stayed stuck waiting for the Mistral execution. It was waiting for this message:

fig', u'type': u'direct'}}, u'name': u'update_nodes', u'tags': [u'tripleo-common-managed'], u'version': u'2.0', u'input': [{u'node_user': u'heat-admin'}, u'nodes', u'playbook', u'inventory_file', {u'queue_name': u'tripleo'}], u'description': u'Take a container and perform an update nodes by nodes'}}}}}}']
 ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
] (execution_id=0b455e17-611d-4cb3-91a2-714f13a3a30e)
2017-10-05 10:05:53.812 17734 DEBUG mistral.services.triggers [req-eac2a476-cb08-4d63-8ddb-c2dc788d3f6d 81c58aff164344a9811bddd511c739ea 9ed039ecceba4438b980339efe25a93a - default default] No JSON object could be decoded on_workflow_complete /usr/lib/python2.7/site-packages/mistral/services/triggers.py:239

Comment 1 Dougal Matthews 2017-10-11 12:41:13 UTC
Can you provide the Mistral logs for this? I'm having trouble tracking down the issue.

It looks like the workflow is attempting to send a message to Zaqar that is larger than the allowed limit. From reading the tripleo.package_update.v1 workflow and the custom action I can't figure out where that would come from.

I'm hoping that a traceback in the logs will provide more details.

Comment 3 Thomas Hervé 2017-11-02 20:30:28 UTC
The maximum message size is already set by instack. The message posted here is the result of the ansible/puppet upgrade run; it's about 1.2M, more than the 1M allowed. I suggest limiting the message size, something like this: http://paste.openstack.org/show/625389/ in tripleo-common.
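The suggested approach can be sketched roughly as follows. This is an illustration only, not the actual tripleo-common patch (that is behind the paste link above); the payload shape and the function name are assumptions based on the log format shown in the verification comment below.

```python
import json

# Zaqar's default maximum message collection size, per the 400 error above.
ZAQAR_MAX_MESSAGE_SIZE = 1048576  # bytes

def truncate_payload(payload, limit=ZAQAR_MAX_MESSAGE_SIZE):
    """Drop the oldest log lines until the serialized payload fits.

    The PLAY RECAP and any failures appear at the end of an Ansible
    run, so we discard lines from the front and keep the tail.
    """
    lines = list(payload.get('message', []))
    while lines and len(json.dumps(dict(payload, message=lines))) > limit:
        lines.pop(0)
    return dict(payload, message=lines)
```

A smarter variant could batch the drops instead of re-serializing per line, but for a one-off post before hitting Zaqar this is good enough to stay under the limit.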

That said, it's bad to have that much data transit through ansible/mistral. Long term, it'd be nice to either produce fewer logs, or push them to swift directly. There is also an unhealthy number of warnings produced by the puppet run.

Comment 4 Marios Andreou 2017-11-07 12:48:43 UTC
o/ thanks Thomas - yeah, agreed the truncate is not ideal. I've been holding off on posting the review to tripleo-common this morning, hoping someone would come up with a better way. I haven't heard one, so I'll post it in a moment anyway and we can take it from there.

Comment 5 Marios Andreou 2017-11-20 12:24:47 UTC
This is merged to Pike, so moving to POST.

Note that thankfully there is a better fix being tracked for https://bugzilla.redhat.com/show_bug.cgi?id=1505926 which will prevent these huge messages in the first place.

Comment 7 Jon Schlueter 2017-11-22 17:07:49 UTC
openstack-tripleo-common-7.6.3-4.el7ost

Comment 11 Yurii Prokulevych 2017-12-11 11:30:49 UTC
Verified with openstack-tripleo-common-7.6.3-8.el7ost.noarch

tail oc-update-*log
==> oc-update-00-Controller.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.20]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.15              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.17              : ok=114  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.20              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-CephStorage.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.18]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.14              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.18              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.9               : ok=56   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-Compute.log <==
 u'',
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.10]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.10              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.12              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

Comment 14 errata-xmlrpc 2017-12-13 22:13:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462