Bug 1498916 - [UPDATES] update on all nodes finishes but mistral fails to receive notification [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: Unspecified Unspecified
Priority: high Severity: high
Target Milestone: ga
Target Release: 12.0 (Pike)
Assigned To: Marios Andreou
QA Contact: Yurii Prokulevych
Keywords: Triaged
Depends On:
Blocks:
Reported: 2017-10-05 10:34 EDT by Lukas Bezdicka
Modified: 2018-02-05 14:15 EST (History)

See Also:
Fixed In Version: openstack-tripleo-common-7.6.3-4.el7ost.noarch
Last Closed: 2017-12-13 17:13:08 EST
Type: Bug
dmatthew: needinfo? (lbezdick)


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Launchpad | 1730663 | None | None | None | 2017-11-07 08:32 EST
OpenStack gerrit | 520571 | None | None | None | 2017-11-17 02:20 EST
Red Hat Product Errata | RHEA-2017:3462 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 12.0 Enhancement Advisory | 2018-02-15 20:43:25 EST

Description Lukas Bezdicka 2017-10-05 10:34:44 EDT
Run update on all nodes:
(undercloud) [stack@undercloud-0 ~ (undercloud-12-TLV)]$ openstack overcloud update stack 
Started Mistral Workflow tripleo.package_update.v1.update_nodes. Execution ID: 0b455e17-611d-4cb3-91a2-714f13a3a30e
Waiting for messages on queue '04b0b808-da54-4d55-b01a-6bb13194ad71' with no timeout.
The update finished, but the command gets stuck waiting for the Mistral execution. It was waiting for this message:

fig', u'type': u'direct'}}, u'name': u'update_nodes', u'tags': [u'tripleo-common-managed'], u'version': u'2.0', u'input': [{u'node_user': u'heat-admin'}, u'nodes', u'playbook', u'inventory_file', {u'queue_name': u'tripleo'}], u'description': u'Take a container and perform an update nodes by nodes'}}}}}}']
 ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
] (execution_id=0b455e17-611d-4cb3-91a2-714f13a3a30e)
2017-10-05 10:05:53.812 17734 DEBUG mistral.services.triggers [req-eac2a476-cb08-4d63-8ddb-c2dc788d3f6d 81c58aff164344a9811bddd511c739ea 9ed039ecceba4438b980339efe25a93a - default default] No JSON object could be decoded on_workflow_complete /usr/lib/python2.7/site-packages/mistral/services/triggers.py:239
Comment 1 Dougal Matthews 2017-10-11 08:41:13 EDT
Can you provide the Mistral logs for this? I'm having trouble tracking down the issue.

It looks like the workflow is attempting to send a message to Zaqar that is larger than the allowed limit. From reading the tripleo.package_update.v1 workflow and the custom action I can't figure out where that would come from.

I'm hoping that a traceback in the logs will provide more details.
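
A minimal sketch of how one might check whether a payload would exceed the limit quoted in the error above. The constant comes from the "Max size 1048576" text in the log; the helper names and the sample payload are hypothetical, and measuring the UTF-8 JSON encoding is only an assumption about how the size is counted on the Zaqar side.

```python
import json

# Limit quoted in the Zaqar error in the description ("Max size 1048576").
ZAQAR_MAX_BYTES = 1048576


def encoded_size(body):
    """Return the size in bytes of body once serialized as UTF-8 JSON."""
    return len(json.dumps(body).encode("utf-8"))


def fits_in_queue(body, limit=ZAQAR_MAX_BYTES):
    """True if the serialized body is within the limit (assumed measure)."""
    return encoded_size(body) <= limit


# Hypothetical payload carrying roughly 1.2M of log output.
payload = {"status": "SUCCESS", "message": "x" * 1200000}
print(encoded_size(payload), fits_in_queue(payload))
```

Logging the encoded size just before posting would show which step of a workflow produces the oversized message.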
Comment 3 Thomas Hervé 2017-11-02 16:30:28 EDT
The message size is already set by instack. The message posted here is the result of the ansible/puppet upgrade run; it's about 1.2M, more than the 1M allowed. I suggest limiting the message in tripleo-common, something like this: http://paste.openstack.org/show/625389/

That said, it's bad to have that much data transit through ansible/mistral. Long term, it'd be nice to either produce fewer logs or push them to swift directly. There is also an unhealthy amount of warnings produced by the puppet run.
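
A rough sketch of the kind of truncation suggested above, not the content of the linked paste or of the patch that was eventually merged. It assumes the log output is a list of strings, keeps the most recent lines (on the assumption that the end of the run, such as the play recap, matters most), and reserves a hypothetical headroom for the rest of the message envelope. The function and constant names are illustrative.

```python
import json

# Limit quoted in the Zaqar error ("Max size 1048576").
ZAQAR_MAX_BYTES = 1048576
# Hypothetical allowance for the other fields of the message.
HEADROOM = 4096


def truncate_log_lines(lines, budget=ZAQAR_MAX_BYTES - HEADROOM):
    """Keep the newest lines whose JSON encoding fits within budget bytes."""
    kept = []
    used = 2  # the enclosing "[" and "]"
    for line in reversed(lines):
        # Encoded line plus the ", " separator (a slight overestimate).
        cost = len(json.dumps(line).encode("utf-8")) + 2
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical oversized log: about 3M once encoded.
lines = ["line %d" % i for i in range(200000)]
trimmed = truncate_log_lines(lines)
print(len(lines), len(trimmed), len(json.dumps(trimmed).encode("utf-8")))
```

Dropping whole lines from the front keeps the payload valid and readable, at the cost of losing the start of the run; pushing the full log to swift, as mentioned above, would avoid that loss.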
Comment 4 Marios Andreou 2017-11-07 07:48:43 EST
o/ thanks Thomas - yeah, I agree the truncation is not ideal, and I have been holding off on posting the review to tripleo-common this morning, hoping someone would come up with a better way. I haven't heard one, so I'll post it in a moment anyway and we can take it from there.
Comment 5 Marios Andreou 2017-11-20 07:24:47 EST
This is merged to pike, so moving to POST.

Note that, thankfully, there is a better fix being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1505926, which will prevent these huge messages in the first place.
Comment 7 Jon Schlueter 2017-11-22 12:07:49 EST
openstack-tripleo-common-7.6.3-4.el7ost
Comment 11 Yurii Prokulevych 2017-12-11 06:30:49 EST
Verified with openstack-tripleo-common-7.6.3-8.el7ost.noarch

tail oc-update-*log
==> oc-update-00-Controller.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.20]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.15              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.17              : ok=114  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.20              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-CephStorage.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.18]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.14              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.18              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.9               : ok=56   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-Compute.log <==
 u'',
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.10]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.10              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.12              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success
Comment 14 errata-xmlrpc 2017-12-13 17:13:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462
