Bug 1498916 - [UPDATES] update on all nodes finishes but mistral fails to receive notification
Summary: [UPDATES] update on all nodes finishes but mistral fails to receive notification
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Marios Andreou
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-10-05 14:34 UTC by Lukas Bezdicka
Modified: 2023-02-22 23:02 UTC (History)
14 users

Fixed In Version: openstack-tripleo-common-7.6.3-4.el7ost.noarch
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:13:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1730663 0 None None None 2017-11-07 13:32:08 UTC
OpenStack gerrit 520571 0 None MERGED Truncate the zaqar message to 512 kbytes 2020-12-23 11:55:53 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Lukas Bezdicka 2017-10-05 14:34:44 UTC
Run update on all nodes:
(undercloud) [stack@undercloud-0 ~ (undercloud-12-TLV)]$ openstack overcloud update stack 
Started Mistral Workflow tripleo.package_update.v1.update_nodes. Execution ID: 0b455e17-611d-4cb3-91a2-714f13a3a30e
Waiting for messages on queue '04b0b808-da54-4d55-b01a-6bb13194ad71' with no timeout.
Update finished but it gets stuck waiting for mistral execution. It was waiting for message:

fig', u'type': u'direct'}}, u'name': u'update_nodes', u'tags': [u'tripleo-common-managed'], u'version': u'2.0', u'input': [{u'node_user': u'heat-admin'}, u'nodes', u'playbook', u'inventory_file', {u'queue_name': u'tripleo'}], u'description': u'Take a container and perform an update nodes by nodes'}}}}}}']
 ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
] (execution_id=0b455e17-611d-4cb3-91a2-714f13a3a30e)
2017-10-05 10:05:53.812 17734 DEBUG mistral.services.triggers [req-eac2a476-cb08-4d63-8ddb-c2dc788d3f6d 81c58aff164344a9811bddd511c739ea 9ed039ecceba4438b980339efe25a93a - default default] No JSON object could be decoded on_workflow_complete /usr/lib/python2.7/site-packages/mistral/services/triggers.py:239
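The failure above is Zaqar's server-side post-size limit: `queue_post` is rejected with HTTP 400 once the serialized message collection exceeds 1048576 bytes (Zaqar's default `max_messages_post_size`). A minimal pre-check illustrating the condition the server enforces; the function name `fits_in_queue_post` is an assumption for this sketch, not tripleo-common or zaqarclient code:

```python
import json

# Zaqar's default max_messages_post_size, in bytes. The 400 error in the
# log above fires when the serialized message collection exceeds this.
ZAQAR_MAX_POST_SIZE = 1048576

def fits_in_queue_post(messages):
    """Return True if a list of message bodies would fit in one Zaqar post.

    Zaqar checks the size of the whole serialized collection, so one very
    large message (e.g. a full ansible/puppet log) can break the post.
    """
    return len(json.dumps(messages)) <= ZAQAR_MAX_POST_SIZE
```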

Comment 1 Dougal Matthews 2017-10-11 12:41:13 UTC
Can you provide the Mistral logs for this? I'm having trouble tracking down the issue.

It looks like the workflow is attempting to send a message to Zaqar that is larger than the allowed limit. From reading the tripleo.package_update.v1 workflow and the custom action I can't figure out where that would come from.

I'm hoping that a traceback in the logs will provide more details

Comment 3 Thomas Hervé 2017-11-02 20:30:28 UTC
The message size limit is already set by instack. The message posted here is the result of the ansible/puppet upgrade run; it's about 1.2M, more than the 1M allowed. I suggest limiting the message size, something like this: http://paste.openstack.org/show/625389/ in tripleo-common.

That said, it's bad to have that much data transit through ansible/mistral. Long term, it'd be nice to either produce fewer logs, or push them to swift directly. There is also an unhealthy number of warnings produced by the puppet run.
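The suggested workaround, truncating the message before posting, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the merged tripleo-common patch; `MAX_MESSAGE_SIZE` and `truncate_message` are illustrative names. The merged gerrit change caps the posted message at 512 kbytes, comfortably under Zaqar's 1M collection limit:

```python
import json

# Cap well under Zaqar's 1048576-byte collection limit (the merged fix
# used 512 kbytes).
MAX_MESSAGE_SIZE = 512 * 1024

def truncate_message(body):
    """Truncate the free-text 'message' field of a message body so its
    JSON-serialized form stays within MAX_MESSAGE_SIZE.

    Keeps the message structure intact and cuts only the long payload
    (here, the ansible/puppet run output).
    """
    serialized = json.dumps(body)
    if len(serialized) <= MAX_MESSAGE_SIZE:
        return body
    overage = len(serialized) - MAX_MESSAGE_SIZE
    text = body.get('message', '')
    marker = '... (truncated)'
    body['message'] = text[:max(0, len(text) - overage - len(marker))] + marker
    return body
```

A message under the cap passes through unchanged; an oversized one loses only the tail of its log text, so the status and structure still reach the waiting client.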

Comment 4 Marios Andreou 2017-11-07 12:48:43 UTC
o/ thanks Thomas - yeah, agreed the truncate is not ideal. I have been holding off on posting the review to tripleo-common this morning, hoping someone would come up with a better way. I haven't heard one, so I'll post it in a moment anyway and we can take it from there.

Comment 5 Marios Andreou 2017-11-20 12:24:47 UTC
This is merged to Pike, so moving to POST.

Note that thankfully there is a better fix being tracked for https://bugzilla.redhat.com/show_bug.cgi?id=1505926 which will prevent these huge messages in the first place.

Comment 7 Jon Schlueter 2017-11-22 17:07:49 UTC
openstack-tripleo-common-7.6.3-4.el7ost

Comment 11 Yurii Prokulevych 2017-12-11 11:30:49 UTC
Verified with openstack-tripleo-common-7.6.3-8.el7ost.noarch

tail oc-update-*log
==> oc-update-00-Controller.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.20]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.15              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.17              : ok=114  changed=56   unreachable=0    failed=0   ',
 u'192.168.24.20              : ok=112  changed=56   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-CephStorage.log <==
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.18]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.14              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.18              : ok=56   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.9               : ok=56   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

==> oc-update-Compute.log <==
 u'',
 u'TASK [debug] *******************************************************************',
 u'skipping: [192.168.24.10]',
 u'',
 u'PLAY RECAP *********************************************************************',
 u'192.168.24.10              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'192.168.24.12              : ok=58   changed=13   unreachable=0    failed=0   ',
 u'']
('Response is not a JSON object.', ValueError('No JSON object could be decoded',))
Success

Comment 14 errata-xmlrpc 2017-12-13 22:13:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

