Bug 1664055 - Message collection size is too large for Zaqar
Summary: Message collection size is too large for Zaqar
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Adriano Petrich
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks: 1712278
 
Reported: 2019-01-07 15:42 UTC by Udi Kalifon
Modified: 2023-12-15 16:17 UTC
CC List: 18 users

Fixed In Version: openstack-tripleo-common-10.8.2-0.20191125220527.c2a83c1.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1712278 (view as bug list)
Environment:
Last Closed: 2020-02-27 15:56:29 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 629007 0 None MERGED Use execution_id in Zaqar messages handler 2020-12-07 17:09:26 UTC
OpenStack gerrit 630291 0 None MERGED Stop sending execution object via Zaqar message 2020-12-07 17:09:27 UTC
OpenStack gerrit 630970 0 None MERGED Remove execution from workflow message send 2020-12-07 17:09:26 UTC
OpenStack gerrit 682151 0 None MERGED Stop sending execution object via Zaqar message 2020-12-07 17:09:54 UTC
Red Hat Knowledge Base (Solution) 4396231 0 None None None 2019-09-06 02:30:56 UTC

Description Udi Kalifon 2019-01-07 15:42:05 UTC
Description of problem:
I tried to deploy 3 controllers + 2 computes + 3 Ceph nodes with network isolation in a virtual test environment from the GUI (OSP14). My only configuration mistake was that I forgot to change the Ceph defaults, which create too many placement groups (PGs) and pools, so the deployment failed.

From the GUI, I clicked to open the failure details dialog, but the workflow got stuck. We see this in the executor log:

# /var/log/containers/mistral/executor.log 

ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.: ActionException: ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor ActionException: ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
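
For reference, the 1048576-byte limit in this error is Zaqar's default max_messages_post_size (1 MiB). A minimal sketch of the relevant stanza in /etc/zaqar/zaqar.conf, with the default value shown (the option name and [transport] section match the workarounds in later comments):

    [transport]
    # Maximum total size, in bytes, of a single message-post request (1 MiB default)
    max_messages_post_size = 1048576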


Version-Release number of selected component (if applicable):
openstack-zaqar-7.0.1-0.20180917132250.5932b8f.el7ost.noarch


How reproducible:
unknown


Steps to Reproduce:
1. Deploy a setup as described above

Comment 2 Jiri Tomasek 2019-01-07 16:12:32 UTC
In general, we can remove the deprecated 'execution' item from the Zaqar message payload [1]. This item can potentially contain a lot of data.

Specifically for this bug, the "deployment_failures" output [2] can be quite large, so we should either reduce it somehow or not include it in the message at all.

[1] https://github.com/openstack/tripleo-common/blob/master/workbooks/messaging.yaml#L34
[2] https://github.com/openstack/tripleo-common/blob/master/workbooks/deployment.yaml#L939
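
As an illustration only (a hypothetical helper, not the actual tripleo-common code), trimming such a payload before posting could look like this:

    import json

    # Zaqar's default max_messages_post_size, per the error in the description
    MAX_ZAQAR_MESSAGE_BYTES = 1048576

    def trim_payload(payload):
        # Drop the deprecated 'execution' object, then replace an oversized
        # 'deployment_failures' field with a pointer to the full logs.
        payload = dict(payload)
        payload.pop('execution', None)
        if len(json.dumps(payload)) > MAX_ZAQAR_MESSAGE_BYTES:
            payload['deployment_failures'] = 'truncated; see mistral executor logs'
        return payload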

Comment 3 Jiri Tomasek 2019-01-07 16:22:11 UTC
Partial fix: https://review.openstack.org/629007 - makes tripleo-ui ready for the removal of the deprecated 'execution' object from the Zaqar message payload

Comment 4 Jiri Tomasek 2019-01-11 15:12:00 UTC
Before it is possible to remove the execution from the Zaqar message payload, python-tripleoclient needs to be updated [1]:
The wait_for_message function needs to stop referencing the execution ID from the execution object and use execution_id instead. In addition, once the message arrives or the timeout runs out, we should poll for the execution until it is no longer in the RUNNING state, and then return it. This keeps Zaqar messages small while still providing the execution data once the workflow finishes.

Once that is in, we can update the send_message workflow to remove the execution from the payload (and include root_execution_id, which is used by tripleoclient too).

[1] https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/base.py#L48
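
A minimal sketch of the polling approach described above, assuming a mistralclient-style executions API (client setup omitted; names are illustrative):

    import time

    def wait_for_execution(mistral_client, execution_id, poll_interval=5):
        # Poll Mistral until the workflow execution leaves the RUNNING state,
        # then return the full execution object to the caller.
        while True:
            execution = mistral_client.executions.get(execution_id)
            if execution.state != 'RUNNING':
                return execution
            time.sleep(poll_interval)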

Comment 5 Jiri Tomasek 2019-01-11 15:34:35 UTC
Patch for the tripleo-common send_message workflow (needs to merge after the tripleoclient patches): https://review.openstack.org/630291 - Stop sending execution object via Zaqar message

Comment 6 Chris Smart 2019-02-05 07:36:24 UTC
I seem to be hitting this on the Queens release with a PKCS split controller configuration and the following role counts:

  ServiceCount: 5
  CoreCount: 3
  TelemetryCount: 3
  ComputeType1Count: 3
  ComputeType2Count: 2
  ComputeType3Count: 5
  ComputeType4Count: 2


Undercloud install result:

...
Processing templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.plan_management.v1.get_deprecated_parameters. Execution ID: 54ef069a-ff91-4fbb-859b-54f7cc341415
Deploying templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: 0eac0392-e6d5-4e7c-9f0e-3f48fe454658

('The read operation timed out',)

real    14m1.281s
user    0m3.025s
sys     0m0.551s



Mistral executor log:

...
2019-02-05 17:57:19.072 6537 WARNING mistral.actions.openstack.base [req-72313140-9c9c-43ca-a4c1-5951aef7b754 614efccc01de4ca7987bbe6c2aec651a 0004aef92a3d41f6ba6854b547ffb92e - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 117, in run
    result = method(**self._kwargs_for_run)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 436, in wrap
    return method(client, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 482, in queue_post
    return queue.post(messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
    self._name, messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
    resp = transport.send(request)
  File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
    raise self.http_to_zaqar[resp.status_code](**kwargs)
MalformedRequest: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.


Doubling the maximum message size in /etc/zaqar/zaqar.conf got me going:

  max_messages_post_size=2097152
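
(For scale: 2097152 bytes is exactly double Zaqar's 1048576-byte default, the limit quoted in the error message above.)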

Comment 9 Adriano Petrich 2019-04-17 08:13:43 UTC
I'm looking into the patches that still need fixing before merging. Then we can look at what to do about backporting to OSP 13.

Comment 10 Chris Smart 2019-04-30 03:29:07 UTC
Actually, I just hit another one in RHOSP 13:

# tail -f /var/log/mistral/engine.log
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 715, in fire_replace_event
    state, value, previous, initiator or self._replace_token)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/events.py", line 1893, in wrap
    fn(target, *arg)
  File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 94, in <lambda>
    lambda t, v, o, i: validate_long_type_length(cls, attr_name, v)
  File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 80, in validate_long_type_length
    raise exc.SizeLimitExceededException(msg)
SizeLimitExceededException: Field size limit exceeded [class=TaskExecution, field=output, size=17942KB, limit=16384KB]

So I had to increase this in zaqar.conf and restart the service:

#producer_batch_size=16384
producer_batch_size=32768

Comment 11 Chris Smart 2019-04-30 05:08:43 UTC
(In reply to Chris Smart from comment #10)
> So I had to increase this in zaqar.conf and restart the service:
> 
> #producer_batch_size=16384
> producer_batch_size=32768

Sorry, I pasted the wrong config file... :-S

/etc/mistral/mistral.conf

[engine]
execution_field_size_limit_kb=32768
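
(For scale: the SizeLimitExceededException in comment 10 reported size=17942KB against the default limit=16384KB, so 32768 KB leaves comfortable headroom.)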

Comment 13 Adriano Petrich 2019-05-02 08:57:56 UTC
I think that I've traced the error and I'm testing a patch for it.

Comment 14 Adriano Petrich 2019-05-06 15:05:43 UTC
This patch, https://review.opendev.org/#/c/630291/, together with the patch it depends on, fixes the issue.

Comment 16 Uemit Seren 2019-07-04 16:02:38 UTC
We are running into the same issue with Queens (OSP13). This happened after we added the 8th node role.

This fixed it (thanks to the previous comments by Chris): 

/etc/mistral/mistral.conf
[engine]
execution_field_size_limit_kb=32768


/etc/zaqar/zaqar.conf
[transport]
max_messages_post_size=2097152
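
As Chris noted in comment 10, restart the affected services after changing these files so the new limits take effect.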

A backport would be great.

Comment 19 Lon Hohberger 2019-12-03 17:38:11 UTC
This bug is currently filed against OSP15. If fixes are required for other releases, please clone the bug.

Comment 21 Alex McLeod 2020-02-19 12:43:55 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 22 David Rosenfeld 2020-02-27 15:56:29 UTC
Discussed this BZ with the DFG and the resolution was to close it.

Thanks, sounds like we should just close it. I had never seen this BZ until we got an automated message saying it needed to be verified. It doesn't seem like something that a lot of time should be spent on.
On 2/26/20 2:59 PM, Phil Weeks wrote:
> David, as both Alex and Emilien have implied, the UI was deprecated in 15.
> Only maintenance activity is ongoing for 14 and earlier.
> If you indeed feel there's a verification needed, UI maintenance is part of cloudops DFG.
> Phil
>
> On Wed, Feb 26, 2020 at 1:12 PM Emilien Macchi <emilien> wrote:
>
>     I would close that bug.
>     There is no more DFG for UI and the rhos-prio cases are closed.
>
>     On Wed, Feb 26, 2020 at 1:10 PM Joe Hakim Rahme <jhakimra> wrote:
>
>         > It was found using a GUI. We don't have any GUI tests or a way to verify
>         > it. Is there any group or DFG that should be doing that?
>
>         Unfortunately it seems that nobody currently has the tools or the
>         knowledge to properly qualify (G)UI tools of OpenStack. We are
>         addressing this situation by looking to hire engineers to focus on
>         this, and it probably won't be part of DFG:DF.
>
>         In the meantime, does anyone in the DFG have enough experience with
>         GUI to help David verify this BZ properly? If nobody can, then I'm
>         afraid we'll have to delay the release of this BZ until we have
>         someone to take care of it.

