Description of problem:
I tried to deploy 3 controllers + 2 computes + 3 ceph with network isolation in a virtual test environment from the GUI (OSP14). My only configuration mistake was that I forgot to change the Ceph defaults, which result in too many PGs and pools, so the deployment failed. From the GUI, I clicked to open the failure details dialog, but the workflow got stuck.

We see this in the executor log:

# /var/log/containers/mistral/executor.log
ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.: ActionException: ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor Traceback (most recent call last):
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2019-01-07 09:37:29.806 1 ERROR mistral.executors.default_executor ActionException: ZaqarAction.queue_post failed: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.

Version-Release number of selected component (if applicable):
openstack-zaqar-7.0.1-0.20180917132250.5932b8f.el7ost.noarch

How reproducible:
unknown

Steps to Reproduce:
1. Deploy a setup as described above
In general, we can remove the deprecated 'execution' item from the Zaqar message payload [1]. This item can potentially contain a lot of data. Specifically for this bug, the "deployment_failures" output [2] can be quite large, so we should either reduce it somehow or not include it in the message at all. The sketch below illustrates the size problem.

[1] https://github.com/openstack/tripleo-common/blob/master/workbooks/messaging.yaml#L34
[2] https://github.com/openstack/tripleo-common/blob/master/workbooks/deployment.yaml#L939
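For illustration only, a minimal Python sketch of why the message blows past Zaqar's default limit: max_messages_post_size defaults to 1048576 bytes, an embedded execution object (carrying per-node deployment_failures) can easily exceed that, while a payload that carries only identifiers stays tiny. The payload contents below are invented stand-ins, not what the send_message workflow actually produces:

import json

ZAQAR_DEFAULT_MAX = 1048576  # bytes, default max_messages_post_size

# Hypothetical payloads; the real ones are built by the send_message workflow.
full_payload = {
    'execution': {
        'id': 'abc123',
        # stand-in for multi-megabyte Heat/Ansible failure output
        'output': {'deployment_failures': 'x' * 2000000},
    },
}
slim_payload = {
    'execution_id': 'abc123',
    'root_execution_id': 'def456',
    'status': 'FAILED',
}

for name, payload in (('full', full_payload), ('slim', slim_payload)):
    size = len(json.dumps(payload).encode('utf-8'))
    print('%s payload: %d bytes (%s)'
          % (name, size, 'ok' if size <= ZAQAR_DEFAULT_MAX else 'rejected with HTTP 400'))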
Partial fix: https://review.openstack.org/629007 - makes tripleo-ui ready for the removal of the deprecated 'execution' object from the Zaqar message payload.
Before it is possible to remove 'execution' from the Zaqar message payload, python-tripleoclient needs to be updated [1]: the wait_for_message function needs to stop referencing the execution id from the execution object and use execution_id instead. In addition, once the message arrives or the timeout runs out, we should poll for the execution until it is no longer in the RUNNING state, and return the execution at that point (see the sketch below). This keeps Zaqar messages small while still providing the execution data once the workflow finishes. Once that is in, we can update the send_message workflow to remove 'execution' from the payload (and include root_execution_id, which is used by tripleoclient too).

[1] https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/base.py#L48
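A minimal sketch of the polling behaviour described above, not the actual tripleoclient implementation; the function name is hypothetical and the mistralclient executions.get() call returning an object with a .state attribute is an assumption here:

import time

def wait_for_execution(mistral_client, execution_id, timeout=600, interval=5):
    """Poll Mistral until the execution leaves the RUNNING state.

    The Zaqar message only needs to carry execution_id; the full execution
    (including any large output such as deployment_failures) is fetched from
    Mistral once the workflow has finished, so the message itself stays small.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        execution = mistral_client.executions.get(execution_id)  # assumed API
        if execution.state != 'RUNNING':
            return execution
        time.sleep(interval)
    raise RuntimeError('Timed out waiting for execution %s' % execution_id)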
Patch for the tripleo-common send_message workflow (needs to merge after the tripleoclient patches):
https://review.openstack.org/630291 - Stop sending execution object via Zaqar message
I seem to be hitting this on the Queens release with a PKCS split controller configuration with the following:

ServiceCount: 5
CoreCount: 3
TelemetryCount: 3
ComputeType1Count: 3
ComputeType2Count: 2
ComputeType3Count: 5
ComputeType4Count: 2

Undercloud install result:

...
Processing templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.plan_management.v1.get_deprecated_parameters. Execution ID: 54ef069a-ff91-4fbb-859b-54f7cc341415
Deploying templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: 0eac0392-e6d5-4e7c-9f0e-3f48fe454658
('The read operation timed out',)

real    14m1.281s
user    0m3.025s
sys     0m0.551s

Mistral executor log:

...
2019-02-05 17:57:19.072 6537 WARNING mistral.actions.openstack.base [req-72313140-9c9c-43ca-a4c1-5951aef7b754 614efccc01de4ca7987bbe6c2aec651a 0004aef92a3d41f6ba6854b547ffb92e - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 117, in run
    result = method(**self._kwargs_for_run)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 436, in wrap
    return method(client, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 482, in queue_post
    return queue.post(messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
    self._name, messages)
  File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
    resp = transport.send(request)
  File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
    raise self.http_to_zaqar[resp.status_code](**kwargs)
MalformedRequest: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.

Doubling the max message size in /etc/zaqar/zaqar.conf got me going:

max_messages_post_size=2097152
I'm looking into the patches that still need fixing before merging. Then we can look at what to do about backporting to OSP13.
Actually, I just hit another size limit in RHOSP13:

# tail -f /var/log/mistral/engine.log
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 715, in fire_replace_event
    state, value, previous, initiator or self._replace_token)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/events.py", line 1893, in wrap
    fn(target, *arg)
  File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 94, in <lambda>
    lambda t, v, o, i: validate_long_type_length(cls, attr_name, v)
  File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 80, in validate_long_type_length
    raise exc.SizeLimitExceededException(msg)
SizeLimitExceededException: Field size limit exceeded [class=TaskExecution, field=output, size=17942KB, limit=16384KB]

So I had to increase this in zaqar.conf and restart the service:

#producer_batch_size=16384
producer_batch_size=32768
(In reply to Chris Smart from comment #10)
> So I had to increase this in zaqar.conf and restart the service:
>
> #producer_batch_size=16384
> producer_batch_size=32768

Sorry, pasted wrong config file... :-S

/etc/mistral/mistral.conf
[engine]
execution_field_size_limit_kb=32768
I think that I've traced the error and I'm testing a patch for it.
This patch, https://review.opendev.org/#/c/630291/, together with its depends-on patch, fixes the issue.
We are running into the same issue with Queens (OSP13). This happened after we added the 8th node role. This fixed it (thanks to the previous comments by Chris):

/etc/mistral/mistral.conf
[engine]
execution_field_size_limit_kb=32768

/etc/zaqar/zaqar.conf
[transport]
max_messages_post_size=2097152

Backport would be great.
This bug is currently filed against OSP15. If fixes are required for other releases, please clone the bug.
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.
Discussed this BZ with DFG and resolution was to close.

Thanks, sounds like we should just close it. Had never seen this BZ until we got an automated message saying it needed to be verified. It doesn't seem like something that a lot of time should be spent on.

On 2/26/20 2:59 PM, Phil Weeks wrote:
> David, as both alex and emilien have inferred, UI was deprecated in 15.
> Only maintenance activity is ongoing for 14 and earlier.
> If you indeed feel there's a verification needed, UI maintenance is part of cloudops DFG.
> Phil
>
> On Wed, Feb 26, 2020 at 1:12 PM Emilien Macchi <emilien> wrote:
>
> I would close that bug.
> There is no more DFG for UI and the rhos-prio cases are closed.
>
> On Wed, Feb 26, 2020 at 1:10 PM Joe Hakim Rahme <jhakimra> wrote:
>
> > It was found using a gui. We don't have any gui tests or a way to verify
> > it. Is there any group or DFG that should be doing that?
>
> Unfortunately it seems that nobody currently has the tools or the
> knowledge to properly qualify (G)UI tools of openstack. We are
> addressing this situation by looking to hire engineers to focus on
> this, and it probably won't be part of DFG:DF.
>
> In the meantime, does anyone in the DFG have enough experience with
> GUI to help David verify this BZ properly? If nobody can, then I'm
> afraid we'll have to delay the release of this BZ until we have
> someone to take care of it.