Bug 1664055
| Summary: | Message collection size is too large for Zaqar | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Kalifon <ukalifon> | |
| Component: | openstack-tripleo-common | Assignee: | Adriano Petrich <apetrich> | |
| Status: | CLOSED EOL | QA Contact: | Alexander Chuzhoy <sasha> | |
| Severity: | high | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 14.0 (Rocky) | CC: | apetrich, apevec, beth.white, bshephar, chrisbro, chris.smart, drosenfe, dvd, jbiao, jbuchta, jschluet, jtomasek, lhh, mburns, rrasouli, slinaber, sputhenp, uemit.seren | |
| Target Milestone: | --- | Keywords: | Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-tripleo-common-10.8.2-0.20191125220527.c2a83c1.el8ost | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1712278 (view as bug list) | Environment: | ||
| Last Closed: | 2020-02-27 15:56:29 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1712278 | |||
|
Description
Udi Kalifon
2019-01-07 15:42:05 UTC
In general, we can remove the deprecated 'execution' item from the Zaqar message payload [1]. This item can potentially contain a lot of data. Specifically for this bug, the "deployment_failures" [2] can be quite large, so we should either reduce it or not include it in the message.
[1] https://github.com/openstack/tripleo-common/blob/master/workbooks/messaging.yaml#L34
[2] https://github.com/openstack/tripleo-common/blob/master/workbooks/deployment.yaml#L939

Partial fix: https://review.openstack.org/629007 - makes tripleo-ui ready for removal of the deprecated 'execution' object from the Zaqar message payload

Before it is possible to remove execution from the Zaqar message payload, python-tripleoclient needs to be updated [1]: the wait_for_message function needs to stop reading the execution id from the execution object and use execution_id instead. In addition, once the message arrives or the timeout runs out, we should start polling for the execution until it is no longer in the RUNNING state. At that point, the execution should be returned. This allows us to keep Zaqar messages small while still providing the execution data once the workflow finishes. Once that is in, we can update the send_message workflow to remove execution from the payload (and include root_execution_id, which is used by tripleoclient too).
[1] https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/base.py#L48

Patch for tripleo-common send_message workflow - needs to merge after the tripleoclient patches
https://review.openstack.org/630291 Stop sending execution object via Zaqar message

I seem to be hitting this on Queens release with PKCS split controller configuration with the following:
ServiceCount: 5
CoreCount: 3
TelemetryCount: 3
ComputeType1Count: 3
ComputeType2Count: 2
ComputeType3Count: 5
ComputeType4Count: 2
Undercloud install result:
...
Processing templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.plan_management.v1.get_deprecated_parameters. Execution ID: 54ef069a-ff91-4fbb-859b-54f7cc341415
Deploying templates in the directory /tmp/tripleoclient-ntwHQD/tripleo-heat-templates
Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: 0eac0392-e6d5-4e7c-9f0e-3f48fe454658
('The read operation timed out',)
real 14m1.281s
user 0m3.025s
sys 0m0.551s
Mistral executor log:
...
2019-02-05 17:57:19.072 6537 WARNING mistral.actions.openstack.base [req-72313140-9c9c-43ca-a4c1-5951aef7b754 614efccc01de4ca7987bbe6c2aec651a 0004aef92a3d41f6ba6854b547ffb92e - default default] Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 117, in run
result = method(**self._kwargs_for_run)
File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 436, in wrap
return method(client, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 482, in queue_post
return queue.post(messages)
File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/queues.py", line 170, in post
self._name, messages)
File "/usr/lib/python2.7/site-packages/zaqarclient/queues/v1/core.py", line 242, in message_post
resp = transport.send(request)
File "/usr/lib/python2.7/site-packages/zaqarclient/transport/http.py", line 114, in send
raise self.http_to_zaqar[resp.status_code](**kwargs)
MalformedRequest: Error response from Zaqar. Code: 400. Title: Invalid API request. Description: Message collection size is too large. Max size 1048576.
Doubling the max messages size in /etc/zaqar/zaqar.conf got me going:
max_messages_post_size=2097152
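
Raising the limit works around the 400 error, but the fix discussed in the description is to keep the message itself small. As a rough sketch of that idea, the snippet below estimates the size of a message list and drops the bulky keys when it would exceed the default limit quoted in the error above. The message layout (`body` / `payload`), the helper names, and the use of the JSON-encoded length as a stand-in for how Zaqar measures the collection size are all assumptions made for illustration, not the tripleo-common implementation.

```python
import json

# Default limit in bytes, as quoted in the Zaqar error message above.
MAX_MESSAGES_POST_SIZE = 1048576


def payload_size(messages):
    """Approximate the posted size as the length of the JSON-encoded list.

    Assumption: Zaqar's own accounting may differ slightly from this.
    """
    return len(json.dumps(messages).encode("utf-8"))


def trim_message(message, limit=MAX_MESSAGES_POST_SIZE):
    """Return a copy of `message` without the bulky keys if it is too large.

    The "body"/"payload" layout is a hypothetical shape for illustration.
    The two keys removed are the ones named in this report as potentially
    large: 'execution' and 'deployment_failures'.
    """
    if payload_size([message]) <= limit:
        return message

    # Copy the nested dicts so the caller's message is left untouched.
    body = dict(message.get("body", {}))
    payload = dict(body.get("payload", {}))
    for key in ("execution", "deployment_failures"):
        payload.pop(key, None)
    body["payload"] = payload
    return dict(message, body=body)
```

A consumer that still needs the execution data would then look it up by `execution_id` rather than reading it from the message.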
I'm looking into the patches that still need fixing before merging. Then we can look at what to do about backporting to OSP 13.

Actually I just hit another in RHOSP13:
# tail -f /var/log/mistral/engine.log
File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 715, in fire_replace_event
state, value, previous, initiator or self._replace_token)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/events.py", line 1893, in wrap
fn(target, *arg)
File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 94, in <lambda>
lambda t, v, o, i: validate_long_type_length(cls, attr_name, v)
File "/usr/lib/python2.7/site-packages/mistral/db/v2/sqlalchemy/models.py", line 80, in validate_long_type_length
raise exc.SizeLimitExceededException(msg)
SizeLimitExceededException: Field size limit exceeded [class=TaskExecution, field=output, size=17942KB, limit=16384KB]
So I had to increase this in zaqar.conf and restart the service:
#producer_batch_size=16384
producer_batch_size=32768
(In reply to Chris Smart from comment #10)
> So I had to increase this in zaqar.conf and restart the service:
>
> #producer_batch_size=16384
> producer_batch_size=32768

Sorry, pasted wrong config file... :-S

/etc/mistral/mistral.conf
[engine]
execution_field_size_limit_kb=32768

I think that I've traced the error and I'm testing a patch for it.

This https://review.opendev.org/#/c/630291/ and the patch it depends on fix the issue.

We are running into the same issue with Queens (OSP13). This happened after we added the 8th node role. This fixed it (thanks to the previous comments by Chris):

/etc/mistral/mistral.conf
[engine]
execution_field_size_limit_kb=32768

/etc/zaqar/zaqar.conf
[transport]
max_messages_post_size=2097152

Backport would be great.

This bug is currently filed against OSP15. If fixes are required for other releases, please clone the bug.

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Discussed this BZ with DFG and resolution was to close.

Thanks, sounds like we should just close it. Had never seen this BZ until we got an automated message saying it needed to be verified. It doesn't seem like something that a lot of time should be spent on.

On 2/26/20 2:59 PM, Phil Weeks wrote:
> David, as both alex and emilien have inferred. UI was deprecated in 15.
> Only maintenance activity is ongoing for 14 and earlier.
> If you indeed feel there's a verification needed, UI maintenance is part of cloudops DFG.
> Phil
>
> On Wed, Feb 26, 2020 at 1:12 PM Emilien Macchi <emilien> wrote:
>
> I would close that bug.
> There is no more DFG for UI and the rhos-prio cases are closed.
>
> On Wed, Feb 26, 2020 at 1:10 PM Joe Hakim Rahme <jhakimra> wrote:
>
> > It was found using a gui. We don't have any gui tests or a way to verify
> > it. Is there any group or DFG that should be doing that?
>
> Unfortunately it seems that nobody currently has the tools or the
> knowledge to properly qualify (G)UI tools of openstack. We are
> addressing this situation by looking to hire engineers to focus on
> this, and it probably won't be part of DFG:DF.
>
> In the meantime, does anyone in the DFG have enough experience with
> GUI to help David verify this BZ properly? If nobody can, then I'm
> afraid we'll have to delay the release of this BZ until we have
> someone to take care of it.
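
For reference, the client-side behaviour proposed in the description (once the Zaqar message arrives or the timeout runs out, poll the execution until it is no longer RUNNING, then return it) could look roughly like the sketch below. It is not the python-tripleoclient code: `wait_for_execution` and the `get_execution` callable are hypothetical names, and the execution is assumed to be a dict with a `state` key.

```python
import time


def wait_for_execution(get_execution, execution_id, interval=5.0,
                       timeout=600.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll until the execution leaves the RUNNING state, then return it.

    `get_execution` is a hypothetical callable that takes an execution id
    and returns a dict with at least a "state" key; in a real client it
    would wrap whatever API call fetches the workflow execution.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        execution = get_execution(execution_id)
        if execution["state"] != "RUNNING":
            return execution
        if clock() >= deadline:
            raise TimeoutError(
                "execution %s still RUNNING after %s seconds"
                % (execution_id, timeout))
        sleep(interval)
```

Fetching the execution by id this way means the Zaqar message only has to carry `execution_id`, which is what keeps the message small.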