Bug 1853635 - [OSP16.1-RC][Large Scale Test] Overcloud Heat stack failed with following error: "UPDATE_FAILED Expression consumed too much memory".
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Luke Short
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-03 12:10 UTC by Pradipta Kumar Sahoo
Modified: 2020-11-10 19:58 UTC
CC List: 8 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170156.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:38:12 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1886203 0 None None None 2020-07-03 14:56:34 UTC
OpenStack gerrit 739249 0 None MERGED undercloud/heat: set YAQL memory quota to 200000 2021-01-25 09:47:28 UTC
OpenStack gerrit 739632 0 None MERGED undercloud/heat: set YAQL memory quota to 200000 2021-01-25 09:47:27 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:38:33 UTC

Description Pradipta Kumar Sahoo 2020-07-03 12:10:09 UTC
Description of problem:
In a large-scale test, the overcloud Heat stack update failed while scaling out compute nodes from 200 to 250.
Up to 200 nodes, we did not face any issues with the overcloud Heat stack.
We hit the issue after adding 50 compute nodes to a stack that already contained 200 compute nodes. To reproduce the issue, we used two types of composable hardware, 50 nodes of each (50x1029p and 50x1029u).

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.0 RC (Train)
Red Hat Enterprise Linux release 8.2 (Ootpa)
python3-tripleoclient-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200615103427.6f877f6.el8ost.noarch


How reproducible: 100% reproducible in the Scale lab.


Steps to Reproduce:
1. Deployed the overcloud and successfully scaled out to 200 compute nodes.

$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 94a1e1aa-c10e-4597-8050-4c95b8118388 | overcloud  | 5afea8d232064664b24278742e2cca22 | UPDATE_FAILED | 2020-07-01T15:35:11Z | 2020-07-03T10:06:18Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

2. Added 50 compute nodes (1029p/1029u), but the Heat stack update failed with the memory error below.
    $ openstack stack event list --nested-depth 5 overcloud|grep -i FAILED
    2020-07-03 01:27:50Z [overcloud]: UPDATE_FAILED  Expression consumed too much memory
    2020-07-03 07:07:37Z [overcloud]: UPDATE_FAILED  Expression consumed too much memory
    2020-07-03 09:17:42Z [overcloud]: UPDATE_FAILED  Expression consumed too much memory
    2020-07-03 11:02:43Z [overcloud]: UPDATE_FAILED  Expression consumed too much memory

3. The heat-engine log reported the following exceptions.

$ grep ^"2020-07-03 11" /var/log/containers/heat/heat-engine.log
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Unexpected exception in resource check.: yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource Traceback (most recent call last):
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource   File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 313, in check
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource     adopt_stack_data)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource   File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 152, in _do_check_resource
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource     stack, self.msg_queue)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource   File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 395, in check_resource_update
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource     check_message)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource   File "/usr/lib/python3.6/site-packages/heat/engine/resource.py", line 1462, in update_convergence
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource     runner(timeout=timeout, progress_callback=progress_callback)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource   File "/usr/lib/python3.6/site-packages/heat/engine/scheduler.py", line 163, in __call__
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource     progress_callback=progress_callback):
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource 
2020-07-03 11:02:43.351 42 INFO heat.engine.stack [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Stack UPDATE FAILED (overcloud): Expression consumed too much memory
2020-07-03 11:02:43.364 59 DEBUG heat.engine.sync_point [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] [8372:a425d54f-40f5-47a8-80f8-e773c19c0003:False] Waiting 8372: Got ConvergenceNode(rsrc_id=8372, is_update=True); still need ConvergenceNode(rsrc_id=8369, is_update=False) sync /usr/lib/python3.6/site-packages/heat/engine/sync_point.py:148
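
For reference, the quota this exception enforces is heat's [yaql] memory_quota setting on the undercloud, not the heat-engine process memory usage (see comment 1). A quick, hedged way to check what the deployed heat-engine is currently running with (the config path is the puppet-generated file quoted later in this bug; if no [yaql] section exists, heat's built-in defaults apply):

    $ sudo grep -A3 '^\[yaql\]' /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf \
        || echo "no [yaql] section found; heat-engine is using its built-in defaults"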


4. No failures for TripleO containers; the only failed systemd unit is NetworkManager-wait-online.service (shown below).

$ systemctl list-units|grep -i fail
● NetworkManager-wait-online.service    loaded failed failed    Network Manager Wait Online


5. Heat queue memory consumption in RabbitMQ:

$ sudo podman exec -it -u root rabbitmq rabbitmqctl list_queues name messages memory consumers|grep heat
heat-engine-listener.2333acf3-9fe3-4a66-bb63-af1745f9fe01       0       34876   1
heat-engine-listener_fanout_22414f883aae43bb8106e8559c4d74e3    0       34876   1
heat-engine-listener_fanout_ea46f05f1b724da19d9022a7104e14dc    0       34876   1
heat-engine-listener_fanout_5584624c14b7462e99bf3dba25ba7320    0       34876   1
heat-engine-listener_fanout_8b8627681b7f439bb85a7ceaf14e018c    0       34876   1
heat-engine-listener.8fa46a3e-8219-4f54-a2df-85895b38c12e       0       34876   1
heat-engine-listener_fanout_79bb5ccf6bf143ca87e7edaec3fcac24    0       34876   1
heat-engine-listener.c02eb874-bde6-4fd9-b42f-63a05413232d       0       34876   1
heat-engine-listener_fanout_e62ecc4e710f4f61bccd9e745a55f3a9    0       34876   1
heat-engine-listener_fanout_61e5480cfa4441cda92753af42797aec    0       34876   1
heat-engine-listener.79663783-2b14-43e3-ae5f-9e86e4e55cc5       0       34876   1
heat-engine-listener_fanout_514b9c7045ae4c158e3d90b099ca2670    0       34876   1
heat-engine-listener_fanout_5d92c54b968243fca50589b4ce122fe9    0       34876   1
heat-engine-listener.decec203-0f5b-4f06-a849-ed3f0c3c00e2       0       34876   1
heat-engine-listener.c5faf3f5-6388-43f7-8735-ac25feaf9bbe       0       34876   1
heat-engine-listener.49888dcc-1706-40ff-95b4-60860d79c79c       0       34876   1
heat-engine-listener_fanout_027da2962c724555bb3c7fedcc1563d1    0       34876   1
heat-engine-listener.400bf389-70d2-46a3-9e86-514468b0ecf5       0       34876   1
heat-engine-listener_fanout_da09506a1dd94b588646e400a3d6a4f7    0       34876   1
heat-engine-listener_fanout_070dce69c83a44f6b10039bdf04cb207    0       34876   1
heat-engine-listener.e26fb6b1-3adb-44d1-8c13-08da00649ad4       0       34876   1
heat-engine-listener_fanout_49c1f88a28314d1291b8bcbf42217116    0       34876   1
heat-engine-listener.da97ed73-c704-44e3-83ab-41a7edda5328       0       34876   1
heat-engine-listener_fanout_bbe95f9fc7d24c6bb62cf8710e57750c    0       34876   1
heat-engine-listener_fanout_1da2d891d71a403eaa77426567dfdecc    0       34876   1
heat-engine-listener.16a2c102-ac3b-4a1e-9bb0-1c1046fcbc53       0       34876   1
heat-engine-listener_fanout_b4f413fba1e248e388002e9bc01858f3    0       34876   1
heat-engine-listener.359421d4-ac08-4d3c-9d5a-36cebe1f83a8       0       34876   1
heat-engine-listener_fanout_7c26a4bc6a324520a7cbb211e9425129    0       34876   1
heat-engine-listener.83cccf38-c532-4e1a-9190-18dd907e5ac1       0       34876   1
heat-engine-listener    0       58588   24
heat-engine-listener.3d476f14-67c9-4173-87bb-1d684ab460d1       0       34876   1
heat-engine-listener.94dfca03-8481-4f94-b1ae-4abacc5df0d1       0       34876   1
heat-engine-listener.a0dbff6d-8772-44a4-a4b5-8742e01170ee       0       34876   1
heat-engine-listener_fanout_d826fa669c28460c90efd67e683c774f    0       34876   1
heat-engine-listener_fanout_ac24f6d620b44365856c156f6f54a2d7    0       34876   1
heat-engine-listener.ff9e0297-2caf-4955-968b-656bb4862bba       0       34876   1
heat-engine-listener.c2be9a98-9c99-47c3-9442-1bb6a8c2410e       0       34876   1
heat-engine-listener.69826b66-7ddb-4940-90e5-b011ce1e3f66       0       34876   1
heat-engine-listener_fanout_38a7e06a998b498ca1a4a02bf581f3da    0       34876   1
heat-engine-listener_fanout_fa9e41e2c8714bea86e9d249d45d5629    0       34876   1
heat-engine-listener.5a3e1ff9-23a4-4035-90a3-15df9905b1ed       0       34876   1
heat-engine-listener.49492ba6-2bbf-4281-8473-9ab8c7e1b8fb       0       34876   1
heat-engine-listener.cb92c6d4-a42b-4486-9505-fc7d3e10af96       0       34876   1
heat-engine-listener.943126b4-4431-4215-bafd-708775a79cab       0       34876   1
heat-engine-listener.af713d29-358c-4af4-bf5d-6667482f8009       0       34876   1
heat-engine-listener_fanout_b7a6e6c903504e858711b684925943d6    0       34876   1
heat-engine-listener_fanout_19294f8e1aea4255bb60287ec2083e85    0       34876   1
heat-engine-listener_fanout_8f6b117847d34fd1b8e272955811f084    0       34876   1


6. Statistics:
Heat memory usage during stack deployment: https://snapshot.raintank.io/dashboard/snapshot/N97mW02uo3lkw4LOsWWZ475kiohfU6ub
Heat CPU process Usage: https://snapshot.raintank.io/dashboard/snapshot/8qbygU6WS27CV8M4AR2XC0Fdc0uTVaVu

Actual results: The scale test failed at the 250-node count.


Expected results: We never experienced heat-engine memory issues in the OSP 16.0 scale test, where we scaled to 250 nodes without any performance tuning.
So we would expect better heat-engine performance this time.

Additional info: An SOS report of the undercloud node will be uploaded soon.

Comment 1 Rabi Mishra 2020-07-03 13:05:10 UTC
That's a YAQL expression consuming a lot of memory; it has nothing to do with heat-engine memory usage. I have not checked which expression or why, but you can increase (double) the memory_quota (in bytes) for YAQL in heat.conf:

[yaql]
limit_iterators=10000
memory_quota=200000
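
On an already-deployed undercloud, one hedged way to apply this workaround by hand (rather than waiting for the tripleo-heat-templates fix linked above and redeploying the undercloud) is to write the values into the puppet-generated heat.conf, which is bind-mounted into the heat containers, and then restart those containers. This sketch assumes crudini is available on the undercloud and that the container names are heat_engine and heat_api; verify the names with 'sudo podman ps' first.

    # Set the YAQL quotas in the config file the heat containers actually read.
    $ sudo crudini --set /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf yaql limit_iterators 10000
    $ sudo crudini --set /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf yaql memory_quota 200000
    # Restart the heat containers so heat-engine picks up the new quota (container names assumed).
    $ sudo podman restart heat_engine heat_api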

Comment 4 David Rosenfeld 2020-10-12 20:12:22 UTC
From /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf after the undercloud was deployed:

[yaql]
limit_iterators=10000
memory_quota=200000
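
A quick way to confirm the value that heat-engine will read (again assuming crudini is installed; a plain grep of the same file works just as well):

    $ sudo crudini --get /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf yaql memory_quota
    200000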

Comment 11 errata-xmlrpc 2020-10-28 15:38:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

