Bug 1370516
| Summary: | high memory usage of heat-engine worker on undercloud | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Martin Schuppert <mschuppe> |
| Component: | openstack-heat | Assignee: | Zane Bitter <zbitter> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Amit Ugol <augol> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.0 (Liberty) | CC: | aguetta, agurenko, mburns, mschuppe, rhel-osp-director-maint, sbaker, shardy, srevivo, zbitter |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 8.0 (Liberty) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-03 13:08:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Martin Schuppert
2016-08-26 14:27:37 UTC
It's possible that limiting the number of workers to 4 will help, and that is indeed the first thing to try. Also make sure that the max_resources_per_stack option is set to -1: enabling that check not only makes everything slow but probably contributes to memory fragmentation. It's probably unlikely that the patches from https://bugs.launchpad.net/heat/+bug/1570974 will help much - they could be backported, but we haven't bothered up to now because when Steve tested them he reported that the memory usage seemed only "about the same or slightly reduced".

Thanks Zane, I was out the last 2 weeks. I already checked max_resources_per_stack and it is set to -1:

$ grep max_resources_per_stack heat.conf
#max_resources_per_stack = 1000
max_resources_per_stack = -1

What information should be collected to troubleshoot this further, since decreasing the number of heat-engine workers really slows down the deployment tasks?

Hi Martin, I am trying to recreate the environment locally, but so far I have had no resource issues. Can I get the following, please: server model, hard disk model, CPU model, and also the deployment command (how many nodes of each type is heat trying to deploy)?

(In reply to Martin Schuppert from comment #3)
> decreasing the number of heat-engine workers really slows down the deployment tasks

Does it, or are you just conjecturing that it would? The workers spend most of their time polling stuff that doesn't really happen any faster if there are lots of them.

(In reply to Zane Bitter from comment #5)
> (In reply to Martin Schuppert from comment #3)
> > decreasing the number of heat-engine workers really slows down the deployment tasks
>
> Does it, or are you just conjecturing that it would? The workers spend most
> of their time polling stuff that doesn't really happen any faster if there
> are lots of them.

After the number of heat-engine workers was lowered there were no OOM kills and the deploy finished successfully. But in general, from the information I received, the deployment takes twice as long as it used to.

It sounds like the bottom line is that 32GB is not enough RAM for 8 workers at the size of the deployment you're doing (which I'm assuming is quite substantial). If it works with 4 workers then that suggests there's no outright leak, and there are no magic-bullet patches we're missing from upstream. I think most of our testing has been done with 4 workers. There *may* be a happy medium (6?) that achieves the optimum trade-off of speed against memory use, but you'd have to experiment to see. (Alternatively, you could increase the amount of RAM in the undercloud box.)

Amit, Zane,

this has been seen in quite small environments, but also bigger ones: it is happening on installations with 3 controllers and one or two computes (our lab setups), but we saw the same problem with more computes (8-12). Ceph storage was enabled on those installations.

> Server model, Memory, HardDisk model, CPU model + the deployment command

HP c7000 G9 with 128 GB of RAM and 48 CPUs.
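As a concrete reference for the tuning discussed earlier in this thread (capping the worker count at 4 and keeping the per-stack resource check disabled), here is a minimal sketch of how that could be applied on the undercloud. It assumes the standard Heat [DEFAULT] options num_engine_workers and max_resources_per_stack, the crudini tool, and the openstack-heat-engine systemd unit; none of these commands are taken from the bug report itself, so verify the option and unit names against your environment before applying.

```bash
# Hedged sketch: cap the heat-engine worker count and disable the
# per-stack resource check on the undercloud, then restart heat-engine.
# Assumes the [DEFAULT] options num_engine_workers / max_resources_per_stack
# and the openstack-heat-engine unit name; adjust for your environment.
sudo crudini --set /etc/heat/heat.conf DEFAULT num_engine_workers 4
sudo crudini --set /etc/heat/heat.conf DEFAULT max_resources_per_stack -1
sudo systemctl restart openstack-heat-engine
```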
With the composable services and roles coming with OSP10, can we expect the memory usage of the heat workers to be less than what we have now? Is there anything else being worked on in this area to improve the memory usage?
The good news is that during Newton development we made a major change that dramatically improved the memory usage (by not keeping multiple copies of the multi-megabyte input files, but instead sharing a single copy amongst all of the hundreds of stacks in the tree). The bad news is that during Newton memory use by Heat during a TripleO deployment has increased to far more than it was even before this major change: http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png

It's yet to be determined whether that increase is due to an errant change in Heat or the additional complexity of composable roles. We're doing everything we can upstream to bring the numbers down now: https://bugs.launchpad.net/heat/+bug/1626675

Current status is that, despite the large increase in complexity due to composable roles (http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png), memory usage will be slightly *lower* for a basic deployment in Newton than it was in Mitaka: http://people.redhat.com/~shardy/heat/plots/heat_20161014.png

We have more ideas that we hope to push ahead with in Ocata to bring it down even further (and also speed things up). Hopefully these should help offset the increased memory usage of the convergence architecture when we switch that on in TripleO.
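For anyone wanting to collect the kind of per-worker memory data behind the plots referenced above, periodically sampling the RSS of the heat-engine processes during a deployment is usually enough to see the trend. This is only a minimal sketch using ps, not the tooling that produced those plots; the log file name heat_engine_rss.log is just an example.

```bash
# Hedged sketch: sample the RSS (in KiB) of all heat-engine processes once
# a minute and append the samples to a log for later plotting.
while true; do
    date +%s >> heat_engine_rss.log
    ps -C heat-engine -o pid=,rss=,cmd= >> heat_engine_rss.log
    sleep 60
done
```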