Description of problem:

We have an OSPd 8 installation, and on our undercloud we see that heat memory consumption keeps growing during and after overcloud deployment, until eventually the OOM killer jumps in:

Aug 26 00:28:21 undercloud kernel: ovs-vswitchd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
Aug 26 00:28:21 undercloud kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Aug 26 00:28:21 undercloud kernel: [ 400] 0 400 66569 289 133 862 0 systemd-journal
Aug 26 00:28:21 undercloud kernel: [ 420] 0 420 30234 218 27 849 0 lvmetad
Aug 26 00:28:21 undercloud kernel: [ 434] 0 434 11468 279 24 453 -1000 systemd-udevd
Aug 26 00:28:21 undercloud kernel: [ 611] 0 611 4826 214 13 39 0 irqbalance
Aug 26 00:28:21 undercloud kernel: [ 612] 0 612 142331 378 151 129 0 rsyslogd
Aug 26 00:28:21 undercloud kernel: [ 621] 81 621 8739 280 19 53 -900 dbus-daemon
Aug 26 00:28:21 undercloud kernel: [ 627] 0 627 50842 162 41 115 0 gssproxy
Aug 26 00:28:21 undercloud kernel: [ 642] 0 642 6628 245 17 38 0 systemd-logind
Aug 26 00:28:21 undercloud kernel: [ 643] 38 643 7352 317 18 114 0 ntpd
Aug 26 00:28:21 undercloud kernel: [ 659] 0 659 108661 416 64 784 0 NetworkManager
Aug 26 00:28:21 undercloud kernel: [ 669] 0 669 28812 111 12 24 0 ksmtuned
Aug 26 00:28:21 undercloud kernel: [ 734] 0 734 13266 378 29 148 0 wpa_supplicant
Aug 26 00:28:21 undercloud kernel: [ 735] 997 735 131869 225 53 1343 0 polkitd
Aug 26 00:28:21 undercloud kernel: [ 777] 0 777 11372 30 24 130 0 monitor
Aug 26 00:28:21 undercloud kernel: [ 778] 0 778 11474 411 25 139 0 ovsdb-server
Aug 26 00:28:21 undercloud kernel: [ 796] 0 796 12288 105 26 96 0 monitor
Aug 26 00:28:21 undercloud kernel: [ 797] 0 797 178407 21215 75 0 0 ovs-vswitchd
Aug 26 00:28:21 undercloud kernel: [ 1192] 0 1192 28335 85 12 40 0 rhsmcertd
Aug 26 00:28:21 undercloud kernel: [ 1193] 0 1193 20640 517 43 193 -1000 sshd
Aug 26 00:28:21 undercloud kernel: [ 1202] 0 1202 7326 168 20 66 0 xinetd
Aug 26 00:28:21 undercloud kernel: [ 1207] 0 1207 138263 700 89 2537 0 tuned
Aug 26 00:28:21 undercloud kernel: [ 1222] 990 1222 865655 33892 479 20037 0 beam.smp
Aug 26 00:28:21 undercloud kernel: [ 1230] 989 1230 153799 29 29 1529 0 memcached
Aug 26 00:28:21 undercloud kernel: [ 1280] 0 1280 38388 499 79 343 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1290] 0 1290 154148 467 143 1418 0 libvirtd
Aug 26 00:28:21 undercloud kernel: [ 1339] 0 1339 26978 100 7 39 0 rhnsd
Aug 26 00:28:21 undercloud kernel: [ 1351] 0 1351 31584 231 19 130 0 crond
Aug 26 00:28:21 undercloud kernel: [ 1358] 0 1358 6491 183 17 53 0 atd
Aug 26 00:28:21 undercloud kernel: [ 1386] 0 1386 27509 170 10 30 0 agetty
Aug 26 00:28:21 undercloud kernel: [ 1388] 0 1388 27509 165 11 31 0 agetty
Aug 26 00:28:21 undercloud kernel: [ 1404] 990 1404 8263 64 20 54 0 epmd
Aug 26 00:28:21 undercloud kernel: [ 1418] 163 1418 292261 25545 303 45983 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1421] 163 1421 259455 22228 210 3486 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1532] 27 1532 28313 166 12 72 0 mysqld_safe
.............................
Aug 26 00:28:21 undercloud kernel: [28168] 187 28168 97675 4049 146 12867 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28189] 187 28189 445334 360806 819 2452 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28192] 187 28192 449550 233457 828 134012 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28194] 187 28194 448943 323644 826 43154 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28195] 187 28195 452189 346568 833 23533 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28196] 187 28196 448395 344039 825 22227 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [ 1914] 187 1914 448425 363457 824 2865 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [11074] 187 11074 443302 359134 814 1996 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [23323] 187 23323 349344 266187 631 1011 0 heat-engine
Aug 26 00:28:21 undercloud kernel: Out of memory: Kill process 28195 (heat-engine) score 72 or sacrifice child
Aug 26 00:28:21 undercloud kernel: Killed process 28195 (heat-engine) total-vm:1808756kB, anon-rss:1384484kB, file-rss:1788kB

This is how it looks 2 hours after a restart:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4168 heat 20 0 1757616 1.464g 107872 S 0.6 4.7 14:26.33 heat-engine
4169 heat 20 0 1690224 1.301g 3884 S 1.3 4.2 11:04.26 heat-engine
4165 heat 20 0 1563832 1.180g 3884 S 0.6 3.8 12:04.02 heat-engine
4164 heat 20 0 1457280 1.079g 3884 S 1.3 3.5 12:50.39 heat-engine
4171 heat 20 0 1396908 1.021g 3884 S 0.6 3.3 14:38.31 heat-engine
3048 mysql 20 0 4849508 748324 10264 S 6.5 2.3 64:13.01 mysqld
...

Some time later:

$ grep engine ps
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
heat 3743 0.5 0.2 390708 72516 ? Ss 01:26 2:04 /usr/bin/python /usr/bin/heat-engine
heat 3811 7.6 5.8 2149252 1927140 ? S 01:26 26:51 /usr/bin/python /usr/bin/heat-engine
heat 3812 7.3 5.5 2134196 1807752 ? S 01:26 25:34 /usr/bin/python /usr/bin/heat-engine
heat 3813 7.7 5.5 2134956 1808648 ? S 01:26 26:58 /usr/bin/python /usr/bin/heat-engine
heat 3814 8.0 5.9 2288268 1962136 ? S 01:26 28:06 /usr/bin/python /usr/bin/heat-engine
heat 3815 11.5 5.5 2133836 1807456 ? S 01:26 40:24 /usr/bin/python /usr/bin/heat-engine
heat 3816 7.2 5.5 2141060 1814560 ? S 01:26 25:29 /usr/bin/python /usr/bin/heat-engine
heat 3817 8.6 6.6 2503884 2177768 ? S 01:26 30:06 /usr/bin/python /usr/bin/heat-engine
heat 3818 6.9 5.5 2133332 1806680 ? S 01:26 24:09 /usr/bin/python /usr/bin/heat-engine

Version-Release number of selected component (if applicable):
* openstack-heat-common-5.0.1-6.el7ost.noarch

Additional info:
Likely related to https://bugs.launchpad.net/heat/+bug/1570974; can this be backported to OSP8?
It's possible that limiting the number of workers to 4 will help, and that is indeed the first thing to try. Also make sure that the max_resources_per_stack option is set to -1: enabling that check not only makes everything slow but probably contributes to memory fragmentation. It's probably unlikely that the patches from https://bugs.launchpad.net/heat/+bug/1570974 will help much - they could be backported but we haven't bothered up to now because when Steve tested it he reported that the memory usage seemed only "about the same or slightly reduced".
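For reference, a minimal sketch of how those two settings could be applied on the undercloud. This assumes the worker-count option is num_engine_workers in the [DEFAULT] section of heat.conf and that crudini is available; adjust for your environment:

$ sudo crudini --set /etc/heat/heat.conf DEFAULT num_engine_workers 4
$ sudo crudini --set /etc/heat/heat.conf DEFAULT max_resources_per_stack -1
$ sudo systemctl restart openstack-heat-engine

After the restart, check that the number of heat-engine processes matches the configured worker count (plus the parent process).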
Thanks Zane, I was out for the last 2 weeks. I already checked max_resources_per_stack and it is set to -1:

$ grep max_resources_per_stack heat.conf
#max_resources_per_stack = 1000
max_resources_per_stack = -1

What information should be collected to further troubleshoot this issue, since decreasing the number of heat-engine workers really slows down the deployment tasks?
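In the meantime, one way to collect useful data (a rough sketch only; the log path and the 5-minute interval are arbitrary placeholders) is to sample the memory of the heat-engine processes during a deployment, so that the growth can be correlated with deployment phases:

# sample heat-engine PID, RSS and VSZ (in KiB) every 5 minutes
while true; do
    date >> /tmp/heat-engine-mem.log
    ps -o pid,rss,vsz,etime,args -p "$(pgrep -d, -f heat-engine)" >> /tmp/heat-engine-mem.log
    sleep 300
done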
Hi Martin, I am trying to recreate the environment locally, but so far I have had no resource issues. Can I get the following, please: server model, hard disk model, CPU model, and also the deployment command (how many nodes of each type is heat trying to deploy)?
(In reply to Martin Schuppert from comment #3)
> decreasing the number of heat-engine workers really slows down the deployment tasks

Does it, or are you just conjecturing that it would? The workers spend most of their time polling stuff that doesn't really happen any faster if there are lots of them.
(In reply to Zane Bitter from comment #5)
> (In reply to Martin Schuppert from comment #3)
> > decreasing the number of heat-engine workers really slows down the deployment tasks
>
> Does it, or are you just conjecturing that it would? The workers spend most
> of their time polling stuff that doesn't really happen any faster if there
> are lots of them.

After the number of heat-engine workers was lowered, there were no OOM kills and the deployment finished successfully. However, from the information I received, the deployment now takes twice as long as it used to.
It sounds like the bottom line is that 32GB is not enough RAM for 8 workers at the size of the deployment you're doing (which I'm assuming is quite substantial). If it works with 4 workers then that suggests there's no outright leak, and there are no magic-bullet patches we're missing from upstream. I think most of our testing has been done with 4 workers. There *may* be a happy medium (6?) that achieves the optimum trade-off of speed against memory use, but you'd have to experiment to see. (Alternatively, you could increase the amount of RAM in the undercloud box.)
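If you do experiment with different worker counts, a quick way to compare runs (just a sketch using pgrep/awk; any equivalent works) is to sum the resident set size of all heat-engine processes at the peak of the deployment:

$ ps -o rss= -p "$(pgrep -d, -f heat-engine)" | awk '{ sum += $1 } END { printf "heat-engine total RSS: %.1f GiB\n", sum / 1024 / 1024 }'

Comparing that peak figure at 4, 6 and 8 workers should show whether a middle setting fits in the available RAM.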
Amit, Zane, this has been seen in quite small environments, but also in bigger ones: it is happening on installations with 3 controllers and one or two computes (our lab setups), but we saw the same problem with more computes (8-12). Ceph storage was enabled on those installations.

> Server model, Memory, HardDisk model, CPU model
> * the deployment command

HP c7000 G9 with 128GB of RAM and 48 CPUs.

With the composable services and roles coming with OSP10, can we expect the memory usage of the heat workers to be lower than it is now? Is there anything else being worked on in this area to improve the memory usage?
The good news is that during Newton development we made a major change that dramatically improved memory usage (by not keeping multiple copies of the multi-megabyte input files, but instead sharing a single copy amongst all of the hundreds of stacks in the tree).

The bad news is that, over the same Newton cycle, Heat's memory use during a TripleO deployment has increased to far more than it was even before this change:

http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png

It's yet to be determined whether that increase is due to an errant change in Heat or to the additional complexity of composable roles. We're doing everything we can upstream to bring the numbers down now: https://bugs.launchpad.net/heat/+bug/1626675
Current status is that, despite the large increase in complexity due to composable roles:

http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png

memory usage will be slightly *lower* for a basic deployment in Newton than it was in Mitaka:

http://people.redhat.com/~shardy/heat/plots/heat_20161014.png

We have more ideas that we hope to push ahead with in Ocata to bring it down even further (and also speed things up). Hopefully these will help offset the increased memory usage of the convergence architecture when we switch it on in TripleO.