Description of problem:

We have an OSPd 8 installation, and on our undercloud we see that heat memory consumption keeps growing during and after overcloud deployment, until eventually the OOM killer jumps in:

Aug 26 00:28:21 undercloud kernel: ovs-vswitchd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
Aug 26 00:28:21 undercloud kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Aug 26 00:28:21 undercloud kernel: [ 400] 0 400 66569 289 133 862 0 systemd-journal
Aug 26 00:28:21 undercloud kernel: [ 420] 0 420 30234 218 27 849 0 lvmetad
Aug 26 00:28:21 undercloud kernel: [ 434] 0 434 11468 279 24 453 -1000 systemd-udevd
Aug 26 00:28:21 undercloud kernel: [ 611] 0 611 4826 214 13 39 0 irqbalance
Aug 26 00:28:21 undercloud kernel: [ 612] 0 612 142331 378 151 129 0 rsyslogd
Aug 26 00:28:21 undercloud kernel: [ 621] 81 621 8739 280 19 53 -900 dbus-daemon
Aug 26 00:28:21 undercloud kernel: [ 627] 0 627 50842 162 41 115 0 gssproxy
Aug 26 00:28:21 undercloud kernel: [ 642] 0 642 6628 245 17 38 0 systemd-logind
Aug 26 00:28:21 undercloud kernel: [ 643] 38 643 7352 317 18 114 0 ntpd
Aug 26 00:28:21 undercloud kernel: [ 659] 0 659 108661 416 64 784 0 NetworkManager
Aug 26 00:28:21 undercloud kernel: [ 669] 0 669 28812 111 12 24 0 ksmtuned
Aug 26 00:28:21 undercloud kernel: [ 734] 0 734 13266 378 29 148 0 wpa_supplicant
Aug 26 00:28:21 undercloud kernel: [ 735] 997 735 131869 225 53 1343 0 polkitd
Aug 26 00:28:21 undercloud kernel: [ 777] 0 777 11372 30 24 130 0 monitor
Aug 26 00:28:21 undercloud kernel: [ 778] 0 778 11474 411 25 139 0 ovsdb-server
Aug 26 00:28:21 undercloud kernel: [ 796] 0 796 12288 105 26 96 0 monitor
Aug 26 00:28:21 undercloud kernel: [ 797] 0 797 178407 21215 75 0 0 ovs-vswitchd
Aug 26 00:28:21 undercloud kernel: [ 1192] 0 1192 28335 85 12 40 0 rhsmcertd
Aug 26 00:28:21 undercloud kernel: [ 1193] 0 1193 20640 517 43 193 -1000 sshd
Aug 26 00:28:21 undercloud kernel: [ 1202] 0 1202 7326 168 20 66 0 xinetd
Aug 26 00:28:21 undercloud kernel: [ 1207] 0 1207 138263 700 89 2537 0 tuned
Aug 26 00:28:21 undercloud kernel: [ 1222] 990 1222 865655 33892 479 20037 0 beam.smp
Aug 26 00:28:21 undercloud kernel: [ 1230] 989 1230 153799 29 29 1529 0 memcached
Aug 26 00:28:21 undercloud kernel: [ 1280] 0 1280 38388 499 79 343 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1290] 0 1290 154148 467 143 1418 0 libvirtd
Aug 26 00:28:21 undercloud kernel: [ 1339] 0 1339 26978 100 7 39 0 rhnsd
Aug 26 00:28:21 undercloud kernel: [ 1351] 0 1351 31584 231 19 130 0 crond
Aug 26 00:28:21 undercloud kernel: [ 1358] 0 1358 6491 183 17 53 0 atd
Aug 26 00:28:21 undercloud kernel: [ 1386] 0 1386 27509 170 10 30 0 agetty
Aug 26 00:28:21 undercloud kernel: [ 1388] 0 1388 27509 165 11 31 0 agetty
Aug 26 00:28:21 undercloud kernel: [ 1404] 990 1404 8263 64 20 54 0 epmd
Aug 26 00:28:21 undercloud kernel: [ 1418] 163 1418 292261 25545 303 45983 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1421] 163 1421 259455 22228 210 3486 0 httpd
Aug 26 00:28:21 undercloud kernel: [ 1532] 27 1532 28313 166 12 72 0 mysqld_safe
.............................
Aug 26 00:28:21 undercloud kernel: [28168] 187 28168 97675 4049 146 12867 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28189] 187 28189 445334 360806 819 2452 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28192] 187 28192 449550 233457 828 134012 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28194] 187 28194 448943 323644 826 43154 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28195] 187 28195 452189 346568 833 23533 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [28196] 187 28196 448395 344039 825 22227 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [ 1914] 187 1914 448425 363457 824 2865 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [11074] 187 11074 443302 359134 814 1996 0 heat-engine
Aug 26 00:28:21 undercloud kernel: [23323] 187 23323 349344 266187 631 1011 0 heat-engine
Aug 26 00:28:21 undercloud kernel: Out of memory: Kill process 28195 (heat-engine) score 72 or sacrifice child
Aug 26 00:28:21 undercloud kernel: Killed process 28195 (heat-engine) total-vm:1808756kB, anon-rss:1384484kB, file-rss:1788kB

This is how it looks 2 hours after a restart:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4168 heat 20 0 1757616 1.464g 107872 S 0.6 4.7 14:26.33 heat-engine
4169 heat 20 0 1690224 1.301g 3884 S 1.3 4.2 11:04.26 heat-engine
4165 heat 20 0 1563832 1.180g 3884 S 0.6 3.8 12:04.02 heat-engine
4164 heat 20 0 1457280 1.079g 3884 S 1.3 3.5 12:50.39 heat-engine
4171 heat 20 0 1396908 1.021g 3884 S 0.6 3.3 14:38.31 heat-engine
3048 mysql 20 0 4849508 748324 10264 S 6.5 2.3 64:13.01 mysqld
...

Some time later:

$ grep engine ps
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
heat 3743 0.5 0.2 390708 72516 ? Ss 01:26 2:04 /usr/bin/python /usr/bin/heat-engine
heat 3811 7.6 5.8 2149252 1927140 ? S 01:26 26:51 /usr/bin/python /usr/bin/heat-engine
heat 3812 7.3 5.5 2134196 1807752 ? S 01:26 25:34 /usr/bin/python /usr/bin/heat-engine
heat 3813 7.7 5.5 2134956 1808648 ? S 01:26 26:58 /usr/bin/python /usr/bin/heat-engine
heat 3814 8.0 5.9 2288268 1962136 ? S 01:26 28:06 /usr/bin/python /usr/bin/heat-engine
heat 3815 11.5 5.5 2133836 1807456 ? S 01:26 40:24 /usr/bin/python /usr/bin/heat-engine
heat 3816 7.2 5.5 2141060 1814560 ? S 01:26 25:29 /usr/bin/python /usr/bin/heat-engine
heat 3817 8.6 6.6 2503884 2177768 ? S 01:26 30:06 /usr/bin/python /usr/bin/heat-engine
heat 3818 6.9 5.5 2133332 1806680 ? S 01:26 24:09 /usr/bin/python /usr/bin/heat-engine

Version-Release number of selected component (if applicable):
* openstack-heat-common-5.0.1-6.el7ost.noarch

Additional info:
Likely related to https://bugs.launchpad.net/heat/+bug/1570974; can this be backported to OSP8?
It's possible that limiting the number of workers to 4 will help, and that is indeed the first thing to try. Also make sure that the max_resources_per_stack option is set to -1: enabling that check not only makes everything slow but probably contributes to memory fragmentation. It's probably unlikely that the patches from https://bugs.launchpad.net/heat/+bug/1570974 will help much - they could be backported but we haven't bothered up to now because when Steve tested it he reported that the memory usage seemed only "about the same or slightly reduced".
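For reference, a minimal sketch of how those two settings could be applied on the undercloud. This assumes the worker-count option is num_engine_workers in the [DEFAULT] section of heat.conf and that crudini is available; adjust for your environment:

$ sudo crudini --set /etc/heat/heat.conf DEFAULT num_engine_workers 4
$ sudo crudini --set /etc/heat/heat.conf DEFAULT max_resources_per_stack -1
$ sudo systemctl restart openstack-heat-engine

After the restart, check that the number of heat-engine processes matches the configured worker count (plus the parent process).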
Thanks Zane, I was out for the last 2 weeks. I already checked max_resources_per_stack and it is set to -1:

$ grep max_resources_per_stack heat.conf
#max_resources_per_stack = 1000
max_resources_per_stack = -1

What information should be collected to further troubleshoot this issue, since decreasing the number of heat-engine workers really slows down the deployment tasks?
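In the meantime, one way to collect useful data (a rough sketch only; the log path and the 5-minute interval are arbitrary placeholders) is to sample the memory of the heat-engine processes during a deployment, so that the growth can be correlated with deployment phases:

# sample heat-engine PID, RSS and VSZ (in KiB) every 5 minutes
while true; do
    date >> /tmp/heat-engine-mem.log
    ps -o pid,rss,vsz,etime,args -p "$(pgrep -d, -f heat-engine)" >> /tmp/heat-engine-mem.log
    sleep 300
done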
Hi Martin, I am trying to recreate the environment locally, but so far I have had no resource issues. Can I get the following, please: server model, hard disk model, CPU model, and also the deployment command (how many nodes of each type is heat trying to deploy)?
(In reply to Martin Schuppert from comment #3)
> decreasing the number of heat-engine workers really slows down the deployment tasks

Does it, or are you just conjecturing that it would? The workers spend most of their time polling stuff that doesn't really happen any faster if there are lots of them.
(In reply to Zane Bitter from comment #5)
> (In reply to Martin Schuppert from comment #3)
> > decreasing the number of heat-engine workers really slows down the deployment tasks
>
> Does it, or are you just conjecturing that it would? The workers spend most
> of their time polling stuff that doesn't really happen any faster if there
> are lots of them.

After the number of heat-engine workers was lowered, there were no OOM kills and the deployment finished successfully. However, from the information I received, the deployment now takes twice as long as it used to.
It sounds like the bottom line is that 32GB is not enough RAM for 8 workers at the size of the deployment you're doing (which I'm assuming is quite substantial). If it works with 4 workers then that suggests there's no outright leak, and there are no magic-bullet patches we're missing from upstream. I think most of our testing has been done with 4 workers. There *may* be a happy medium (6?) that achieves the optimum trade-off of speed against memory use, but you'd have to experiment to see. (Alternatively, you could increase the amount of RAM in the undercloud box.)
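If you do experiment with different worker counts, a quick way to compare runs (just a sketch using pgrep/awk; any equivalent works) is to sum the resident set size of all heat-engine processes at the peak of the deployment:

$ ps -o rss= -p "$(pgrep -d, -f heat-engine)" | awk '{ sum += $1 } END { printf "heat-engine total RSS: %.1f GiB\n", sum / 1024 / 1024 }'

Comparing that peak figure at 4, 6 and 8 workers should show whether a middle setting fits in the available RAM.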
Amit, Zane, this has been seen in quite small environments, but also in bigger ones: it is happening on installations with 3 controllers and one or two computes (our lab setups), but we saw the same problem with more computes (8-12). Ceph storage was enabled on those installations.

> Server model, Memory, HardDisk model, CPU model
> * the deployment command

HP c7000 G9 with 128GB of RAM and 48 CPUs.

With the composable services and roles coming with OSP10, can we expect the memory usage of the heat workers to be lower than it is now? Is there anything else being worked on in this area to improve the memory usage?
The good news is that during Newton development we made a major change that dramatically improved memory usage (by not keeping multiple copies of the multi-megabyte input files, but instead sharing a single copy amongst all of the hundreds of stacks in the tree).

The bad news is that, over the same Newton cycle, Heat's memory use during a TripleO deployment has increased to far more than it was even before this change:

http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png

It's yet to be determined whether that increase is due to an errant change in Heat or to the additional complexity of composable roles. We're doing everything we can upstream to bring the numbers down now: https://bugs.launchpad.net/heat/+bug/1626675
Current status is that, despite the large increase in complexity due to composable roles:

http://people.redhat.com/~shardy/heat/plots/heat_before_after_end_newton.png

memory usage will be slightly *lower* for a basic deployment in Newton than it was in Mitaka:

http://people.redhat.com/~shardy/heat/plots/heat_20161014.png

We have more ideas that we hope to push ahead with in Ocata to bring it down even further (and also speed things up). Hopefully these will help offset the increased memory usage of the convergence architecture when we switch it on in TripleO.