Description of problem:

First I want to apologize if I have misclassified the OpenStack component. It might also be a RHEL issue, so feel free to re-route if needed.

RH OSP 17.0.1 has stopped working due to an OOM error on a single controller node, which triggered galera and rabbitmq to error out.

Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jun 28 08:21:20 leaf1-controller-0 kernel: CPU: 3 PID: 1024000 Comm: podman Not tainted 5.14.0-162.22.2.el9_1.x86_64 #1
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Hardware name: LENOVO Lenovo Flex System x240 M5 Compute Node -[9532AC1]-/-[9532AC1]-, BIOS -[C4E144A-3.10]- 04/09/2020
Jun 28 08:21:20 leaf1-controller-0 kernel: Call Trace:
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_stack_lvl+0x34/0x48
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_header+0x4a/0x201
Jun 28 08:21:20 leaf1-controller-0 kernel: oom_kill_process.cold+0xb/0x10
Jun 28 08:21:20 leaf1-controller-0 kernel: out_of_memory+0xed/0x2d0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages_slowpath.constprop.0+0x7cc/0x8a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages+0x1fe/0x230
Jun 28 08:21:20 leaf1-controller-0 kernel: folio_alloc+0x17/0x50
Jun 28 08:21:20 leaf1-controller-0 kernel: __filemap_get_folio+0x1b6/0x330
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: ? do_sync_mmap_readahead+0x14b/0x270
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: filemap_fault+0x454/0x7a0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? next_uptodate_page+0x160/0x1f0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? filemap_map_pages+0x307/0x4a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __xfs_filemap_fault+0x66/0x280 [xfs]
Jun 28 08:21:20 leaf1-controller-0 kernel: __do_fault+0x36/0x110
Jun 28 08:21:20 leaf1-controller-0 kernel: do_read_fault+0xea/0x190
Jun 28 08:21:20 leaf1-controller-0 kernel: do_fault+0x8c/0x2c0
Jun 28 08:21:20 leaf1-controller-0 kernel: __handle_mm_fault+0x3cb/0x750
Jun 28 08:21:20 leaf1-controller-0 kernel: handle_mm_fault+0xc5/0x2a0
Jun 28 08:21:20 leaf1-controller-0 kernel: do_user_addr_fault+0x1bb/0x690
Jun 28 08:21:20 leaf1-controller-0 kernel: exc_page_fault+0x62/0x150
Jun 28 08:21:20 leaf1-controller-0 kernel: asm_exc_page_fault+0x22/0x30
Jun 28 08:21:20 leaf1-controller-0 kernel: RIP: 0033:0x5597f2ea7f91
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Code: Unable to access opcode bytes at RIP 0x5597f2ea7f67.
Jun 28 08:21:20 leaf1-controller-0 kernel: RSP: 002b:000000c000987870 EFLAGS: 00010246

Shortly after, pacemaker started failing:

Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for galera on galera-bundle-2: Timed Out after 30s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for redis on redis-bundle-2: Timed Out after 60s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: notice: High CPU load detected: 921.770020
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for rabbitmq-bundle-2 on leaf1-controller-0: Timed Out after 30s (Remote executor did not respond)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Lost connection to Pacemaker Remote node rabbitmq-bundle-2
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d

and rabbitmq?

/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: show_signal_msg: 13 callbacks suppressed
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: handle[10844]: segfault at 18 ip 00007f09e061aa8b sp 00007f09b1d809c8 error 4
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: in libqpid-proton.so.11.13.0[7f09e05ff000+38000]

I couldn't identify which component caused the underlying OOM failure. Maybe a memory leak? Here are some of the OOM errors:

/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:25:57 leaf1-controller-0 kernel: ironic-inspecto invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:52:40 leaf1-controller-0 kernel: httpd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:07:54 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:22:28 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 10:21:42 leaf1-controller-0 kernel: nova-conductor invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

Version-Release number of selected component (if applicable):
OSP 17.0.1

How reproducible:
Run OSP 17.0.1 over a longer period of time.

Steps to Reproduce:
1. Deploy OSP 17.0.1.
2. Run it for a long period of time.

Actual results:
OOM error causing multiple components to fail.

Expected results:
No error / no failure.

Additional info:
sosreports from all the controllers:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-28-nwztdxl.tar.xz  <-- failing controller
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-28-pskyypk.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj.tar.xz
Hey Chris,

Has this happened multiple times? What does memory usage look like on the other two controllers now? Since we have sosreports from all of them, maybe we could generate another sosreport on all three now and compare the process lists against the first set to see which processes are consuming more memory. We need to narrow it down a little to determine which process is to blame; then we can set the correct component on this BZ and assign it to the relevant team.
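For example, a rough sketch that could highlight per-process RSS growth between two captures (it assumes both archives are extracted locally and contain the usual sos_commands/process/ps_auxwww file; the paths below are placeholders, not the real archive names):

OLD=/path/to/first-sosreport      # placeholder: extracted first sosreport
NEW=/path/to/second-sosreport     # placeholder: extracted later sosreport
for d in "$OLD" "$NEW"; do
  echo "== $d =="
  # Sum RSS (KiB, column 6) per command (column 11) from the captured ps output
  # and print the 15 biggest consumers.
  awk 'NR > 1 {rss[$11] += $6} END {for (c in rss) printf "%12d %s\n", rss[c], c}' \
      "$d/sos_commands/process/ps_auxwww" | sort -rn | head -n 15
done

Comparing the two summaries should show which commands moved the most between captures.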
Hey Brendan,

Thanks for looking into it. I do understand it would make sense to nail down the service. Unfortunately, after recovering yesterday and then running all day, I hit the OOM error on 2 of my controllers and I am not able to get a sosreport from them anymore. I was able to snap a sosreport on the last surviving controller and uploaded it here:

https://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj-day-later.tar.xz

A few more details:
- This happens roughly every 30 days or so, but I only opened a BZ now (even though we've been running this cluster for a few months).
- We've been running long-term OSP clusters since OSP8 on the same hardware, and this is the first one we've noticed running OOM; for example, our OSP16.1 cluster had an uptime of over 1 year.
- I cannot identify myself which service is causing this issue, hence I opened the general BZ. For example:

Wed Jun 28 10:57:26 AM EDT 2023

Top 10 Processes by MEM %
Avail    Active  Total    Percent Avail
36070MB  476MB   36546MB  98.6975

USER      %CPU  %MEM  RSZ         COMMAND
42405     5.2   0.3   391.23 MB   ceilometer-agent-notification:
48        0.0   0.2   291.91 MB   horizon
openvsw+  2.7   0.2   290.969 MB  ovs-vswitchd
48        0.0   0.2   290.551 MB  horizon
48        0.0   0.2   290.195 MB  horizon
48        0.0   0.2   288.613 MB  horizon
48        0.0   0.2   282.766 MB  horizon
48        0.0   0.2   281.914 MB  horizon
48        0.0   0.2   280.18 MB   horizon

== Last Half Hour ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

12:00:01 AM  kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:30:01 AM  4667220 9570436 73866756 56.53 764 1181644 49331892 37.75 294572 30733980 516
10:40:01 AM  4280344 9567440 73849692 56.51 832 1598184 49510440 37.89 464112 30933808 1072
10:50:03 AM  4187912 9500648 73909064 56.56 832 1628548 49520260 37.89 465388 31004332 1164
Average:     2127440 6456156 77308899 59.16 59 626764 50970205 39.00 271005 34052353 1059

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

12:00:01 AM  pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:30:01 AM  19608.95 15629.49 45031.34 32.51 47201.35 4947.52 23.34 9202.13 185.12
10:40:01 AM  575.98 1147.37 32192.17 1.63 24212.09 0.00 1.25 2.46 197.07
10:50:03 AM  30.11 991.22 32418.89 0.11 24803.35 0.00 0.00 0.00 0.00
Average:     12606.28 1093.72 31542.23 26.73 27921.93 5308.18 28066.61 6516.09 19.52

== Current 2 Second Intervals ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

10:57:26 AM  kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:57:28 AM  4099268 9426428 73979236 56.61 832 1644820 49649664 37.99 465932 31114768 2432
10:57:30 AM  4116412 9443568 73962064 56.60 832 1644848 49655652 38.00 465940 31118800 3032
10:57:32 AM  4162188 9488888 73920068 56.57 832 1641200 49494100 37.87 465908 31062028 272
10:57:34 AM  4160564 9487280 73921684 56.57 832 1641208 49497168 37.88 465908 31066836 244
10:57:36 AM  4149744 9476476 73932488 56.58 832 1641224 49498340 37.88 465908 31060616 640
Average:     4137635 9464528 73943108 56.58 832 1642660 49558985 37.92 465919 31084610 1324

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

10:57:36 AM  pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:57:38 AM  0.00 470.00 8952.00 0.00 6844.50 0.00 0.00 0.00 0.00
10:57:40 AM  64.00 238.00 20768.00 0.00 16524.00 0.00 0.00 0.00 0.00
10:57:42 AM  0.00 446.00 19351.50 0.00 5276.50 0.00 0.00 0.00 0.00
10:57:44 AM  0.00 34.50 36172.50 0.00 27441.50 0.00 0.00 0.00 0.00
10:57:46 AM  0.00 64.00 8302.50 0.00 13295.00 0.00 0.00 0.00 0.00
Average:     12.80 250.50 18709.30 0.00 13876.30 0.00 0.00 0.00 0.00

- Today it seems I was in the middle of syncing glance images across multiple glance stores when I suddenly lost the 2 controllers.

Thanks for looking into it.
I was able to capture sosreports from all the controllers after a fresh reboot:

http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-29-wucsepe.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-29-jsthxse.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-29-rqqshfx.tar.xz

Again, 2 of the controllers (controller-0 and controller-1) died earlier today; I was forced to reboot them and ended up rebooting all 3.
Maybe we need to set up a timer to run on a schedule and create a log file for us, so that we can see which processes are using the most memory and whether their usage is increasing over time. Something like this (note: everything is run as root here):

[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.timer
[Unit]
Description=Memory usage service
Wants=memory_usage.timer

[Timer]
OnUnitActiveSec=1h

[Install]
WantedBy=multi-user.target
EOF

[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.service
[Unit]
Description=Memory usage service
Wants=memory_usage.timer

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'ps -eo pid,user,cmd,%%mem | grep -E -v 0.0 >> /memory_usage.log'

[Install]
WantedBy=multi-user.target
EOF

[root@fedora ~]# systemctl daemon-reload
[root@fedora ~]# systemctl start memory_usage.timer

[root@fedora ~]# systemctl list-timers
NEXT                         LEFT         LAST                         PASSED        UNIT                          ACTIVATES
Mon 2023-07-03 04:53:44 UTC  22min left   Mon 2023-07-03 03:48:01 UTC  43min ago     dnf-makecache.timer           dnf-makecache.service
Tue 2023-07-04 00:00:00 UTC  19h left     Mon 2023-07-03 00:25:47 UTC  4h 5min ago   logrotate.timer               logrotate.service
Tue 2023-07-04 00:00:00 UTC  19h left     Mon 2023-07-03 00:25:47 UTC  4h 5min ago   unbound-anchor.timer          unbound-anchor.service
Tue 2023-07-04 02:00:01 UTC  21h left     Mon 2023-07-03 02:00:01 UTC  2h 31min ago  systemd-tmpfiles-clean.timer  systemd-tmpfiles-clean.service
Sun 2023-07-09 01:00:00 UTC  5 days left  Sun 2023-07-02 01:11:17 UTC  1 day 3h ago  raid-check.timer              raid-check.service
Mon 2023-07-10 00:53:48 UTC  6 days left  Mon 2023-07-03 00:25:47 UTC  4h 5min ago   fstrim.timer                  fstrim.service
-                            -            Mon 2023-07-03 04:27:01 UTC  4min 25s ago  memory_usage.timer            memory_usage.service  <<<< New timer

And you can test it by starting the service:

[root@fedora ~]# systemctl start memory_usage.service
[root@fedora ~]# cat /memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.8
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1

And this will show that the timer has been executed now:

[root@fedora ~]# systemctl list-timers
NEXT                         LEFT        LAST                         PASSED     UNIT                 ACTIVATES
Mon 2023-07-03 04:53:44 UTC  20min left  Mon 2023-07-03 03:48:01 UTC  45min ago  dnf-makecache.timer  dnf-makecache.service
Mon 2023-07-03 05:32:11 UTC  58min left  Mon 2023-07-03 04:27:01 UTC  6min ago   memory_usage.timer   memory_usage.service

Which will give us something like this over time:

❯ cat memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1

Hopefully that will help track down the process causing the headaches.
Hi Brendan,

Thanks for looking into it. I have installed the script, but I don't believe it will show us anything valuable. I have attached a screenshot of the top command above, sorted by the %MEM column. It doesn't point to any single service being an issue; at the same time, note the amount of free memory for the entire system. I do see a large number of httpd processes with 0.2 or 0.1 usage. Is it possible that OSP 17.0 doesn't terminate/clean up older httpd sessions?
Hey,

So those aren't individual httpd sessions; they are the processes running the API services. I think most services default to running one API worker per CPU core, so for each service you end up with one httpd process per CPU core, which could definitely consume all of the available memory. This is something we have a note about in all of the service templates:

https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/neutron/neutron-api-container-puppet.yaml#L62-L72

So we can try tuning those to see if that solves the problem. The fact that the nodes run for a while and then OOM suggests something is continuously growing, but having hundreds of httpd processes running certainly doesn't help. Maybe we could try setting each service's *Workers count to something more rational, like 8, and then recheck whether anything stands out over time for memory consumption? A quick way to count the current workers per service account is sketched below.
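Something like this (a rough one-liner, nothing tripleo-specific) counts httpd workers per service account; the UIDs it prints are the container service accounts, so they still need to be mapped back to the owning API (e.g. by looking at the full command line with ps -eo user,cmd):

[root@leaf1-controller-0 ~]# ps -eo user:20,comm | awk '$2 == "httpd" {n[$1]++} END {for (u in n) printf "%5d httpd workers for user %s\n", n[u], u}' | sort -rn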
Hey Chris, just checking in to see whether you've tried adjusting the Workers count for the services, and whether that helped with this problem at all.
Hi Brendan,

Sorry for the delay in responding. I agree that having a fixed number of workers would make more sense. I noticed you pasted an example for neutron. Is there a variable in tripleo that would limit the workers for all the services? If not, is there a list of service worker parameters that I could inject into my config?
There's no single variable, no, but you could start with this list:

https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/environments/low-memory-usage.yaml#L3-L13

Try setting all of them to 4 or 8 instead of 1 and see if you still end up having issues.
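A sketch of what that could look like as an extra environment file (the parameter names below are illustrative; take the authoritative list from the linked low-memory-usage.yaml and drop any that don't apply to your deployment):

[stack@undercloud ~]$ cat << EOF > ~/worker-counts.yaml
parameter_defaults:
  # Illustrative values - verify the parameter names against low-memory-usage.yaml
  CinderWorkers: 8
  GlanceWorkers: 8
  HeatWorkers: 8
  KeystoneWorkers: 8
  NeutronWorkers: 8
  NovaWorkers: 8
  SwiftWorkers: 8
EOF

Then include it in the next overcloud deploy run with an additional -e ~/worker-counts.yaml so the values override the defaults.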
Hi Brendan,

I have applied the suggested configuration and the memory utilization has dropped a little, but not much. I am including before and after screenshots of top sorted by memory usage. Also, the number of processes using at least 0.1% of memory decreased from 64 to 44:

[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
64
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
44

Still, the memory utilization is rather high and it keeps growing. This is a semi-production system; I'd like to see if it will eventually stabilize. We did not see this in OSP 16.x. Is there anything else you would like me to validate/modify? If not, let's keep this BZ open for the next few weeks to see if the memory utilization keeps going up on this system.
Hey,

Yeah, we basically need to narrow this down to a specific process to make sure the right engineers are involved to troubleshoot. I would be interested to know whether that collectd process is growing in memory utilization. Is it consuming more than 1.6g of memory now?
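A quick way to check (nothing OSP-specific; ps reports RSS in KiB, and the second command assumes the memory_usage.log timer from earlier is still collecting):

[root@leaf1-controller-0 ~]# ps -eo pid,rss,etime,cmd | grep '[c]ollectd'
[root@leaf1-controller-0 ~]# grep collectd /memory_usage.log | tail -n 20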
Hi Brendan,

Over the weekend, the number of processes consuming at least 0.1% of memory is back up to 58:

[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
58

The current top output, sorted by memory:

top - 12:34:06 up 17 days, 20:53, 1 user, load average: 6.37, 6.74, 6.75
Tasks: 1181 total, 4 running, 1176 sleeping, 0 stopped, 1 zombie
%Cpu(s): 18.5 us, 5.7 sy, 0.0 ni, 74.3 id, 0.0 wa, 0.6 hi, 0.8 si, 0.0 st
MiB Mem : 127617.0 total, 34479.8 free, 74696.6 used, 18440.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 51775.4 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40123 42409 20 0 9244444 6.7g 11224 S 0.0 5.3 12:12.72 collectd-sensub
15654 42434 20 0 17.7g 1.2g 158776 S 14.3 1.0 1018:37 mariadbd
20841 42439 20 0 6923708 377184 70200 S 14.3 0.3 1679:49 beam.smp
1516 openvsw+ 10 -10 3081476 275716 31112 S 2.9 0.2 688:50.31 ovs-vswitchd
891676 42465 20 0 946976 211728 12680 S 1.6 0.2 80:59.90 qdrouterd
141370 42435 20 0 230876 211324 6556 S 0.0 0.2 111:47.92 neutron-server:
141388 42435 20 0 230364 210888 6556 S 0.6 0.2 95:56.37 neutron-server:
141376 42435 20 0 229596 210228 6556 S 4.5 0.2 97:38.59 neutron-server:
141403 42435 20 0 229596 210108 6556 S 0.0 0.2 104:12.78 neutron-server:
141412 42435 20 0 229596 210056 6568 S 1.9 0.2 102:00.11 neutron-server:
141381 42435 20 0 229340 209972 6556 R 16.6 0.2 100:57.41 neutron-server:
141369 42435 20 0 229340 209920 6556 S 0.3 0.2 102:54.30 neutron-server:
141383 42435 20 0 229340 209824 6556 S 0.0 0.2 98:36.04 neutron-server:
141489 42435 20 0 227296 207196 5824 S 0.0 0.2 4:12.89 neutron-server:
141460 42435 20 0 225504 205552 5972 S 0.0 0.2 4:36.00 neutron-server:
141464 42435 20 0 224736 204784 5972 S 0.0 0.2 4:34.45 neutron-server:
141453 42435 20 0 224992 204764 5928 S 0.0 0.2 4:34.07 neutron-server:
141441 42435 20 0 222688 202736 5972 S 0.0 0.2 4:48.89 neutron-server:
141472 42435 20 0 222688 202736 5972 S 0.0 0.2 4:36.39 neutron-server:
141447 42435 20 0 222688 202660 5896 S 0.0 0.2 4:42.43 neutron-server:
141442 42435 20 0 222432 202440 5972 S 0.0 0.2 4:36.07 neutron-server:
141474 42435 20 0 222432 202268 5972 S 0.0 0.2 4:34.20 neutron-server:
958 root 20 0 323692 190428 186672 S 1.6 0.1 220:42.12 systemd-journal
889154 42457 20 0 2649284 188480 4316 S 0.0 0.1 15:25.92 memcached
141495 42435 20 0 208348 187532 5264 S 0.0 0.1 1:18.99 neutron-server:
134148 42435 20 0 196832 187276 16520 S 0.3 0.1 2:15.13 /usr/bin/python
3127 root rt 0 577208 183916 67572 S 0.3 0.1 243:03.92 corosync
141476 42435 20 0 199392 178760 5412 S 0.0 0.1 0:42.36 neutron-server:
48358 48 20 0 858248 168844 14092 S 0.0 0.1 0:42.15 httpd
48349 48 20 0 856456 167924 14088 S 0.0 0.1 0:34.39 httpd
122474 42407 20 0 733276 162280 31420 S 0.0 0.1 11:18.16 httpd
122477 42407 20 0 732764 162228 31420 S 0.0 0.1 11:22.72 httpd
122476 42407 20 0 732508 161708 31420 S 0.0 0.1 10:59.91 httpd
122479 42407 20 0 585556 161036 31420 S 0.0 0.1 11:14.18 httpd
122478 42407 20 0 658776 161016 31420 S 0.0 0.1 10:47.51 httpd
122475 42407 20 0 585812 160936 31420 S 0.0 0.1 10:41.43 httpd
122473 42407 20 0 733020 160568 31420 S 0.3 0.1 11:07.19 httpd
48363 48 20 0 928652 158852 14092 S 0.0 0.1 0:36.28 httpd
48369 48 20 0 853384 158748 14148 S 0.0 0.1 0:32.64 httpd
122472 42407 20 0 511056 158356 31420 S 0.0 0.1 10:46.30 httpd
48362 48 20 0 856712 158112 14092 S 0.0 0.1 0:34.74 httpd
48367 48 20 0 929932 157652 14092 S 0.0 0.1 0:32.73 httpd
48360 48 20 0 856968 157088 14092 S 0.0 0.1 0:39.56 httpd
48359 48 20 0 928140 156104 14084 S 0.0 0.1 0:33.38 httpd
171815 42436 20 0 277992 150184 16276 S 0.0 0.1 33:42.93 httpd
48364 48 20 0 928140 149796 14092 S 0.0 0.1 0:31.35 httpd
171818 42436 20 0 277992 149464 16276 S 0.0 0.1 33:55.05 httpd
171813 42436 20 0 282860 148868 16276 S 0.0 0.1 33:33.54 httpd
171816 42436 20 0 278760 148744 16276 S 3.9 0.1 33:39.95 httpd
171811 42436 20 0 282860 148700 16276 S 0.6 0.1 32:53.68 httpd
171812 42436 20 0 277736 148544 16276 S 0.0 0.1 33:17.54 httpd
171817 42436 20 0 277480 147816 16276 S 0.0 0.1 33:18.56 httpd
171814 42436 20 0 277480 147532 16276 S 0.3 0.1 33:21.68 httpd
235715 42405 20 0 5035728 146792 8900 S 2.9 0.1 268:59.37 ceilometer-agen
48368 48 20 0 855688 146744 14092 S 0.0 0.1 0:32.49 httpd
48356 48 20 0 631164 145208 14092 S 0.0 0.1 0:29.84 httpd
48366 48 20 0 854152 142424 14092 S 0.0 0.1 0:29.60 httpd
124073 42407 20 0 286732 133108 30964 S 0.3 0.1 7:33.50 cinder-schedule
185381 42435 20 0 139412 128512 16216 S 1.0 0.1 55:53.26 neutron-dhcp-ag
358600 root 20 0 280100 126900 31412 S 1.3 0.1 42:58.64 cinder-volume
611254 42437 20 0 342592 125272 12024 S 0.0 0.1 6:39.86 octavia-health-
171200 42436 20 0 253156 124652 15848 S 0.0 0.1 1:07.22 httpd
171196 42436 20 0 253156 124364 15824 S 0.0 0.1 1:06.76 httpd
171193 42436 20 0 253156 124300 15824 S 0.0 0.1 1:07.15 httpd
171199 42436 20 0 253156 124240 15824 S 0.0 0.1 1:07.00 httpd
360748 root 20 0 488008 124060 17428 S 0.0 0.1 10:55.18 cinder-volume

So I have lost 17GB of available memory over the weekend. collectd is higher in its usage, but definitely not by 17GB. It's been only 17 days since my last reboot:

[root@leaf1-controller-0 ~]# uptime
 12:37:59 up 17 days, 20:57, 1 user, load average: 6.23, 6.63, 6.72

It doesn't look like limiting the number of workers did the trick.
Given that collectd is the only thing that has really grown considerably, let's start by asking the cloudops team if they see any issue with that level of memory consumption and growth.
What collectd modules / plugins are installed / running?
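If it helps, something along these lines should list them on a controller; the path is an assumption based on where tripleo normally drops the puppet-generated collectd config, so adjust it if your deployment differs:

[root@leaf1-controller-0 ~]# grep -rhE '^\s*LoadPlugin' /var/lib/config-data/puppet-generated/collectd/etc/ 2>/dev/null | sort -u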
collectd grows memory if it cannot send data. More background: http://matthias-runge.de/2021/03/22/collectd-memory-usage/
Also: collectd should be limited to 512 MB by tripleo, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=2007255 or related BZs. I am curious how this was configured in this case.
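One rough way to check from the controller (this assumes the container is simply named collectd; podman reports the limit in bytes, with 0 meaning no limit):

[root@leaf1-controller-0 ~]# podman inspect collectd --format '{{ .HostConfig.Memory }}'
[root@leaf1-controller-0 ~]# podman stats --no-stream collectd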
Chris, can you please comment or check regarding comment 19?