Description of problem:

First I want to apologize if I have misclassified the OpenStack component. It might also be a RHEL issue, so feel free to re-route if needed.

RH OSP 17.0.1 has stopped working due to an OOM error on a single controller node, which triggered galera and rabbitmq to error out.

Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jun 28 08:21:20 leaf1-controller-0 kernel: CPU: 3 PID: 1024000 Comm: podman Not tainted 5.14.0-162.22.2.el9_1.x86_64 #1
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Hardware name: LENOVO Lenovo Flex System x240 M5 Compute Node -[9532AC1]-/-[9532AC1]-, BIOS -[C4E144A-3.10]- 04/09/2020
Jun 28 08:21:20 leaf1-controller-0 kernel: Call Trace:
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_stack_lvl+0x34/0x48
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_header+0x4a/0x201
Jun 28 08:21:20 leaf1-controller-0 kernel: oom_kill_process.cold+0xb/0x10
Jun 28 08:21:20 leaf1-controller-0 kernel: out_of_memory+0xed/0x2d0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages_slowpath.constprop.0+0x7cc/0x8a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages+0x1fe/0x230
Jun 28 08:21:20 leaf1-controller-0 kernel: folio_alloc+0x17/0x50
Jun 28 08:21:20 leaf1-controller-0 kernel: __filemap_get_folio+0x1b6/0x330
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: ? do_sync_mmap_readahead+0x14b/0x270
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: filemap_fault+0x454/0x7a0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? next_uptodate_page+0x160/0x1f0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? filemap_map_pages+0x307/0x4a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __xfs_filemap_fault+0x66/0x280 [xfs]
Jun 28 08:21:20 leaf1-controller-0 kernel: __do_fault+0x36/0x110
Jun 28 08:21:20 leaf1-controller-0 kernel: do_read_fault+0xea/0x190
Jun 28 08:21:20 leaf1-controller-0 kernel: do_fault+0x8c/0x2c0
Jun 28 08:21:20 leaf1-controller-0 kernel: __handle_mm_fault+0x3cb/0x750
Jun 28 08:21:20 leaf1-controller-0 kernel: handle_mm_fault+0xc5/0x2a0
Jun 28 08:21:20 leaf1-controller-0 kernel: do_user_addr_fault+0x1bb/0x690
Jun 28 08:21:20 leaf1-controller-0 kernel: exc_page_fault+0x62/0x150
Jun 28 08:21:20 leaf1-controller-0 kernel: asm_exc_page_fault+0x22/0x30
Jun 28 08:21:20 leaf1-controller-0 kernel: RIP: 0033:0x5597f2ea7f91
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Code: Unable to access opcode bytes at RIP 0x5597f2ea7f67.
Jun 28 08:21:20 leaf1-controller-0 kernel: RSP: 002b:000000c000987870 EFLAGS: 00010246

Shortly after, pacemaker started failing:

Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for galera on galera-bundle-2: Timed Out after 30s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for redis on redis-bundle-2: Timed Out after 60s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: notice: High CPU load detected: 921.770020
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for rabbitmq-bundle-2 on leaf1-controller-0: Timed Out after 30s (Remote executor did not respond)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Lost connection to Pacemaker Remote node rabbitmq-bundle-2
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d

and rabbitmq?

/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: show_signal_msg: 13 callbacks suppressed
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: handle[10844]: segfault at 18 ip 00007f09e061aa8b sp 00007f09b1d809c8 error 4
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: in libqpid-proton.so.11.13.0[7f09e05ff000+38000]

I couldn't identify which component caused the underlying OOM failure. Maybe a memory leak? Here are some of the OOM errors:

/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:25:57 leaf1-controller-0 kernel: ironic-inspecto invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:52:40 leaf1-controller-0 kernel: httpd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:07:54 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:22:28 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 10:21:42 leaf1-controller-0 kernel: nova-conductor invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

Version-Release number of selected component (if applicable):
OSP 17.0.1

How reproducible:
Run OSP 17.0.1 over a longer period of time.

Steps to Reproduce:
1. Deploy OSP 17.0.1.
2. Run it for a long period of time.

Actual results:
OOM error causing multiple components to fail.

Expected results:
No error / no failure.

Additional info:
sosreports from all the controllers:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-28-nwztdxl.tar.xz  <-- failing controller
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-28-pskyypk.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj.tar.xz
Hey Chris,

Has this happened multiple times? What does memory usage look like on the other two controllers now? Since we have sosreports from all of them, maybe we could generate another sosreport on all three now and compare the process lists against the first set to see which processes are consuming more memory. We need to narrow it down a little to determine which process is to blame; then we can set the correct component on this BZ and assign it to the relevant team.
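For example, a rough sketch that could highlight per-process RSS growth between two captures (it assumes both archives are extracted locally and contain the usual sos_commands/process/ps_auxwww file; the paths below are placeholders, not the real archive names):

OLD=/path/to/first-sosreport      # placeholder: extracted first sosreport
NEW=/path/to/second-sosreport     # placeholder: extracted later sosreport
for d in "$OLD" "$NEW"; do
  echo "== $d =="
  # Sum RSS (KiB, column 6) per command (column 11) from the captured ps output
  # and print the 15 biggest consumers.
  awk 'NR > 1 {rss[$11] += $6} END {for (c in rss) printf "%12d %s\n", rss[c], c}' \
      "$d/sos_commands/process/ps_auxwww" | sort -rn | head -n 15
done

Comparing the two summaries should show which commands moved the most between captures.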
Hey Brendan,

Thanks for looking into it. I do understand it would make sense to nail down the service. Unfortunately, after recovering yesterday and then running all day, I hit the OOM error on 2 of my controllers and I am not able to get a sosreport from them anymore. I was able to snap a sosreport on the last surviving controller and uploaded it here:

https://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj-day-later.tar.xz

A few more details:
- This happens roughly every 30 days or so, but I only opened a BZ now (even though we've been running this cluster for a few months).
- We've been running long-term OSP clusters since OSP8 on the same hardware, and this is the first one we've noticed running OOM; for example, our OSP16.1 cluster had an uptime of over 1 year.
- I cannot identify myself which service is causing this issue, hence I opened the general BZ. For example:

Wed Jun 28 10:57:26 AM EDT 2023

Top 10 Processes by MEM %
Avail    Active  Total    Percent Avail
36070MB  476MB   36546MB  98.6975

USER      %CPU  %MEM  RSZ         COMMAND
42405     5.2   0.3   391.23 MB   ceilometer-agent-notification:
48        0.0   0.2   291.91 MB   horizon
openvsw+  2.7   0.2   290.969 MB  ovs-vswitchd
48        0.0   0.2   290.551 MB  horizon
48        0.0   0.2   290.195 MB  horizon
48        0.0   0.2   288.613 MB  horizon
48        0.0   0.2   282.766 MB  horizon
48        0.0   0.2   281.914 MB  horizon
48        0.0   0.2   280.18 MB   horizon

== Last Half Hour ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

12:00:01 AM  kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:30:01 AM  4667220 9570436 73866756 56.53 764 1181644 49331892 37.75 294572 30733980 516
10:40:01 AM  4280344 9567440 73849692 56.51 832 1598184 49510440 37.89 464112 30933808 1072
10:50:03 AM  4187912 9500648 73909064 56.56 832 1628548 49520260 37.89 465388 31004332 1164
Average:     2127440 6456156 77308899 59.16 59 626764 50970205 39.00 271005 34052353 1059

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

12:00:01 AM  pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:30:01 AM  19608.95 15629.49 45031.34 32.51 47201.35 4947.52 23.34 9202.13 185.12
10:40:01 AM  575.98 1147.37 32192.17 1.63 24212.09 0.00 1.25 2.46 197.07
10:50:03 AM  30.11 991.22 32418.89 0.11 24803.35 0.00 0.00 0.00 0.00
Average:     12606.28 1093.72 31542.23 26.73 27921.93 5308.18 28066.61 6516.09 19.52

== Current 2 Second Intervals ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

10:57:26 AM  kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:57:28 AM  4099268 9426428 73979236 56.61 832 1644820 49649664 37.99 465932 31114768 2432
10:57:30 AM  4116412 9443568 73962064 56.60 832 1644848 49655652 38.00 465940 31118800 3032
10:57:32 AM  4162188 9488888 73920068 56.57 832 1641200 49494100 37.87 465908 31062028 272
10:57:34 AM  4160564 9487280 73921684 56.57 832 1641208 49497168 37.88 465908 31066836 244
10:57:36 AM  4149744 9476476 73932488 56.58 832 1641224 49498340 37.88 465908 31060616 640
Average:     4137635 9464528 73943108 56.58 832 1642660 49558985 37.92 465919 31084610 1324

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0)  06/28/2023  _x86_64_  (32 CPU)

10:57:36 AM  pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:57:38 AM  0.00 470.00 8952.00 0.00 6844.50 0.00 0.00 0.00 0.00
10:57:40 AM  64.00 238.00 20768.00 0.00 16524.00 0.00 0.00 0.00 0.00
10:57:42 AM  0.00 446.00 19351.50 0.00 5276.50 0.00 0.00 0.00 0.00
10:57:44 AM  0.00 34.50 36172.50 0.00 27441.50 0.00 0.00 0.00 0.00
10:57:46 AM  0.00 64.00 8302.50 0.00 13295.00 0.00 0.00 0.00 0.00
Average:     12.80 250.50 18709.30 0.00 13876.30 0.00 0.00 0.00 0.00

- Today it seems I was in the middle of syncing glance images across multiple glance stores when I suddenly lost the 2 controllers.

Thanks for looking into it.
I was able to capture sosreports from all the controllers after a fresh reboot:

http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-29-wucsepe.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-29-jsthxse.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-29-rqqshfx.tar.xz

Again, 2 of the controllers (controller-0 and controller-1) died earlier today; I was forced to reboot them and ended up rebooting all 3.
Maybe we need to set up a timer to run on a schedule and create a log file for us, so that we can see which processes are using the most memory and whether their usage is increasing over time. Something like this (note: everything is run as root here):

[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.timer
[Unit]
Description=Memory usage service
Wants=memory_usage.timer

[Timer]
OnUnitActiveSec=1h

[Install]
WantedBy=multi-user.target
EOF

[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.service
[Unit]
Description=Memory usage service
Wants=memory_usage.timer

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'ps -eo pid,user,cmd,%%mem | grep -E -v 0.0 >> /memory_usage.log'

[Install]
WantedBy=multi-user.target
EOF

[root@fedora ~]# systemctl daemon-reload
[root@fedora ~]# systemctl start memory_usage.timer

[root@fedora ~]# systemctl list-timers
NEXT                         LEFT         LAST                         PASSED        UNIT                          ACTIVATES
Mon 2023-07-03 04:53:44 UTC  22min left   Mon 2023-07-03 03:48:01 UTC  43min ago     dnf-makecache.timer           dnf-makecache.service
Tue 2023-07-04 00:00:00 UTC  19h left     Mon 2023-07-03 00:25:47 UTC  4h 5min ago   logrotate.timer               logrotate.service
Tue 2023-07-04 00:00:00 UTC  19h left     Mon 2023-07-03 00:25:47 UTC  4h 5min ago   unbound-anchor.timer          unbound-anchor.service
Tue 2023-07-04 02:00:01 UTC  21h left     Mon 2023-07-03 02:00:01 UTC  2h 31min ago  systemd-tmpfiles-clean.timer  systemd-tmpfiles-clean.service
Sun 2023-07-09 01:00:00 UTC  5 days left  Sun 2023-07-02 01:11:17 UTC  1 day 3h ago  raid-check.timer              raid-check.service
Mon 2023-07-10 00:53:48 UTC  6 days left  Mon 2023-07-03 00:25:47 UTC  4h 5min ago   fstrim.timer                  fstrim.service
-                            -            Mon 2023-07-03 04:27:01 UTC  4min 25s ago  memory_usage.timer            memory_usage.service  <<<< New timer

And you can test it by starting the service:

[root@fedora ~]# systemctl start memory_usage.service
[root@fedora ~]# cat /memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.8
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1

And this will show that the timer has been executed now:

[root@fedora ~]# systemctl list-timers
NEXT                         LEFT        LAST                         PASSED     UNIT                 ACTIVATES
Mon 2023-07-03 04:53:44 UTC  20min left  Mon 2023-07-03 03:48:01 UTC  45min ago  dnf-makecache.timer  dnf-makecache.service
Mon 2023-07-03 05:32:11 UTC  58min left  Mon 2023-07-03 04:27:01 UTC  6min ago   memory_usage.timer   memory_usage.service

Which will give us something like this over time:

❯ cat memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1

Hopefully that will help track down the process causing the headaches.
Hi Brendan,

Thanks for looking into it. I have installed the script, but I don't believe it will show us anything valuable. I have attached a screenshot of the top command above, sorted by the %MEM column. It doesn't point to any single service being an issue; at the same time, note the amount of free memory for the entire system. I do see a large number of httpd processes with 0.2 or 0.1 usage. Is it possible that OSP 17.0 doesn't terminate/clean up older httpd sessions?
Hey,

So those aren't individual httpd sessions; they are the processes running the API services. I think most services default to running one API worker per CPU core, so for each service you end up with one httpd process per CPU core, which could definitely consume all of the available memory. This is something we have a note about in all of the service templates:

https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/neutron/neutron-api-container-puppet.yaml#L62-L72

So we can try tuning those to see if that solves the problem. The fact that the nodes run for a while and then OOM suggests something is continuously growing, but having hundreds of httpd processes running certainly doesn't help. Maybe we could try setting each service's *Workers count to something more rational, like 8, and then recheck whether anything stands out over time for memory consumption? A quick way to count the current workers per service account is sketched below.
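Something like this (a rough one-liner, nothing tripleo-specific) counts httpd workers per service account; the UIDs it prints are the container service accounts, so they still need to be mapped back to the owning API (e.g. by looking at the full command line with ps -eo user,cmd):

[root@leaf1-controller-0 ~]# ps -eo user:20,comm | awk '$2 == "httpd" {n[$1]++} END {for (u in n) printf "%5d httpd workers for user %s\n", n[u], u}' | sort -rn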
Hey Chris, just checking in to see whether you've tried adjusting the Workers count for the services, and whether that helped with this problem at all.
Hi Brendan,

Sorry for the delay in responding. I agree that having a fixed number of workers would make more sense. I noticed you pasted an example for neutron. Is there a variable in tripleo that would limit the workers for all the services? If not, is there a list of service worker parameters that I could inject into my config?
There's no single variable, no, but you could start with this list:

https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/environments/low-memory-usage.yaml#L3-L13

Try setting all of them to 4 or 8 instead of 1 and see if you still end up having issues.
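A sketch of what that could look like as an extra environment file (the parameter names below are illustrative; take the authoritative list from the linked low-memory-usage.yaml and drop any that don't apply to your deployment):

[stack@undercloud ~]$ cat << EOF > ~/worker-counts.yaml
parameter_defaults:
  # Illustrative values - verify the parameter names against low-memory-usage.yaml
  CinderWorkers: 8
  GlanceWorkers: 8
  HeatWorkers: 8
  KeystoneWorkers: 8
  NeutronWorkers: 8
  NovaWorkers: 8
  SwiftWorkers: 8
EOF

Then include it in the next overcloud deploy run with an additional -e ~/worker-counts.yaml so the values override the defaults.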
Hi Brendan,

I have applied the suggested configuration and the memory utilization has dropped a little, but not much. I am including before and after screenshots of top sorted by memory usage. Also, the number of processes using at least 0.1% of memory decreased from 64 to 44:

[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
64
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
44

Still, the memory utilization is rather high and it keeps growing. This is a semi-production system; I'd like to see if it will eventually stabilize. We did not see this in OSP 16.x. Is there anything else you would like me to validate/modify? If not, let's keep this BZ open for the next few weeks to see if the memory utilization keeps going up on this system.
Hey,

Yeah, we basically need to narrow this down to a specific process to make sure the right engineers are involved to troubleshoot. I would be interested to know whether that collectd process is growing in memory utilization. Is it consuming more than 1.6g of memory now?
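A quick way to check (nothing OSP-specific; ps reports RSS in KiB, and the second command assumes the memory_usage.log timer from earlier is still collecting):

[root@leaf1-controller-0 ~]# ps -eo pid,rss,etime,cmd | grep '[c]ollectd'
[root@leaf1-controller-0 ~]# grep collectd /memory_usage.log | tail -n 20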
Hi Brendan,

Over the weekend, the number of processes consuming at least 0.1% of memory is back up to 58:

[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
58

The current top output, sorted by memory:

top - 12:34:06 up 17 days, 20:53, 1 user, load average: 6.37, 6.74, 6.75
Tasks: 1181 total, 4 running, 1176 sleeping, 0 stopped, 1 zombie
%Cpu(s): 18.5 us, 5.7 sy, 0.0 ni, 74.3 id, 0.0 wa, 0.6 hi, 0.8 si, 0.0 st
MiB Mem : 127617.0 total, 34479.8 free, 74696.6 used, 18440.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 51775.4 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40123 42409 20 0 9244444 6.7g 11224 S 0.0 5.3 12:12.72 collectd-sensub
15654 42434 20 0 17.7g 1.2g 158776 S 14.3 1.0 1018:37 mariadbd
20841 42439 20 0 6923708 377184 70200 S 14.3 0.3 1679:49 beam.smp
1516 openvsw+ 10 -10 3081476 275716 31112 S 2.9 0.2 688:50.31 ovs-vswitchd
891676 42465 20 0 946976 211728 12680 S 1.6 0.2 80:59.90 qdrouterd
141370 42435 20 0 230876 211324 6556 S 0.0 0.2 111:47.92 neutron-server:
141388 42435 20 0 230364 210888 6556 S 0.6 0.2 95:56.37 neutron-server:
141376 42435 20 0 229596 210228 6556 S 4.5 0.2 97:38.59 neutron-server:
141403 42435 20 0 229596 210108 6556 S 0.0 0.2 104:12.78 neutron-server:
141412 42435 20 0 229596 210056 6568 S 1.9 0.2 102:00.11 neutron-server:
141381 42435 20 0 229340 209972 6556 R 16.6 0.2 100:57.41 neutron-server:
141369 42435 20 0 229340 209920 6556 S 0.3 0.2 102:54.30 neutron-server:
141383 42435 20 0 229340 209824 6556 S 0.0 0.2 98:36.04 neutron-server:
141489 42435 20 0 227296 207196 5824 S 0.0 0.2 4:12.89 neutron-server:
141460 42435 20 0 225504 205552 5972 S 0.0 0.2 4:36.00 neutron-server:
141464 42435 20 0 224736 204784 5972 S 0.0 0.2 4:34.45 neutron-server:
141453 42435 20 0 224992 204764 5928 S 0.0 0.2 4:34.07 neutron-server:
141441 42435 20 0 222688 202736 5972 S 0.0 0.2 4:48.89 neutron-server:
141472 42435 20 0 222688 202736 5972 S 0.0 0.2 4:36.39 neutron-server:
141447 42435 20 0 222688 202660 5896 S 0.0 0.2 4:42.43 neutron-server:
141442 42435 20 0 222432 202440 5972 S 0.0 0.2 4:36.07 neutron-server:
141474 42435 20 0 222432 202268 5972 S 0.0 0.2 4:34.20 neutron-server:
958 root 20 0 323692 190428 186672 S 1.6 0.1 220:42.12 systemd-journal
889154 42457 20 0 2649284 188480 4316 S 0.0 0.1 15:25.92 memcached
141495 42435 20 0 208348 187532 5264 S 0.0 0.1 1:18.99 neutron-server:
134148 42435 20 0 196832 187276 16520 S 0.3 0.1 2:15.13 /usr/bin/python
3127 root rt 0 577208 183916 67572 S 0.3 0.1 243:03.92 corosync
141476 42435 20 0 199392 178760 5412 S 0.0 0.1 0:42.36 neutron-server:
48358 48 20 0 858248 168844 14092 S 0.0 0.1 0:42.15 httpd
48349 48 20 0 856456 167924 14088 S 0.0 0.1 0:34.39 httpd
122474 42407 20 0 733276 162280 31420 S 0.0 0.1 11:18.16 httpd
122477 42407 20 0 732764 162228 31420 S 0.0 0.1 11:22.72 httpd
122476 42407 20 0 732508 161708 31420 S 0.0 0.1 10:59.91 httpd
122479 42407 20 0 585556 161036 31420 S 0.0 0.1 11:14.18 httpd
122478 42407 20 0 658776 161016 31420 S 0.0 0.1 10:47.51 httpd
122475 42407 20 0 585812 160936 31420 S 0.0 0.1 10:41.43 httpd
122473 42407 20 0 733020 160568 31420 S 0.3 0.1 11:07.19 httpd
48363 48 20 0 928652 158852 14092 S 0.0 0.1 0:36.28 httpd
48369 48 20 0 853384 158748 14148 S 0.0 0.1 0:32.64 httpd
122472 42407 20 0 511056 158356 31420 S 0.0 0.1 10:46.30 httpd
48362 48 20 0 856712 158112 14092 S 0.0 0.1 0:34.74 httpd
48367 48 20 0 929932 157652 14092 S 0.0 0.1 0:32.73 httpd
48360 48 20 0 856968 157088 14092 S 0.0 0.1 0:39.56 httpd
48359 48 20 0 928140 156104 14084 S 0.0 0.1 0:33.38 httpd
171815 42436 20 0 277992 150184 16276 S 0.0 0.1 33:42.93 httpd
48364 48 20 0 928140 149796 14092 S 0.0 0.1 0:31.35 httpd
171818 42436 20 0 277992 149464 16276 S 0.0 0.1 33:55.05 httpd
171813 42436 20 0 282860 148868 16276 S 0.0 0.1 33:33.54 httpd
171816 42436 20 0 278760 148744 16276 S 3.9 0.1 33:39.95 httpd
171811 42436 20 0 282860 148700 16276 S 0.6 0.1 32:53.68 httpd
171812 42436 20 0 277736 148544 16276 S 0.0 0.1 33:17.54 httpd
171817 42436 20 0 277480 147816 16276 S 0.0 0.1 33:18.56 httpd
171814 42436 20 0 277480 147532 16276 S 0.3 0.1 33:21.68 httpd
235715 42405 20 0 5035728 146792 8900 S 2.9 0.1 268:59.37 ceilometer-agen
48368 48 20 0 855688 146744 14092 S 0.0 0.1 0:32.49 httpd
48356 48 20 0 631164 145208 14092 S 0.0 0.1 0:29.84 httpd
48366 48 20 0 854152 142424 14092 S 0.0 0.1 0:29.60 httpd
124073 42407 20 0 286732 133108 30964 S 0.3 0.1 7:33.50 cinder-schedule
185381 42435 20 0 139412 128512 16216 S 1.0 0.1 55:53.26 neutron-dhcp-ag
358600 root 20 0 280100 126900 31412 S 1.3 0.1 42:58.64 cinder-volume
611254 42437 20 0 342592 125272 12024 S 0.0 0.1 6:39.86 octavia-health-
171200 42436 20 0 253156 124652 15848 S 0.0 0.1 1:07.22 httpd
171196 42436 20 0 253156 124364 15824 S 0.0 0.1 1:06.76 httpd
171193 42436 20 0 253156 124300 15824 S 0.0 0.1 1:07.15 httpd
171199 42436 20 0 253156 124240 15824 S 0.0 0.1 1:07.00 httpd
360748 root 20 0 488008 124060 17428 S 0.0 0.1 10:55.18 cinder-volume

So I have lost 17GB of available memory over the weekend. collectd is higher in its usage, but definitely not by 17GB. It's been only 17 days since my last reboot:

[root@leaf1-controller-0 ~]# uptime
 12:37:59 up 17 days, 20:57, 1 user, load average: 6.23, 6.63, 6.72

It doesn't look like limiting the number of workers did the trick.
Given that collectd is the only thing that has really grown considerably, let's start by asking the cloudops team if they see any issue with that level of memory consumption and growth.
What collectd modules / plugins are installed / running?
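If it helps, something along these lines should list them on a controller; the path is an assumption based on where tripleo normally drops the puppet-generated collectd config, so adjust it if your deployment differs:

[root@leaf1-controller-0 ~]# grep -rhE '^\s*LoadPlugin' /var/lib/config-data/puppet-generated/collectd/etc/ 2>/dev/null | sort -u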
collectd grows memory if it cannot send data. More background: http://matthias-runge.de/2021/03/22/collectd-memory-usage/
Also: collectd should be limited to 512 MB by tripleo, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=2007255 or related BZs. I am curious how this was configured in this case.
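One rough way to check from the controller (this assumes the container is simply named collectd; podman reports the limit in bytes, with 0 meaning no limit):

[root@leaf1-controller-0 ~]# podman inspect collectd --format '{{ .HostConfig.Memory }}'
[root@leaf1-controller-0 ~]# podman stats --no-stream collectd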
Chris, can you please comment or check regarding comment 19?