Bug 2218295

Summary: OOM error kills the controller node and ultimately the entire cluster - OSP Controller runs out of memory
Product: Red Hat OpenStack            Reporter: Chris Janiszewski <cjanisze>
Component: openstack-tripleo          Assignee: Martin Magr <mmagr>
Status: NEW ---                       QA Contact: myadla
Severity: medium                      Docs Contact:
Priority: medium
Version: 17.0 (Wallaby)               CC: bshephar, drosenfe, jlarriba, mbayer, mburns, mrunge
Target Milestone: z2                  Keywords: Triaged, ZStream
Target Release: ---                   Flags: mrunge: needinfo? (cjanisze)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:                     Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:                             Environment:
Last Closed:                          Type: Bug
Regression: ---                       Mount Type: ---
Documentation: ---                    CRM:
Verified Versions:                    Category: ---
oVirt Team: ---                       RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                  Target Upstream Version:
Embargoed:

Description Chris Janiszewski 2023-06-28 15:44:24 UTC
Description of problem:
First, I want to apologize if I have misclassified the OpenStack component. It might also be a RHEL issue, so feel free to re-route if needed.

RH OSP 17.0.1 has stopped working due to an OOM error on a single controller node, which caused Galera and RabbitMQ to error out.

Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Jun 28 08:21:20 leaf1-controller-0 kernel: CPU: 3 PID: 1024000 Comm: podman Not tainted 5.14.0-162.22.2.el9_1.x86_64 #1
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Hardware name: LENOVO Lenovo Flex System x240 M5 Compute Node -[9532AC1]-/-[9532AC1]-, BIOS -[C4E144A-3.10]- 04/09/2020
Jun 28 08:21:20 leaf1-controller-0 kernel: Call Trace:
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_stack_lvl+0x34/0x48
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: dump_header+0x4a/0x201
Jun 28 08:21:20 leaf1-controller-0 kernel: oom_kill_process.cold+0xb/0x10
Jun 28 08:21:20 leaf1-controller-0 kernel: out_of_memory+0xed/0x2d0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages_slowpath.constprop.0+0x7cc/0x8a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __alloc_pages+0x1fe/0x230
Jun 28 08:21:20 leaf1-controller-0 kernel: folio_alloc+0x17/0x50
Jun 28 08:21:20 leaf1-controller-0 kernel: __filemap_get_folio+0x1b6/0x330
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: ? do_sync_mmap_readahead+0x14b/0x270
Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
Jun 28 08:21:20 leaf1-controller-0 kernel: filemap_fault+0x454/0x7a0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? next_uptodate_page+0x160/0x1f0
Jun 28 08:21:20 leaf1-controller-0 kernel: ? filemap_map_pages+0x307/0x4a0
Jun 28 08:21:20 leaf1-controller-0 kernel: __xfs_filemap_fault+0x66/0x280 [xfs]
Jun 28 08:21:20 leaf1-controller-0 kernel: __do_fault+0x36/0x110
Jun 28 08:21:20 leaf1-controller-0 kernel: do_read_fault+0xea/0x190
Jun 28 08:21:20 leaf1-controller-0 kernel: do_fault+0x8c/0x2c0
Jun 28 08:21:20 leaf1-controller-0 kernel: __handle_mm_fault+0x3cb/0x750
Jun 28 08:21:20 leaf1-controller-0 kernel: handle_mm_fault+0xc5/0x2a0
Jun 28 08:21:20 leaf1-controller-0 kernel: do_user_addr_fault+0x1bb/0x690
Jun 28 08:21:20 leaf1-controller-0 kernel: exc_page_fault+0x62/0x150
Jun 28 08:21:20 leaf1-controller-0 kernel: asm_exc_page_fault+0x22/0x30
Jun 28 08:21:20 leaf1-controller-0 kernel: RIP: 0033:0x5597f2ea7f91
Jun 28 08:21:20 leaf1-controller-0 kernel: IPv4: martian source 169.254.95.120 from 169.254.95.118, on dev enp0s20u1u5
Jun 28 08:21:20 leaf1-controller-0 kernel: Code: Unable to access opcode bytes at RIP 0x5597f2ea7f67.
Jun 28 08:21:20 leaf1-controller-0 kernel: RSP: 002b:000000c000987870 EFLAGS: 00010246

Shortly after, Pacemaker started failing:
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for galera on galera-bundle-2: Timed Out after 30s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for redis on redis-bundle-2: Timed Out after 60s (Resource agent did not complete in time)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: notice: High CPU load detected: 921.770020
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Result of monitor operation for rabbitmq-bundle-2 on leaf1-controller-0: Timed Out after 30s (Remote executor did not respond)
Jun 28 08:21:21 leaf1-controller-0 pacemaker-controld[3253]: error: Lost connection to Pacemaker Remote node rabbitmq-bundle-2
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: localized_remote_header: Triggered fatal assertion at remote.c:107 : endian == ENDIAN_LOCAL
Jun 28 08:21:21 leaf1-controller-0 pacemaker-remoted[2]: error: Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor the swab'd 3c6c726d

And rabbitmq?
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: show_signal_msg: 13 callbacks suppressed
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: handle[10844]: segfault at 18 ip 00007f09e061aa8b sp 00007f09b1d809c8 error 4
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: ll header: 00000000: ff ff ff ff ff ff 08 94 ef 25 85 cd 08 06
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: in libqpid-proton.so.11.13.0[7f09e05ff000+38000]

I couldn't identify what component caused the underlying OOM failure. Maybe a memory leak?

Here are some of the OOM errors:
/var/log/messages:Jun 28 08:21:20 leaf1-controller-0 kernel: podman invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:25:57 leaf1-controller-0 kernel: ironic-inspecto invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 08:52:40 leaf1-controller-0 kernel: httpd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:07:54 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 09:22:28 leaf1-controller-0 kernel: pcsd invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
/var/log/messages:Jun 28 10:21:42 leaf1-controller-0 kernel: nova-conductor invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
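
To see which processes the kernel actually killed (as opposed to which process happened to invoke the oom-killer), the matching "Out of memory: Killed process" lines can be pulled from the same log; a rough sketch:

[root@leaf1-controller-0 ~]# grep -E "invoked oom-killer|Out of memory: Killed process" /var/log/messages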


Version-Release number of selected component (if applicable):
OSP17.0.1

How reproducible:
Run OSP 17.0.1 over a longer period of time

Steps to Reproduce:
1. deploy OSP17.0.1
2. run it for a long period of time
3.

Actual results:
OOM error causing multiple components to fail

Expected results:
no error / no failure

Additional info:
sosreports from all the controllers:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-28-nwztdxl.tar.xz  <-- failing controller
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-28-pskyypk.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj.tar.xz

Comment 1 Brendan Shephard 2023-06-29 09:39:36 UTC
Hey Chris,

Has this happened multiple times? What does memory usage look like on the other two controllers now? Since we have sosreports from all of them, maybe we could generate another sosreport on each of them now and compare the process lists to see which process is consuming more memory than in the first sosreports.

Either way, we need to narrow it down a little to determine which process is to blame; then we can set the correct component on this BZ and assign it to the relevant team.

Comment 2 Chris Janiszewski 2023-06-29 14:54:24 UTC
Hey Brendan,

Thanks for looking into it. I understand it would make sense to nail down the service. Unfortunately, after recovering yesterday and then running for the whole day, I hit the OOM error on 2 of my controllers and am no longer able to get a sosreport from them.
I was able to capture a sosreport on the last surviving controller and uploaded it here:
https://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj-day-later.tar.xz


A few more details:
- This happens roughly every 30 days or so, but I only opened a BZ now (even though we've been running this cluster for a few months).
- We've been running long-term OSP clusters since OSP8 on the same hardware, and this is the first one we have noticed hitting OOM; for example, our OSP16.1 cluster had an uptime of over 1 year.
- I cannot identify myself which service is causing this issue, hence the general BZ. For example:
 
Wed Jun 28 10:57:26 AM EDT 2023
Top 10 Processes by MEM %

Avail	Active	Total	Percent Avail
36070MB	476MB	36546MB	98.6975

    USER   %CPU   %MEM      RSZ     COMMAND
   42405    5.2    0.3   391.23 MB  ceilometer-agent-notification:
      48    0.0    0.2   291.91 MB  horizon
openvsw+    2.7    0.2  290.969 MB  ovs-vswitchd
      48    0.0    0.2  290.551 MB  horizon
      48    0.0    0.2  290.195 MB  horizon
      48    0.0    0.2  288.613 MB  horizon
      48    0.0    0.2  282.766 MB  horizon
      48    0.0    0.2  281.914 MB  horizon
      48    0.0    0.2   280.18 MB  horizon

== Last Half Hour ==

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 	06/28/2023 	_x86_64_	(32 CPU)

12:00:01 AM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
10:30:01 AM   4667220   9570436  73866756     56.53       764   1181644  49331892     37.75    294572  30733980       516
10:40:01 AM   4280344   9567440  73849692     56.51       832   1598184  49510440     37.89    464112  30933808      1072
10:50:03 AM   4187912   9500648  73909064     56.56       832   1628548  49520260     37.89    465388  31004332      1164
Average:      2127440   6456156  77308899     59.16        59    626764  50970205     39.00    271005  34052353      1059

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 	06/28/2023 	_x86_64_	(32 CPU)

12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
10:30:01 AM  19608.95  15629.49  45031.34     32.51  47201.35   4947.52     23.34   9202.13    185.12
10:40:01 AM    575.98   1147.37  32192.17      1.63  24212.09      0.00      1.25      2.46    197.07
10:50:03 AM     30.11    991.22  32418.89      0.11  24803.35      0.00      0.00      0.00      0.00
Average:     12606.28   1093.72  31542.23     26.73  27921.93   5308.18  28066.61   6516.09     19.52

== Current 2 Second Intervals ==

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 	06/28/2023 	_x86_64_	(32 CPU)

10:57:26 AM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
10:57:28 AM   4099268   9426428  73979236     56.61       832   1644820  49649664     37.99    465932  31114768      2432
10:57:30 AM   4116412   9443568  73962064     56.60       832   1644848  49655652     38.00    465940  31118800      3032
10:57:32 AM   4162188   9488888  73920068     56.57       832   1641200  49494100     37.87    465908  31062028       272
10:57:34 AM   4160564   9487280  73921684     56.57       832   1641208  49497168     37.88    465908  31066836       244
10:57:36 AM   4149744   9476476  73932488     56.58       832   1641224  49498340     37.88    465908  31060616       640
Average:      4137635   9464528  73943108     56.58       832   1642660  49558985     37.92    465919  31084610      1324

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 	06/28/2023 	_x86_64_	(32 CPU)

10:57:36 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
10:57:38 AM      0.00    470.00   8952.00      0.00   6844.50      0.00      0.00      0.00      0.00
10:57:40 AM     64.00    238.00  20768.00      0.00  16524.00      0.00      0.00      0.00      0.00
10:57:42 AM      0.00    446.00  19351.50      0.00   5276.50      0.00      0.00      0.00      0.00
10:57:44 AM      0.00     34.50  36172.50      0.00  27441.50      0.00      0.00      0.00      0.00
10:57:46 AM      0.00     64.00   8302.50      0.00  13295.00      0.00      0.00      0.00      0.00
Average:        12.80    250.50  18709.30      0.00  13876.30      0.00      0.00      0.00      0.00


- Today it seems I was in the middle of syncing glance images across multiple glance stores when I suddenly lost the 2 controllers.

Thanks for looking into it

Comment 3 Chris Janiszewski 2023-06-29 15:57:41 UTC
I was able to capture sosreports from all the controllers after a fresh reboot:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-29-wucsepe.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-29-jsthxse.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-29-rqqshfx.tar.xz

Again, 2 of the controllers (controller-0 and controller-1) died earlier today; I was forced to reboot and ended up rebooting all 3.

Comment 4 Brendan Shephard 2023-07-03 04:34:10 UTC
Maybe we need to set up a timer that runs on a schedule and writes to a log file for us, so that we can see which processes are using the most memory and whether their memory usage is increasing over time.

So, something like this (note: everything here is run as root):
[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.timer
[Unit]
Description=Memory usage timer

[Timer]
OnUnitActiveSec=1h

[Install]
WantedBy=timers.target
EOF

[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.service
[Unit]
Description=Memory usage service
Wants=memory_usage.timer

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'ps -eo pid,user,cmd,%%mem | grep -E -v 0.0 >> /memory_usage.log'

[Install]
WantedBy=multi-user.target
EOF

[root@fedora ~]# systemctl daemon-reload
[root@fedora ~]# systemctl start memory_usage.timer
[root@fedora ~]# systemctl list-timers
NEXT                        LEFT        LAST                        PASSED       UNIT                         ACTIVATES
Mon 2023-07-03 04:53:44 UTC 22min left  Mon 2023-07-03 03:48:01 UTC 43min ago    dnf-makecache.timer          dnf-makecache.service
Tue 2023-07-04 00:00:00 UTC 19h left    Mon 2023-07-03 00:25:47 UTC 4h 5min ago  logrotate.timer              logrotate.service
Tue 2023-07-04 00:00:00 UTC 19h left    Mon 2023-07-03 00:25:47 UTC 4h 5min ago  unbound-anchor.timer         unbound-anchor.service
Tue 2023-07-04 02:00:01 UTC 21h left    Mon 2023-07-03 02:00:01 UTC 2h 31min ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2023-07-09 01:00:00 UTC 5 days left Sun 2023-07-02 01:11:17 UTC 1 day 3h ago raid-check.timer             raid-check.service
Mon 2023-07-10 00:53:48 UTC 6 days left Mon 2023-07-03 00:25:47 UTC 4h 5min ago  fstrim.timer                 fstrim.service
-                           -           Mon 2023-07-03 04:27:01 UTC 4min 25s ago memory_usage.timer           memory_usage.service <<<< New timer


And you can test it by starting the service:
[root@fedora ~]# systemctl start memory_usage.service
[root@fedora ~]# cat /memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.8
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1

And this will show that the timer has been executed now:
[root@fedora ~]# systemctl list-timers
NEXT                        LEFT        LAST                        PASSED       UNIT                         ACTIVATES
Mon 2023-07-03 04:53:44 UTC 20min left  Mon 2023-07-03 03:48:01 UTC 45min ago    dnf-makecache.timer          dnf-makecache.service
Mon 2023-07-03 05:32:11 UTC 58min left  Mon 2023-07-03 04:27:01 UTC 6min ago     memory_usage.timer           memory_usage.service


Which will give us something like this over time:
❯ cat memory_usage.log
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1
    PID USER     CMD                         %MEM
 141454 root     /usr/sbin/nordvpnd           0.2
 141579 root     /usr/lib/systemd/systemd-jo  0.1
 239379 fedora   /usr/bin/nvim --embed        0.1
 239463 fedora   /home/fedora/.local/share/n  1.4
 247348 influxdb /usr/bin/influxd             0.7
 317328 fedora   /tmp/go-build2531535926/b00  0.2
 352604 root     /usr/sbin/smbd --foreground  0.1


Hopefully that will help track down the process causing the headaches.

Comment 6 Chris Janiszewski 2023-07-07 15:42:47 UTC
Hi Brendan,

Thanks for looking into it. I have installed the script, but I don't believe it will show us anything valuable. I have attached a screenshot of the top command above, sorted by the %mem column. It doesn't indicate any single service as the issue. At the same time, note the amount of free memory for the entire system. I do see a large number of httpd processes with 0.2 or 0.1 usage. Is it possible that OSP 17.0 doesn't terminate/clean up older http sessions?
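
For reference, a rough way to put a number on that is to count the httpd processes per service user (each OSP service runs its own pool of httpd workers under a distinct UID, so this approximates workers per service); a sketch:

[root@leaf1-controller-0 ~]# ps -eo user:16,comm --no-headers | \
    awk '$2 == "httpd" {count[$1]++} END {for (u in count) print count[u], u}' | sort -rn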

Comment 7 Brendan Shephard 2023-07-10 00:17:45 UTC
Hey,

So those aren't individual httpd sessions; they are the processes running the API services.
I think most services default to running one API worker per CPU core. So for each service, you end up with one httpd process per CPU core, which could definitely consume all of the available memory. This is one thing we have a note about in all of the service templates:
https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/neutron/neutron-api-container-puppet.yaml#L62-L72

So we can try tuning those to see if that solves the problem. The fact that they run for a while and then hit OOM suggests something is continuously growing, but I guess if you have hundreds of httpd processes running, that wouldn't help. Maybe we could try setting each service's *Workers count to something more rational, like 8?

Then recheck to see if anything stands out over time for memory consumption?
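
One way to see whether any service stands out once its workers are aggregated is to total RSS per command name; a rough sketch (RSS double-counts pages shared between worker processes, so treat the numbers as an upper bound):

[root@fedora ~]# ps -eo rss=,comm= | \
    awk '{rss[$2] += $1} END {for (c in rss) printf "%10.1f MB  %s\n", rss[c]/1024, c}' | \
    sort -rn | head -20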

Comment 8 Brendan Shephard 2023-07-17 04:45:32 UTC
Hey Chris, just checking in to see whether we tried adjusting the Workers count for the services to see if that helped with this problem at all?

Comment 9 Chris Janiszewski 2023-07-17 17:49:53 UTC
Hi Brendan,

Sorry for the delay in responding. I agree that having a fixed number of workers would make more sense. I noticed you've pasted an example for neutron. Is there a variable in tripleo that would limit the workers for all the services? If not, is there a list of service worker parameters that I could inject into my config?

Comment 10 Brendan Shephard 2023-07-20 00:30:55 UTC
There's no single variable, no, but you could start with this list:
https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/environments/low-memory-usage.yaml#L3-L13

Try setting all of them to 4 or 8 instead of 1 and see if you still end up having issues.
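
For example, something like the following environment file passed to the overcloud deploy command with -e (a sketch only: the file name and path are illustrative, and the parameter names come from the low-memory-usage.yaml linked above, so double-check the full list against your exact version):

[stack@undercloud ~]$ cat << EOF > ~/templates/worker-counts.yaml
parameter_defaults:
  # Cap API workers instead of defaulting to one per CPU core
  CeilometerWorkers: 4
  CinderWorkers: 4
  GlanceWorkers: 4
  HeatWorkers: 4
  KeystoneWorkers: 4
  NeutronWorkers: 4
  NovaWorkers: 4
  SwiftWorkers: 4
EOF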

Comment 11 Chris Janiszewski 2023-07-28 20:27:45 UTC
Hi Brendan,

I have applied the suggested configuration and the memory utilization has dropped a little, but not much. I am including before and after screenshots of top sorted by memory usage. Also, the number of processes using at least 0.1% of memory decreased from 64 to 44:
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l                                                                                                                                                   
64     

[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
44

Still, the memory utilization is rather high and it keeps growing. This is a semi-production system, and I'd like to see if it will eventually stabilize. We've not seen this in OSP 16.x.
Is there anything else you would like me to validate/modify? If not, let's keep this BZ open for the next few weeks to see if the memory utilization keeps going up on this system.

Comment 14 Brendan Shephard 2023-07-30 07:35:28 UTC
Hey,

Yeah, we basically need to narrow this down to some process to make sure the right engineers are involved to troubleshoot.

I would be interested to know whether that collectd process is growing in memory utilization. Is it consuming more than 1.6G of memory now?

Comment 15 Chris Janiszewski 2023-07-31 16:38:57 UTC
Hi Brendan,

Over the weekend, the number of processes that consume at least 0.1% of memory went back up to 58:
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
58

The current top output sorted by memory:
top - 12:34:06 up 17 days, 20:53,  1 user,  load average: 6.37, 6.74, 6.75
Tasks: 1181 total,   4 running, 1176 sleeping,   0 stopped,   1 zombie
%Cpu(s): 18.5 us,  5.7 sy,  0.0 ni, 74.3 id,  0.0 wa,  0.6 hi,  0.8 si,  0.0 st
MiB Mem : 127617.0 total,  34479.8 free,  74696.6 used,  18440.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  51775.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                      
  40123 42409     20   0 9244444   6.7g  11224 S   0.0   5.3  12:12.72 collectd-sensub                                                                                              
  15654 42434     20   0   17.7g   1.2g 158776 S  14.3   1.0   1018:37 mariadbd                                                                                                     
  20841 42439     20   0 6923708 377184  70200 S  14.3   0.3   1679:49 beam.smp                                                                                                     
   1516 openvsw+  10 -10 3081476 275716  31112 S   2.9   0.2 688:50.31 ovs-vswitchd                                                                                                 
 891676 42465     20   0  946976 211728  12680 S   1.6   0.2  80:59.90 qdrouterd                                                                                                    
 141370 42435     20   0  230876 211324   6556 S   0.0   0.2 111:47.92 neutron-server:                                                                                              
 141388 42435     20   0  230364 210888   6556 S   0.6   0.2  95:56.37 neutron-server:                                                                                              
 141376 42435     20   0  229596 210228   6556 S   4.5   0.2  97:38.59 neutron-server:                                                                                              
 141403 42435     20   0  229596 210108   6556 S   0.0   0.2 104:12.78 neutron-server:                                                                                              
 141412 42435     20   0  229596 210056   6568 S   1.9   0.2 102:00.11 neutron-server:                                                                                              
 141381 42435     20   0  229340 209972   6556 R  16.6   0.2 100:57.41 neutron-server:                                                                                              
 141369 42435     20   0  229340 209920   6556 S   0.3   0.2 102:54.30 neutron-server:                                                                                              
 141383 42435     20   0  229340 209824   6556 S   0.0   0.2  98:36.04 neutron-server:                                                                                              
 141489 42435     20   0  227296 207196   5824 S   0.0   0.2   4:12.89 neutron-server:                                                                                              
 141460 42435     20   0  225504 205552   5972 S   0.0   0.2   4:36.00 neutron-server:                                                                                              
 141464 42435     20   0  224736 204784   5972 S   0.0   0.2   4:34.45 neutron-server:                                                                                              
 141453 42435     20   0  224992 204764   5928 S   0.0   0.2   4:34.07 neutron-server:                                                                                              
 141441 42435     20   0  222688 202736   5972 S   0.0   0.2   4:48.89 neutron-server:                                                                                              
 141472 42435     20   0  222688 202736   5972 S   0.0   0.2   4:36.39 neutron-server:                                                                                              
 141447 42435     20   0  222688 202660   5896 S   0.0   0.2   4:42.43 neutron-server:                                                                                              
 141442 42435     20   0  222432 202440   5972 S   0.0   0.2   4:36.07 neutron-server:                                                                                              
 141474 42435     20   0  222432 202268   5972 S   0.0   0.2   4:34.20 neutron-server:                                                                                              
    958 root      20   0  323692 190428 186672 S   1.6   0.1 220:42.12 systemd-journal                                                                                              
 889154 42457     20   0 2649284 188480   4316 S   0.0   0.1  15:25.92 memcached                                                                                                    
 141495 42435     20   0  208348 187532   5264 S   0.0   0.1   1:18.99 neutron-server:                                                                                              
 134148 42435     20   0  196832 187276  16520 S   0.3   0.1   2:15.13 /usr/bin/python                                                                                              
   3127 root      rt   0  577208 183916  67572 S   0.3   0.1 243:03.92 corosync                                                                                                     
 141476 42435     20   0  199392 178760   5412 S   0.0   0.1   0:42.36 neutron-server:                                                                                              
  48358 48        20   0  858248 168844  14092 S   0.0   0.1   0:42.15 httpd                                                                                                        
  48349 48        20   0  856456 167924  14088 S   0.0   0.1   0:34.39 httpd                                                                                                        
 122474 42407     20   0  733276 162280  31420 S   0.0   0.1  11:18.16 httpd                                                                                                        
 122477 42407     20   0  732764 162228  31420 S   0.0   0.1  11:22.72 httpd                                                                                                        
 122476 42407     20   0  732508 161708  31420 S   0.0   0.1  10:59.91 httpd                                                                                                        
 122479 42407     20   0  585556 161036  31420 S   0.0   0.1  11:14.18 httpd                                                                                                        
 122478 42407     20   0  658776 161016  31420 S   0.0   0.1  10:47.51 httpd                                                                                                        
 122475 42407     20   0  585812 160936  31420 S   0.0   0.1  10:41.43 httpd                                                                                                        
 122473 42407     20   0  733020 160568  31420 S   0.3   0.1  11:07.19 httpd                                                                                                        
  48363 48        20   0  928652 158852  14092 S   0.0   0.1   0:36.28 httpd                                                                                                        
  48369 48        20   0  853384 158748  14148 S   0.0   0.1   0:32.64 httpd                                                                                                        
 122472 42407     20   0  511056 158356  31420 S   0.0   0.1  10:46.30 httpd                                                                                                        
  48362 48        20   0  856712 158112  14092 S   0.0   0.1   0:34.74 httpd                                                                                                        
  48367 48        20   0  929932 157652  14092 S   0.0   0.1   0:32.73 httpd                                                                                                        
  48360 48        20   0  856968 157088  14092 S   0.0   0.1   0:39.56 httpd                                                                                                        
  48359 48        20   0  928140 156104  14084 S   0.0   0.1   0:33.38 httpd                                                                                                        
 171815 42436     20   0  277992 150184  16276 S   0.0   0.1  33:42.93 httpd                                                                                                        
  48364 48        20   0  928140 149796  14092 S   0.0   0.1   0:31.35 httpd                                                                                                        
 171818 42436     20   0  277992 149464  16276 S   0.0   0.1  33:55.05 httpd                                                                                                        
 171813 42436     20   0  282860 148868  16276 S   0.0   0.1  33:33.54 httpd                                                                                                        
 171816 42436     20   0  278760 148744  16276 S   3.9   0.1  33:39.95 httpd                                                                                                        
 171811 42436     20   0  282860 148700  16276 S   0.6   0.1  32:53.68 httpd                                                                                                        
 171812 42436     20   0  277736 148544  16276 S   0.0   0.1  33:17.54 httpd                                                                                                        
 171817 42436     20   0  277480 147816  16276 S   0.0   0.1  33:18.56 httpd                                                                                                        
 171814 42436     20   0  277480 147532  16276 S   0.3   0.1  33:21.68 httpd                                                                                                        
 235715 42405     20   0 5035728 146792   8900 S   2.9   0.1 268:59.37 ceilometer-agen                                                                                              
  48368 48        20   0  855688 146744  14092 S   0.0   0.1   0:32.49 httpd                                                                                                        
  48356 48        20   0  631164 145208  14092 S   0.0   0.1   0:29.84 httpd                                                                                                        
  48366 48        20   0  854152 142424  14092 S   0.0   0.1   0:29.60 httpd                                                                                                        
 124073 42407     20   0  286732 133108  30964 S   0.3   0.1   7:33.50 cinder-schedule                                                                                              
 185381 42435     20   0  139412 128512  16216 S   1.0   0.1  55:53.26 neutron-dhcp-ag                                                                                              
 358600 root      20   0  280100 126900  31412 S   1.3   0.1  42:58.64 cinder-volume                                                                                                
 611254 42437     20   0  342592 125272  12024 S   0.0   0.1   6:39.86 octavia-health-                                                                                              
 171200 42436     20   0  253156 124652  15848 S   0.0   0.1   1:07.22 httpd                                                                                                        
 171196 42436     20   0  253156 124364  15824 S   0.0   0.1   1:06.76 httpd                                                                                                        
 171193 42436     20   0  253156 124300  15824 S   0.0   0.1   1:07.15 httpd                                                                                                        
 171199 42436     20   0  253156 124240  15824 S   0.0   0.1   1:07.00 httpd                                                                                                        
 360748 root      20   0  488008 124060  17428 S   0.0   0.1  10:55.18 cinder-volume      



So I have lost 17GB of available memory over the weekend. The collectd process is higher in its usage, but definitely not by 17GB.

It's been only 17 days since my last reboot:
[root@leaf1-controller-0 ~]# uptime
 12:37:59 up 17 days, 20:57,  1 user,  load average: 6.23, 6.63, 6.72

It doesn't look like limiting the number of workers did the trick.
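
One sanity check that might help narrow this down: compare the sum of RSS across all processes with what the system reports as used (a rough sketch; RSS double-counts memory shared between processes, so the total is an overestimate). If the two differ by tens of GB, the growth is not showing up as process resident memory at all and we would need to look outside the process list.

[root@leaf1-controller-0 ~]# free -m
[root@leaf1-controller-0 ~]# ps -eo rss= | awk '{total += $1} END {printf "total process RSS: %.1f GB\n", total/1024/1024}'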

Comment 16 Brendan Shephard 2023-08-01 02:46:26 UTC
Given that collectd is the only thing that has really grown considerably, let's start by asking the cloudops team if they see any issue with that level of memory consumption and growth.

Comment 17 Michael Bayer 2023-08-01 14:10:32 UTC
What collectd modules / plugins are installed / running?

Comment 18 Matthias Runge 2023-08-08 13:00:41 UTC
collectd grows in memory if it cannot send data. More material: http://matthias-runge.de/2021/03/22/collectd-memory-usage/

Comment 19 Matthias Runge 2023-08-08 13:06:09 UTC
Also: collectd should be limited to 512 MB by tripleo; see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=2007255 or related BZs. I am curious how this was configured in this case.
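
A quick way to check how the limit actually ended up configured on the affected controller (assuming the container is named collectd, as in a default TripleO deployment; 536870912 bytes would correspond to 512M, 0 means no limit):

[root@leaf1-controller-0 ~]# podman inspect collectd --format '{{.HostConfig.Memory}}'
[root@leaf1-controller-0 ~]# podman stats --no-stream collectd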

Comment 20 Matthias Runge 2023-08-17 08:30:12 UTC
Chris, can you please comment or check regarding comment 19?