Bug 2218295
| Summary: | OOM error kills the controller node and ultimately the entire cluster - OSP Controller runs out of memory | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Chris Janiszewski <cjanisze> |
| Component: | openstack-tripleo | Assignee: | Martin Magr <mmagr> |
| Status: | CLOSED WONTFIX | QA Contact: | myadla |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 17.0 (Wallaby) | CC: | bshephar, drosenfe, jlarriba, mbayer, mburns, mmagr, mrunge |
| Target Milestone: | z2 | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-10-10 14:33:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Chris Janiszewski
2023-06-28 15:44:24 UTC
Hey Chris, Has this happened multiple times? What does memory usage look like on the other two controllers now? Since we have sosreports from all of them, maybe we can generate another sosreport now on all of them and compare the process lists to see which process is consuming more memory compared to the first sosreports? But we need to narrow it down a little to determine which process is to blame, and then we can set the correct component on this BZ and assign it to the relevant team.

Hey Brendan, Thanks for looking into it. I do understand it would make sense to nail down the service. Unfortunately, after recovering yesterday and then running all day, I have hit the OOM error on 2 of my controllers and I am not able to get the sosreport from there anymore. I was able to snap the sosreport on the last surviving controller and uploaded it here: https://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-28-tgombfj-day-later.tar.xz

A few more details:
- this happens roughly every 30 days or so .. but I only opened a BZ now (even though we've been running this cluster for a few months).
- we've been running long-term OSP clusters since OSP8 on the same hardware .. and this is the first one we have noticed running OOM; for example, our OSP16.1 cluster had an uptime of over 1 year.
- I cannot identify myself which service is causing this issue, hence I opened the general BZ.

For example:
Wed Jun 28 10:57:26 AM EDT 2023
Top 10 Processes by MEM %
Avail Active Total Percent Avail
36070MB 476MB 36546MB 98.6975
USER %CPU %MEM RSZ COMMAND
42405 5.2 0.3 391.23 MB ceilometer-agent-notification:
48 0.0 0.2 291.91 MB horizon
openvsw+ 2.7 0.2 290.969 MB ovs-vswitchd
48 0.0 0.2 290.551 MB horizon
48 0.0 0.2 290.195 MB horizon
48 0.0 0.2 288.613 MB horizon
48 0.0 0.2 282.766 MB horizon
48 0.0 0.2 281.914 MB horizon
48 0.0 0.2 280.18 MB horizon

== Last Half Hour ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 06/28/2023 _x86_64_ (32 CPU)
12:00:01 AM kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:30:01 AM 4667220 9570436 73866756 56.53 764 1181644 49331892 37.75 294572 30733980 516
10:40:01 AM 4280344 9567440 73849692 56.51 832 1598184 49510440 37.89 464112 30933808 1072
10:50:03 AM 4187912 9500648 73909064 56.56 832 1628548 49520260 37.89 465388 31004332 1164
Average: 2127440 6456156 77308899 59.16 59 626764 50970205 39.00 271005 34052353 1059

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 06/28/2023 _x86_64_ (32 CPU)
12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:30:01 AM 19608.95 15629.49 45031.34 32.51 47201.35 4947.52 23.34 9202.13 185.12
10:40:01 AM 575.98 1147.37 32192.17 1.63 24212.09 0.00 1.25 2.46 197.07
10:50:03 AM 30.11 991.22 32418.89 0.11 24803.35 0.00 0.00 0.00 0.00
Average: 12606.28 1093.72 31542.23 26.73 27921.93 5308.18 28066.61 6516.09 19.52

== Current 2 Second Intervals ==
Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 06/28/2023 _x86_64_ (32 CPU)
10:57:26 AM kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
10:57:28 AM 4099268 9426428 73979236 56.61 832 1644820 49649664 37.99 465932 31114768 2432
10:57:30 AM 4116412 9443568 73962064 56.60 832 1644848 49655652 38.00 465940 31118800 3032
10:57:32 AM 4162188 9488888 73920068 56.57 832 1641200 49494100 37.87 465908 31062028 272
10:57:34 AM 4160564 9487280 73921684 56.57 832 1641208 49497168 37.88 465908 31066836 244
10:57:36 AM 4149744 9476476 73932488 56.58 832 1641224 49498340 37.88 465908 31060616 640
Average: 4137635 9464528 73943108 56.58 832 1642660 49558985 37.92 465919 31084610 1324

Linux 5.14.0-162.22.2.el9_1.x86_64 (leaf1-controller-0) 06/28/2023 _x86_64_ (32 CPU)
10:57:36 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:57:38 AM 0.00 470.00 8952.00 0.00 6844.50 0.00 0.00 0.00 0.00
10:57:40 AM 64.00 238.00 20768.00 0.00 16524.00 0.00 0.00 0.00 0.00
10:57:42 AM 0.00 446.00 19351.50 0.00 5276.50 0.00 0.00 0.00 0.00
10:57:44 AM 0.00 34.50 36172.50 0.00 27441.50 0.00 0.00 0.00 0.00
10:57:46 AM 0.00 64.00 8302.50 0.00 13295.00 0.00 0.00 0.00 0.00
Average: 12.80 250.50 18709.30 0.00 13876.30 0.00 0.00 0.00 0.00

- today it seems that I was in the middle of syncing glance images across multiple glance stores when I suddenly lost the 2 controllers. Thanks for looking into it.

I was able to capture sosreports from all the controllers after the fresh reboot:
http://chrisj.cloud/sosreport-leaf1-controller-0-2023-06-29-wucsepe.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-1-2023-06-29-jsthxse.tar.xz
http://chrisj.cloud/sosreport-leaf1-controller-2-2023-06-29-rqqshfx.tar.xz
Again, 2 of the controllers (controller-0 and controller-1) died earlier today; I was forced to reboot and ended up rebooting all 3.

Maybe we need to do something like set up a timer to run on a schedule and create a log file for us, so that we can see which processes are using the most memory and whether they are increasing their memory usage over time.
So some timer like (Note everything is run as root here):
[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.timer
[Unit]
Description=Memory usage timer
[Timer]
OnUnitActiveSec=1h
[Install]
WantedBy=multi-user.target
EOF
[root@fedora ~]# cat << EOF > /etc/systemd/system/memory_usage.service
[Unit]
Description=Memory usage service
Wants=memory_usage.timer
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'ps -eo pid,user,cmd,%%mem | grep -E -v 0.0 >> /memory_usage.log'
[Install]
WantedBy=multi-user.target
EOF
[root@fedora ~]# systemctl daemon-reload
[root@fedora ~]# systemctl start memory_usage.timer
[root@fedora ~]# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2023-07-03 04:53:44 UTC 22min left Mon 2023-07-03 03:48:01 UTC 43min ago dnf-makecache.timer dnf-makecache.service
Tue 2023-07-04 00:00:00 UTC 19h left Mon 2023-07-03 00:25:47 UTC 4h 5min ago logrotate.timer logrotate.service
Tue 2023-07-04 00:00:00 UTC 19h left Mon 2023-07-03 00:25:47 UTC 4h 5min ago unbound-anchor.timer unbound-anchor.service
Tue 2023-07-04 02:00:01 UTC 21h left Mon 2023-07-03 02:00:01 UTC 2h 31min ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Sun 2023-07-09 01:00:00 UTC 5 days left Sun 2023-07-02 01:11:17 UTC 1 day 3h ago raid-check.timer raid-check.service
Mon 2023-07-10 00:53:48 UTC 6 days left Mon 2023-07-03 00:25:47 UTC 4h 5min ago fstrim.timer fstrim.service
- - Mon 2023-07-03 04:27:01 UTC 4min 25s ago memory_usage.timer memory_usage.service <<<< New timer
And you can test it by starting the service:
[root@fedora ~]# systemctl start memory_usage.service
[root@fedora ~]# cat /memory_usage.log
PID USER CMD %MEM
141454 root /usr/sbin/nordvpnd 0.2
141579 root /usr/lib/systemd/systemd-jo 0.1
239379 fedora /usr/bin/nvim --embed 0.1
239463 fedora /home/fedora/.local/share/n 1.4
247348 influxdb /usr/bin/influxd 0.8
317328 fedora /tmp/go-build2531535926/b00 0.2
352604 root /usr/sbin/smbd --foreground 0.1
And this will show that the timer has been executed now:
[root@fedora ~]# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2023-07-03 04:53:44 UTC 20min left Mon 2023-07-03 03:48:01 UTC 45min ago dnf-makecache.timer dnf-makecache.service
Mon 2023-07-03 05:32:11 UTC 58min left Mon 2023-07-03 04:27:01 UTC 6min ago memory_usage.timer memory_usage.service
Which will give us something like this over time:
❯ cat memory_usage.log
PID USER CMD %MEM
141454 root /usr/sbin/nordvpnd 0.2
141579 root /usr/lib/systemd/systemd-jo 0.1
239379 fedora /usr/bin/nvim --embed 0.1
239463 fedora /home/fedora/.local/share/n 1.4
247348 influxdb /usr/bin/influxd 0.7
317328 fedora /tmp/go-build2531535926/b00 0.2
352604 root /usr/sbin/smbd --foreground 0.1
PID USER CMD %MEM
141454 root /usr/sbin/nordvpnd 0.2
141579 root /usr/lib/systemd/systemd-jo 0.1
239379 fedora /usr/bin/nvim --embed 0.1
239463 fedora /home/fedora/.local/share/n 1.4
247348 influxdb /usr/bin/influxd 0.7
317328 fedora /tmp/go-build2531535926/b00 0.2
352604 root /usr/sbin/smbd --foreground 0.1
PID USER CMD %MEM
141454 root /usr/sbin/nordvpnd 0.2
141579 root /usr/lib/systemd/systemd-jo 0.1
239379 fedora /usr/bin/nvim --embed 0.1
239463 fedora /home/fedora/.local/share/n 1.4
247348 influxdb /usr/bin/influxd 0.7
317328 fedora /tmp/go-build2531535926/b00 0.2
352604 root /usr/sbin/smbd --foreground 0.1
Hopefully that will help track down the process causing the headaches.
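If the log grows large, a quick way to see which commands trend upward between snapshots is to compare the first and last recorded %MEM for each PID (a sketch using standard awk; the log path matches the unit above):
[root@fedora ~]# awk '$1 ~ /^[0-9]+$/ {mem=$NF; cmd=""; for (i=3;i<NF;i++) cmd=cmd $i " "; if (!($1 in first)) {first[$1]=mem; name[$1]=cmd}; last[$1]=mem} END {for (p in first) printf "%-8s %-35s first=%-5s last=%-5s\n", p, name[p], first[p], last[p]}' /memory_usage.log
Long-lived PIDs whose last value keeps climbing above their first value are the ones to watch; PIDs that only appear late in the log are simply restarted services.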
Hi Brendan, Thanks for looking into it. I have installed the script, but I don't believe it will show us anything valuable. I have attached above a screenshot of the top command, sorted by the %mem column. It doesn't indicate any single service being an issue. At the same time, note the amount of free memory for the entire system. I do see a large number of httpd processes with 0.2 or 0.1 usage. Is it possible that OSP17.0 doesn't terminate/clean up older http sessions?

Hey, So those aren't individual httpd sessions, they represent the processes running the API services. I think most services default to running one API worker per CPU core. So for each service, you will end up with one httpd process per CPU core, which could definitely consume all of the available memory. This is one thing that we have a note for in all of the service templates: https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/neutron/neutron-api-container-puppet.yaml#L62-L72 So we can try tuning those to see if that solves the problem. The fact that they run for a while and then OOM seems like something is continuously growing, but I guess if you have hundreds of httpd processes running it wouldn't help. Maybe we could try setting each service's *Workers count to something more rational like 8? Then recheck to see if anything stands out over time for memory consumption?

Hey Chris, just checking in to see whether we tried adjusting the Workers count for the services to see if that helped with this problem at all?

Hi Brendan, Sorry for the delay in responding. I agree that having a fixed number of workers would make more sense. I noticed you've pasted an example for neutron. Would there be a variable in tripleo that would limit all the workers for all the services? If not, is there a list of service workers that I could inject into my config?

There's no single variable, no, but you could start with this list: https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/environments/low-memory-usage.yaml#L3-L13 Try setting all of them to 4 or 8 instead of 1 and see if you still end up having issues.

Hi Brendan, I have applied the suggested configuration and the memory utilization has dropped a little bit, but not much. I am including before and after screenshots of top sorted by memory usage. Also, the number of processes with at least 0.1% memory usage decreased from 64 to 44:
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
64
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
44
Still, the memory utilization is rather high and it keeps growing. This is a semi-production system; I'd like to see if this will eventually stabilize. We've not seen this in OSP16.X. Is there anything else you would want me to validate/modify? If not, let's keep this BZ open for the next few weeks to see if the mem utilization keeps going up on this system.

Hey, Yeah, we basically need to narrow this down to some process to make sure the right engineers are involved to troubleshoot. I would be interested to know if that collectd process is growing in memory utilization? Is it consuming more than 1.6g of memory now?

Hi Brendan,
Over the weekend the number of services that consume at least 0.1 % of memory is back to 58:
[root@leaf1-controller-0 ~]# ps -eo %mem,pid,user,cmd | sort -k 1 | grep -v "0\.0" | wc -l
58
The current top output sorted by memory:
top - 12:34:06 up 17 days, 20:53, 1 user, load average: 6.37, 6.74, 6.75
Tasks: 1181 total, 4 running, 1176 sleeping, 0 stopped, 1 zombie
%Cpu(s): 18.5 us, 5.7 sy, 0.0 ni, 74.3 id, 0.0 wa, 0.6 hi, 0.8 si, 0.0 st
MiB Mem : 127617.0 total, 34479.8 free, 74696.6 used, 18440.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 51775.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40123 42409 20 0 9244444 6.7g 11224 S 0.0 5.3 12:12.72 collectd-sensub
15654 42434 20 0 17.7g 1.2g 158776 S 14.3 1.0 1018:37 mariadbd
20841 42439 20 0 6923708 377184 70200 S 14.3 0.3 1679:49 beam.smp
1516 openvsw+ 10 -10 3081476 275716 31112 S 2.9 0.2 688:50.31 ovs-vswitchd
891676 42465 20 0 946976 211728 12680 S 1.6 0.2 80:59.90 qdrouterd
141370 42435 20 0 230876 211324 6556 S 0.0 0.2 111:47.92 neutron-server:
141388 42435 20 0 230364 210888 6556 S 0.6 0.2 95:56.37 neutron-server:
141376 42435 20 0 229596 210228 6556 S 4.5 0.2 97:38.59 neutron-server:
141403 42435 20 0 229596 210108 6556 S 0.0 0.2 104:12.78 neutron-server:
141412 42435 20 0 229596 210056 6568 S 1.9 0.2 102:00.11 neutron-server:
141381 42435 20 0 229340 209972 6556 R 16.6 0.2 100:57.41 neutron-server:
141369 42435 20 0 229340 209920 6556 S 0.3 0.2 102:54.30 neutron-server:
141383 42435 20 0 229340 209824 6556 S 0.0 0.2 98:36.04 neutron-server:
141489 42435 20 0 227296 207196 5824 S 0.0 0.2 4:12.89 neutron-server:
141460 42435 20 0 225504 205552 5972 S 0.0 0.2 4:36.00 neutron-server:
141464 42435 20 0 224736 204784 5972 S 0.0 0.2 4:34.45 neutron-server:
141453 42435 20 0 224992 204764 5928 S 0.0 0.2 4:34.07 neutron-server:
141441 42435 20 0 222688 202736 5972 S 0.0 0.2 4:48.89 neutron-server:
141472 42435 20 0 222688 202736 5972 S 0.0 0.2 4:36.39 neutron-server:
141447 42435 20 0 222688 202660 5896 S 0.0 0.2 4:42.43 neutron-server:
141442 42435 20 0 222432 202440 5972 S 0.0 0.2 4:36.07 neutron-server:
141474 42435 20 0 222432 202268 5972 S 0.0 0.2 4:34.20 neutron-server:
958 root 20 0 323692 190428 186672 S 1.6 0.1 220:42.12 systemd-journal
889154 42457 20 0 2649284 188480 4316 S 0.0 0.1 15:25.92 memcached
141495 42435 20 0 208348 187532 5264 S 0.0 0.1 1:18.99 neutron-server:
134148 42435 20 0 196832 187276 16520 S 0.3 0.1 2:15.13 /usr/bin/python
3127 root rt 0 577208 183916 67572 S 0.3 0.1 243:03.92 corosync
141476 42435 20 0 199392 178760 5412 S 0.0 0.1 0:42.36 neutron-server:
48358 48 20 0 858248 168844 14092 S 0.0 0.1 0:42.15 httpd
48349 48 20 0 856456 167924 14088 S 0.0 0.1 0:34.39 httpd
122474 42407 20 0 733276 162280 31420 S 0.0 0.1 11:18.16 httpd
122477 42407 20 0 732764 162228 31420 S 0.0 0.1 11:22.72 httpd
122476 42407 20 0 732508 161708 31420 S 0.0 0.1 10:59.91 httpd
122479 42407 20 0 585556 161036 31420 S 0.0 0.1 11:14.18 httpd
122478 42407 20 0 658776 161016 31420 S 0.0 0.1 10:47.51 httpd
122475 42407 20 0 585812 160936 31420 S 0.0 0.1 10:41.43 httpd
122473 42407 20 0 733020 160568 31420 S 0.3 0.1 11:07.19 httpd
48363 48 20 0 928652 158852 14092 S 0.0 0.1 0:36.28 httpd
48369 48 20 0 853384 158748 14148 S 0.0 0.1 0:32.64 httpd
122472 42407 20 0 511056 158356 31420 S 0.0 0.1 10:46.30 httpd
48362 48 20 0 856712 158112 14092 S 0.0 0.1 0:34.74 httpd
48367 48 20 0 929932 157652 14092 S 0.0 0.1 0:32.73 httpd
48360 48 20 0 856968 157088 14092 S 0.0 0.1 0:39.56 httpd
48359 48 20 0 928140 156104 14084 S 0.0 0.1 0:33.38 httpd
171815 42436 20 0 277992 150184 16276 S 0.0 0.1 33:42.93 httpd
48364 48 20 0 928140 149796 14092 S 0.0 0.1 0:31.35 httpd
171818 42436 20 0 277992 149464 16276 S 0.0 0.1 33:55.05 httpd
171813 42436 20 0 282860 148868 16276 S 0.0 0.1 33:33.54 httpd
171816 42436 20 0 278760 148744 16276 S 3.9 0.1 33:39.95 httpd
171811 42436 20 0 282860 148700 16276 S 0.6 0.1 32:53.68 httpd
171812 42436 20 0 277736 148544 16276 S 0.0 0.1 33:17.54 httpd
171817 42436 20 0 277480 147816 16276 S 0.0 0.1 33:18.56 httpd
171814 42436 20 0 277480 147532 16276 S 0.3 0.1 33:21.68 httpd
235715 42405 20 0 5035728 146792 8900 S 2.9 0.1 268:59.37 ceilometer-agen
48368 48 20 0 855688 146744 14092 S 0.0 0.1 0:32.49 httpd
48356 48 20 0 631164 145208 14092 S 0.0 0.1 0:29.84 httpd
48366 48 20 0 854152 142424 14092 S 0.0 0.1 0:29.60 httpd
124073 42407 20 0 286732 133108 30964 S 0.3 0.1 7:33.50 cinder-schedule
185381 42435 20 0 139412 128512 16216 S 1.0 0.1 55:53.26 neutron-dhcp-ag
358600 root 20 0 280100 126900 31412 S 1.3 0.1 42:58.64 cinder-volume
611254 42437 20 0 342592 125272 12024 S 0.0 0.1 6:39.86 octavia-health-
171200 42436 20 0 253156 124652 15848 S 0.0 0.1 1:07.22 httpd
171196 42436 20 0 253156 124364 15824 S 0.0 0.1 1:06.76 httpd
171193 42436 20 0 253156 124300 15824 S 0.0 0.1 1:07.15 httpd
171199 42436 20 0 253156 124240 15824 S 0.0 0.1 1:07.00 httpd
360748 root 20 0 488008 124060 17428 S 0.0 0.1 10:55.18 cinder-volume
So I have lost 17GB of available memory over the weekend. The collectd process is higher in its usage, but definitely not by 17GB.
It's been only 17 days since my last reboot:
[root@leaf1-controller-0 ~]# uptime
12:37:59 up 17 days, 20:57, 1 user, load average: 6.23, 6.63, 6.72
It doesn't look like limiting the number of workers did the trick.
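For reference, the override applied here would have been an environment file along these lines (a sketch only; worker-counts.yaml is a hypothetical filename, the parameter names are a subset of those in the linked low-memory-usage.yaml, and both names and values should be checked against the tripleo-heat-templates version in use):
(undercloud) [stack@bgp-undercloud templates]$ cat /home/stack/templates/worker-counts.yaml
parameter_defaults:
  CinderWorkers: 4
  GlanceWorkers: 4
  HeatWorkers: 4
  KeystoneWorkers: 4
  NeutronWorkers: 4
  NovaWorkers: 4
  SwiftWorkers: 4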
Given that collectd is the only thing that has really grown considerably, let's start by asking the cloudops team if they see any issue with that level of memory consumption and growth.

What collectd modules/plugins are installed/running? collectd grows in memory if it cannot send data. More material: http://matthias-runge.de/2021/03/22/collectd-memory-usage/ Also: collectd should be limited to 512 MB by tripleo, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=2007255 or related BZs. I am curious how this was configured in this case.

Chris, can you please comment or check regarding comment 19?

Hi Matthias,
Thanks for jumping on this, and sorry for the delay in responding. The current STF configuration is:
# STF configuration
ceilometer::agent::polling::polling_interval: 30
ceilometer::agent::polling::polling_meters:
- cpu
- disk.*
- ip.*
- image.*
- memory
- memory.*
- network.*
- perf.*
- port
- port.*
- switch
- switch.*
- storage.*
- volume.*
# to avoid filling the memory buffers if disconnected from the message bus
# note: this may need an adjustment if there are many metrics to be sent.
collectd::plugin::amqp1::send_queue_limit: 5000
# receive extra information about virtual memory
collectd::plugin::vmem::verbose: true
# provide name and uuid in addition to hostname for better correlation
# to ceilometer data
collectd::plugin::virt::hostname_format: "name uuid hostname"
# provide the human-friendly name of the virtual instance
collectd::plugin::virt::plugin_instance_format: metadata
# set memcached collectd plugin to report its metrics by hostname
# rather than host IP, ensuring metrics in the dashboard remain uniform
collectd::plugin::memcached::instances:
local:
host: "%{hiera('fqdn_canonical')}"
port: 11211
We generally have a stable STF/OCP platform (metrics receiver) and the messages should be consistently received by it. We haven't had an outage there for at least a year.
We are also planning to move to OSP17.1 in the near future. Let me know if there are any adjustments you would want me to make in the new deployment.
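One way to sanity-check whether telemetry messages are backing up on the controller-side router rather than being drained by STF is to look at the per-link undelivered/unsettled counters in qdstat (a sketch; metrics_qdr as the QDR container name and the listener address/port are assumptions that need to be adjusted to the local listener used by collectd in this deployment):
[root@leaf1-controller-2 ~]# podman exec metrics_qdr qdstat -b 172.20.12.137:5666 -l
Steadily growing undel/unsett counts on the collectd and ceilometer links would point at the consumer side; values near zero would point back at a process on the OSP side.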
Out of curiosity, would you be able to provide me the stf-connectors.yaml (make sure there are only two CollectdAmqpInstances)? Please also do a podman inspect collectd | grep Memory. It should show a limit for "Memory"; if it shows 0, there is no limit set. If there is a limit set, either podman resource limitations don't work, or you are looking for a different service to blame for the OOM killing. Is there a message consumer on your STF side? This seems like sensubility keeps generating messages which are not received.

Thanks for looking into it, here are the requested data points:
(undercloud) [stack@bgp-undercloud templates]$ cat /home/stack/templates/enable-stf.yaml
resource_registry:
OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml
parameter_defaults:
MetricsQdrConnectors:
- host: default-interconnect-5671-service-telemetry.apps.ocp-bm.openinfra.lab
port: 443
role: edge
verifyHostname: false
sslProfile: sslProfile
MetricsQdrSSLProfiles:
- name: sslProfile
caCertFileContent: |
-----BEGIN CERTIFICATE-----
<snip>
-----END CERTIFICATE-----
# only send to STF, not other publishers
EventPipelinePublishers: []
PipelinePublishers: []
# manage the polling and pipeline configuration files for Ceilometer agents
ManagePolling: true
ManagePipeline: true
# enable Ceilometer metrics and events
CeilometerQdrPublishMetrics: true
CeilometerQdrPublishEvents: true
# enable collection of API status
CollectdEnableSensubility: true
CollectdSensubilityTransport: amqp1
# enable collection of containerized service metrics
CollectdEnableLibpodstats: true
# set collectd overrides for higher telemetry resolution and extra plugins
# to load
CollectdConnectionType: amqp1
CollectdAmqpInterval: 5
CollectdDefaultPollingInterval: 5
CollectdExtraPlugins:
- vmem
# set standard prefixes for where metrics and events are published to QDR
MetricsQdrAddresses:
- prefix: 'collectd'
distribution: multicast
- prefix: 'anycast/ceilometer'
distribution: multicast
##################################
[root@leaf1-controller-2 ~]# podman inspect collectd | grep Memory
"Memory": 0,
"KernelMemory": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": 0,
It looks like there are no limits on memory. We have followed the docs.
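With no cgroup limit in place, a point-in-time view of what the telemetry containers themselves are using can be taken with podman stats (a sketch; metrics_qdr as the QDR container name is an assumption):
[root@leaf1-controller-2 ~]# podman stats --no-stream --format "table {{.Name}} {{.MemUsage}} {{.MemPerc}}" collectd metrics_qdr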
##################################
(undercloud) [stack@bgp-undercloud templates]$ cat ./leaf1/site-name.yaml
parameter_defaults:
NovaComputeAvailabilityZone: leaf1
ControllerExtraConfig:
nova::availability_zone::default_schedule_zone: leaf1
tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('internal_api')}"
tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('internal_api')}"
nova::compute::force_config_drive: true
ComputeLeaf1ExtraConfig:
tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('internal_api')}"
tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('internal_api')}"
nova::compute::force_config_drive: true
NovaCrossAZAttach: false
CinderStorageAvailabilityZone: leaf1
GlanceBackendID: ceph
CeilometerQdrEventsConfig:
driver: amqp
topic: leaf1-event
CeilometerQdrMetricsConfig:
driver: amqp
topic: leaf1-metering
CollectdAmqpInstances:
leaf1-notify:
notify: true
format: JSON
presettle: false
leaf1-telemetry:
format: JSON
presettle: false
CollectdSensubilityResultsChannel: sensubility/leaf1-telemetry
#############################
@Martin - is there a way to verify the consumer on the STF side? Here are the pods and services that have been defined there:
[cjanisze@fedora-vm ~]$ oc get pods -n service-telemetry
NAME READY STATUS RESTARTS AGE
alertmanager-default-0 3/3 Running 0 186d
alertmanager-default-1 3/3 Running 0 186d
central-bbb6c567-xxzcn 1/1 Running 0 55d
default-interconnect-848f695747-dhfdw 1/1 Running 0 6d1h
default-interconnect-848f695747-x2m6t 1/1 Running 0 6d1h
default-leaf1-ceil-event-smartgateway-766694665-6v7js 2/2 Running 8 (6d1h ago) 186d
default-leaf1-ceil-event-smartgateway-766694665-mz6tw 2/2 Running 7 (6d1h ago) 186d
default-leaf1-ceil-meter-smartgateway-7cbd89456-nfjbb 3/3 Running 3108 (19m ago) 186d
default-leaf1-ceil-meter-smartgateway-7cbd89456-xm55r 3/3 Running 3130 (33m ago) 186d
default-leaf1-coll-event-smartgateway-6cbbcdcff4-fzffs 2/2 Running 6 (6d1h ago) 186d
default-leaf1-coll-event-smartgateway-6cbbcdcff4-wzllb 2/2 Running 6 (6d1h ago) 186d
default-leaf1-coll-meter-smartgateway-868d4b9744-hb825 3/3 Running 9 (6d1h ago) 186d
default-leaf1-coll-meter-smartgateway-868d4b9744-wfq5d 3/3 Running 6 (6d1h ago) 186d
default-leaf1-sens-meter-smartgateway-5f6f85457b-88dfz 3/3 Running 6 (6d1h ago) 186d
default-leaf1-sens-meter-smartgateway-5f6f85457b-zxssn 3/3 Running 7 (6d1h ago) 186d
default-leaf2-ceil-event-smartgateway-7f6cf64d66-dlfxg 2/2 Running 8 (6d1h ago) 186d
default-leaf2-ceil-event-smartgateway-7f6cf64d66-wxrq4 2/2 Running 7 (6d1h ago) 186d
default-leaf2-ceil-meter-smartgateway-64bdc557b6-2mdxd 3/3 Running 5 (6d1h ago) 186d
default-leaf2-ceil-meter-smartgateway-64bdc557b6-ktdqn 3/3 Running 6 (6d1h ago) 186d
default-leaf2-coll-event-smartgateway-5859b47dcc-gs7kd 2/2 Running 6 (6d1h ago) 186d
default-leaf2-coll-event-smartgateway-5859b47dcc-hl4ft 2/2 Running 7 (6d1h ago) 186d
default-leaf2-coll-meter-smartgateway-66f88d67c7-jqq6q 3/3 Running 7 (6d1h ago) 186d
default-leaf2-coll-meter-smartgateway-66f88d67c7-ms28c 3/3 Running 8 (6d1h ago) 186d
default-leaf2-sens-meter-smartgateway-f8f88cc99-lxg8l 3/3 Running 7 (6d1h ago) 186d
default-leaf2-sens-meter-smartgateway-f8f88cc99-nnjb4 3/3 Running 5 (6d1h ago) 186d
default-leaf3-ceil-event-smartgateway-5874f8fdcb-8pzt5 2/2 Running 8 (6d1h ago) 186d
default-leaf3-ceil-event-smartgateway-5874f8fdcb-ghtqd 2/2 Running 7 (6d1h ago) 186d
default-leaf3-ceil-meter-smartgateway-5d59f7669c-dvcxh 3/3 Running 8 (6d1h ago) 186d
default-leaf3-ceil-meter-smartgateway-5d59f7669c-ngsvn 3/3 Running 7 (6d1h ago) 186d
default-leaf3-coll-event-smartgateway-57c7cc46d8-frcr2 2/2 Running 4 (6d1h ago) 186d
default-leaf3-coll-event-smartgateway-57c7cc46d8-z2vxg 2/2 Running 6 (6d1h ago) 186d
default-leaf3-coll-meter-smartgateway-6f97db9455-ktsnl 3/3 Running 8 (6d1h ago) 186d
default-leaf3-coll-meter-smartgateway-6f97db9455-ndrgx 3/3 Running 9 (6d1h ago) 186d
default-leaf3-sens-meter-smartgateway-866468459c-6hgg5 3/3 Running 9 (6d1h ago) 186d
default-leaf3-sens-meter-smartgateway-866468459c-zn5d4 3/3 Running 8 (6d1h ago) 186d
elastic-operator-96d99cbff-7zcp9 1/1 Running 0 40d
elasticsearch-es-default-0 1/1 Running 0 98d
elasticsearch-es-default-1 1/1 Running 0 98d
elasticsearch-es-default-2 1/1 Running 0 98d
filebeat-beat-filebeat-75c8f9cd44-gf6z6 1/1 Running 0 152d
grafana-deployment-575f495d4f-zx79d 2/2 Running 0 186d
grafana-operator-controller-manager-58596df79c-s8k8p 2/2 Running 0 186d
interconnect-operator-7c8b5b859-6tzw6 1/1 Running 0 8d
kibana-kb-d4b4589d8-5m549 1/1 Running 0 152d
prometheus-default-0 3/3 Running 1 (186d ago) 186d
prometheus-default-1 3/3 Running 1 (186d ago) 186d
prometheus-operator-b5d479c56-5nvch 1/1 Running 0 186d
scanner-745849d47d-4mb5t 1/1 Running 0 55d
scanner-745849d47d-qlc87 1/1 Running 0 55d
scanner-db-7644569bcc-6f9sj 1/1 Running 0 55d
service-telemetry-operator-5d4d5784df-459sz 1/1 Running 0 148d
smart-gateway-operator-6ffddd594c-4qjrd 1/1 Running 0 186d
[cjanisze@fedora-vm ~]$ oc get svc -n service-telemetry
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 189d
central ClusterIP 172.30.60.212 <none> 443/TCP 104d
default-alertmanager-proxy ClusterIP 172.30.49.177 <none> 9095/TCP 189d
default-interconnect ClusterIP 172.30.45.144 <none> 5672/TCP,8672/TCP,55671/TCP,5671/TCP,5673/TCP 189d
default-leaf1-ceil-meter ClusterIP 172.30.196.1 <none> 8083/TCP 189d
default-leaf1-coll-meter ClusterIP 172.30.115.144 <none> 8083/TCP 189d
default-leaf1-sens-meter ClusterIP 172.30.248.212 <none> 8083/TCP 189d
default-leaf2-ceil-meter ClusterIP 172.30.28.229 <none> 8083/TCP 189d
default-leaf2-coll-meter ClusterIP 172.30.112.173 <none> 8083/TCP 189d
default-leaf2-sens-meter ClusterIP 172.30.183.111 <none> 8083/TCP 189d
default-leaf3-ceil-meter ClusterIP 172.30.245.213 <none> 8083/TCP 189d
default-leaf3-coll-meter ClusterIP 172.30.223.58 <none> 8083/TCP 189d
default-leaf3-sens-meter ClusterIP 172.30.139.244 <none> 8083/TCP 189d
default-prometheus-proxy ClusterIP 172.30.203.240 <none> 9092/TCP 189d
elastic-operator-service ClusterIP 172.30.198.254 <none> 443/TCP 40d
elasticsearch-es-default ClusterIP None <none> 9200/TCP 189d
elasticsearch-es-http ClusterIP 172.30.142.242 <none> 9200/TCP 189d
elasticsearch-es-internal-http ClusterIP 172.30.4.133 <none> 9200/TCP 189d
elasticsearch-es-transport ClusterIP None <none> 9300/TCP 189d
filebeat-service ClusterIP 172.30.255.235 <none> 9000/TCP 160d
grafana-operator-controller-manager-metrics-service ClusterIP 172.30.203.90 <none> 8443/TCP 189d
grafana-service ClusterIP 172.30.192.91 <none> 3000/TCP,3002/TCP 189d
kibana-kb-http ClusterIP 172.30.119.244 <none> 5601/TCP 172d
prometheus-operated ClusterIP None <none> 9090/TCP 189d
scanner ClusterIP 172.30.108.19 <none> 8080/TCP,8443/TCP 104d
scanner-db ClusterIP 172.30.106.17 <none> 5432/TCP 104d
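On the consumer side, the restart counts above (over 3000 for the leaf1 ceil-meter smart gateways) are already a hint; the gateway logs can be pulled directly to see why they keep restarting (a sketch, using the deployment names shown above):
[cjanisze@fedora-vm ~]$ oc -n service-telemetry logs deployment/default-leaf1-ceil-meter-smartgateway --all-containers --tail=50
[cjanisze@fedora-vm ~]$ oc -n service-telemetry get pods | grep -E 'ceil-meter|sens-meter'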
Also, we are probably a few days away from getting another OOM kill. I am personally not convinced it has anything to do with STF. Here is the current top memory consumers chart:
top - 09:55:09 up 25 days, 21:37, 2 users, load average: 3.20, 2.96, 2.82
Tasks: 1141 total, 1 running, 1140 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 0.9 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.2 hi, 0.3 si, 0.0 st
MiB Mem : 127617.1 total, 23697.7 free, 91824.3 used, 12095.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 34684.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9370 42409 20 0 4197036 1.8g 11016 S 0.0 1.5 383:05.72 collectd-sensub
8045 42457 20 0 3605320 1.6g 4324 S 0.0 1.3 120:18.04 memcached
8020 42465 20 0 1398060 899100 12612 S 5.2 0.7 769:56.00 qdrouterd
197803 42434 20 0 12.4g 481848 157948 S 1.6 0.4 501:27.60 mariadbd
16439 42439 20 0 6924132 407688 70056 S 28.3 0.3 6824:37 beam.smp
7625 48 20 0 1073548 300892 14252 S 0.0 0.2 3:43.85 httpd
7618 48 20 0 1070476 299120 14304 S 0.0 0.2 3:55.34 httpd
7621 48 20 0 1072012 289956 14304 S 0.0 0.2 3:30.17 httpd
7623 48 20 0 1069924 289212 14304 S 0.0 0.2 4:04.41 httpd
7620 48 20 0 1073548 284772 14304 S 0.0 0.2 3:47.17 httpd
7656 48 20 0 1071244 280428 14304 S 0.0 0.2 4:00.70 httpd
7629 48 20 0 1070732 275864 14304 S 0.0 0.2 3:37.76 httpd
7622 48 20 0 1005964 275812 14304 S 0.0 0.2 3:55.26 httpd
7657 48 20 0 1070220 274980 14300 S 0.0 0.2 3:36.29 httpd
7630 48 20 0 1071244 274408 14304 S 0.0 0.2 3:43.34 httpd
1502 openvsw+ 10 -10 3076060 270788 31112 S 2.0 0.2 949:48.94 ovs-vswitchd
7619 48 20 0 1004428 263492 14304 S 0.0 0.2 3:42.83 httpd
7624 48 20 0 1005708 261104 14304 S 0.0 0.2 3:42.02 httpd
21199 42435 20 0 263728 246312 8656 S 0.3 0.2 1117:22 neutron-server:
21203 42435 20 0 263472 246200 8656 S 0.3 0.2 1102:43 neutron-server:
21202 42435 20 0 261936 244664 8656 S 0.0 0.2 1137:40 neutron-server:
21196 42435 20 0 261168 243896 8656 S 0.0 0.2 1120:36 neutron-server:
21197 42435 20 0 260912 243640 8656 S 0.0 0.2 1164:18 neutron-server:
21206 42435 20 0 260912 243404 8656 S 0.0 0.2 1148:07 neutron-server:
21205 42435 20 0 259632 242204 8656 S 0.0 0.2 1170:57 neutron-server:
21204 42435 20 0 258864 241592 8656 S 1.0 0.2 1104:18 neutron-server:
21212 42435 20 0 245040 224860 5892 S 0.0 0.2 36:54.93 neutron-server:
21214 42435 20 0 244784 224552 5892 S 0.0 0.2 37:32.92 neutron-server:
21208 42435 20 0 243504 223348 5892 S 0.0 0.2 38:20.33 neutron-server:
21215 42435 20 0 243248 223200 5892 S 0.3 0.2 37:57.85 neutron-server:
21213 42435 20 0 243248 223012 5892 S 0.0 0.2 37:20.17 neutron-server:
21209 42435 20 0 242992 222916 5892 S 0.0 0.2 36:51.90 neutron-server:
21211 42435 20 0 242992 222708 5892 S 0.0 0.2 37:46.99 neutron-server:
21210 42435 20 0 242480 222268 5892 S 0.0 0.2 35:42.12 neutron-server:
21217 42435 20 0 241968 221380 5536 S 0.0 0.2 56:46.28 neutron-server:
9427 42405 20 0 5038696 209536 9068 S 13.0 0.2 2982:20 ceilometer-agen
944 root 20 0 343208 202384 199248 S 0.3 0.2 371:29.47 systemd-journal
10115 42415 20 0 1155664 192808 24044 S 0.3 0.1 113:48.79 glance-api
8024 42435 20 0 197940 188288 16464 S 0.0 0.1 18:23.44 /usr/bin/python
21218 42435 20 0 208432 187540 5040 S 0.0 0.1 10:31.41 neutron-server:
21216 42435 20 0 206388 185536 5176 S 0.0 0.1 5:01.65 neutron-server:
3144 root rt 0 577200 183908 67572 S 1.0 0.1 349:36.59 corosync
7459 42407 20 0 731992 179312 31236 S 0.0 0.1 115:33.48 httpd
7460 42407 20 0 734040 178804 31300 S 0.0 0.1 117:26.22 httpd
7452 42407 20 0 731992 176852 31228 S 0.0 0.1 117:09.87 httpd
7455 42407 20 0 732248 176768 31236 S 0.0 0.1 117:07.11 httpd
7456 42407 20 0 731992 176272 31236 S 0.3 0.1 115:57.66 httpd
7453 42407 20 0 734296 173372 31228 S 0.0 0.1 117:52.21 httpd
7451 42407 20 0 731736 172840 31236 S 0.0 0.1 116:18.51 httpd
7458 42407 20 0 732248 171904 31228 S 0.0 0.1 117:12.75 httpd
8292 42436 20 0 281984 164336 16512 S 0.0 0.1 506:25.72 httpd
8339 42436 20 0 282088 164132 16388 S 0.0 0.1 511:45.12 httpd
8340 42436 20 0 282496 162528 16392 S 0.0 0.1 511:59.55 httpd
[root@leaf1-controller-2 ~]# uptime
09:55:15 up 25 days, 21:37, 2 users, load average: 2.95, 2.91, 2.81
Here is another output sorted by VIRT:
top - 09:58:48 up 25 days, 21:40, 2 users, load average: 2.59, 2.65, 2.71
Tasks: 1155 total, 1 running, 1154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.1 us, 2.8 sy, 0.0 ni, 88.9 id, 0.3 wa, 0.4 hi, 0.5 si, 0.0 st
MiB Mem : 127617.1 total, 23160.5 free, 92345.6 used, 12111.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 34163.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
197803 42434 20 0 12.4g 481860 157948 S 1.0 0.4 501:30.54 mariadbd
16439 42439 20 0 6933144 412400 70056 S 31.6 0.3 6825:18 beam.smp
9427 42405 20 0 5038696 209536 9068 S 16.0 0.2 2982:44 ceilometer-agen
9370 42409 20 0 4197036 1.8g 11016 S 0.0 1.5 383:10.08 collectd-sensub
8045 42457 20 0 3605320 1.6g 4324 S 0.0 1.3 120:18.93 memcached
1297 root 20 0 3566900 81744 32960 S 0.0 0.1 83:18.03 podman
7169 root 20 0 3089108 47696 11032 S 2.6 0.0 645:59.37 collectd
1502 openvsw+ 10 -10 3076060 270788 31112 S 2.6 0.2 949:54.76 ovs-vswitchd
3159 polkitd 20 0 2982392 22480 18376 S 0.0 0.0 0:00.93 polkitd
142352 root 20 0 2908944 73920 31744 S 9.8 0.1 0:00.30 podman
142404 root 20 0 2826508 66496 31384 S 9.1 0.1 0:00.28 podman
142600 root 20 0 2458104 65228 31484 S 4.9 0.0 0:00.15 podman
142477 root 20 0 2457592 68444 32252 S 6.2 0.1 0:00.19 podman
142710 root 20 0 2457508 61844 30256 S 6.2 0.0 0:00.19 podman
142434 root 20 0 2456996 66936 31020 S 6.2 0.1 0:00.19 podman
142357 root 20 0 2309360 63768 32088 S 7.5 0.0 0:00.23 podman
142423 root 20 0 2309276 60792 30200 S 6.2 0.0 0:00.19 podman
142709 root 20 0 2236568 61868 30308 S 3.9 0.0 0:00.12 podman
142354 root 20 0 2162920 60820 31588 S 6.8 0.0 0:00.21 podman
142346 root 20 0 2162152 63840 31168 S 7.2 0.0 0:00.22 podman
142375 root 20 0 2014348 61060 29420 S 6.2 0.0 0:00.19 podman
3155 root 20 0 1978756 99800 15012 S 0.0 0.1 64:32.87 pcsd
142419 root 20 0 1940956 61324 31604 S 5.5 0.0 0:00.17 podman
142353 root 20 0 1940444 62500 31668 S 6.5 0.0 0:00.20 podman
142363 root 20 0 1940188 60644 32052 S 6.2 0.0 0:00.19 podman
142460 root 20 0 1866712 59000 31412 S 4.9 0.0 0:00.15 podman
8020 42465 20 0 1398060 899172 12612 S 2.9 0.7 770:02.11 qdrouterd
10115 42415 20 0 1155664 192808 24044 S 0.0 0.1 113:50.35 glance-api
6991 994 20 0 1141328 33512 3800 S 4.2 0.0 2456:07 haproxy
10120 42415 20 0 1131996 156244 24516 S 0.0 0.1 109:06.41 glance-api
60706 42435 20 0 1127560 10508 1272 S 0.0 0.0 0:21.28 haproxy
60942 42435 20 0 1127560 10640 1408 S 0.0 0.0 0:21.62 haproxy
61251 42435 20 0 1127560 10604 1372 S 0.0 0.0 0:21.35 haproxy
61875 42435 20 0 1127560 10496 1264 S 0.0 0.0 0:21.42 haproxy
62118 42435 20 0 1127560 10572 1344 S 0.0 0.0 0:21.54 haproxy
62263 42435 20 0 1127560 8536 1344 S 0.0 0.0 0:21.56 haproxy
62549 42435 20 0 1127560 10640 1408 S 0.0 0.0 0:21.36 haproxy
62899 42435 20 0 1127560 10528 1296 S 0.0 0.0 0:21.52 haproxy
63090 42435 20 0 1127560 10604 1376 S 0.0 0.0 0:21.25 haproxy
10119 42415 20 0 1122524 137244 24072 S 0.0 0.1 105:53.89 glance-api
10110 42415 20 0 1116892 139144 23732 S 0.0 0.1 105:05.50 glance-api
10109 42415 20 0 1113924 155660 24288 S 0.0 0.1 106:45.28 glance-api
7620 48 20 0 1073548 284772 14304 S 0.0 0.2 3:47.18 httpd
7625 48 20 0 1073548 300892 14252 S 0.0 0.2 3:43.86 httpd
7621 48 20 0 1072012 289956 14304 S 0.0 0.2 3:30.17 httpd
7630 48 20 0 1071244 274408 14304 S 0.0 0.2 3:43.35 httpd
7656 48 20 0 1071244 280428 14304 S 0.0 0.2 4:00.71 httpd
7629 48 20 0 1070732 275864 14304 S 0.0 0.2 3:37.77 httpd
7618 48 20 0 1070476 299120 14304 S 0.0 0.2 3:55.34 httpd
7657 48 20 0 1070220 274980 14300 S 0.0 0.2 3:36.30 httpd
7623 48 20 0 1069924 289212 14304 S 0.0 0.2 4:04.42 httpd
10111 42415 20 0 1056476 143328 23732 S 0.3 0.1 98:41.37 glance-api
7622 48 20 0 1005964 275812 14304 S 0.0 0.2 3:55.27 httpd
7624 48 20 0 1005708 261104 14304 S 0.0 0.2 3:42.03 httpd
7619 48 20 0 1004428 263492 14304 S 0.0 0.2 3:42.84 httpd
10105 42415 20 0 990684 134424 21256 S 0.0 0.1 102:04.90 glance-api
and another sorted by RES
top - 09:59:46 up 25 days, 21:41, 2 users, load average: 2.22, 2.52, 2.66
Tasks: 1137 total, 3 running, 1134 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.9 us, 0.8 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.2 hi, 0.4 si, 0.0 st
MiB Mem : 127617.1 total, 23664.7 free, 91839.2 used, 12113.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 34669.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9370 42409 20 0 4197036 1.8g 11016 S 0.0 1.5 383:10.49 collectd-sensub
8045 42457 20 0 3605320 1.6g 4324 S 0.0 1.3 120:19.16 memcached
8020 42465 20 0 1398060 899172 12612 S 4.5 0.7 770:03.75 qdrouterd
197803 42434 20 0 12.4g 481876 157948 S 1.0 0.4 501:31.32 mariadbd
16439 42439 20 0 6935696 408932 70056 S 17.9 0.3 6825:29 beam.smp
7625 48 20 0 1073548 300892 14252 S 0.0 0.2 3:43.86 httpd
7618 48 20 0 1070476 299120 14304 S 0.0 0.2 3:55.35 httpd
7621 48 20 0 1072012 289956 14304 S 0.0 0.2 3:30.18 httpd
7623 48 20 0 1069924 289212 14304 S 0.0 0.2 4:04.42 httpd
7620 48 20 0 1073548 284772 14304 S 0.0 0.2 3:47.18 httpd
7656 48 20 0 1071244 280428 14304 S 0.0 0.2 4:00.71 httpd
7629 48 20 0 1070732 275864 14304 S 0.0 0.2 3:37.77 httpd
7622 48 20 0 1005964 275812 14304 S 0.0 0.2 3:55.27 httpd
7657 48 20 0 1070220 274980 14300 S 0.0 0.2 3:36.30 httpd
7630 48 20 0 1071244 274408 14304 S 0.0 0.2 3:43.35 httpd
1502 openvsw+ 10 -10 3076060 270788 31112 S 2.9 0.2 949:56.27 ovs-vswitchd
7619 48 20 0 1004428 263492 14304 S 0.0 0.2 3:42.84 httpd
7624 48 20 0 1005708 261104 14304 S 0.0 0.2 3:42.03 httpd
21199 42435 20 0 263728 246312 8656 S 4.5 0.2 1117:35 neutron-server:
21203 42435 20 0 263472 246200 8656 S 0.0 0.2 1102:52 neutron-server:
21202 42435 20 0 261936 244664 8656 S 16.6 0.2 1137:50 neutron-server:
21196 42435 20 0 261168 243896 8656 S 0.3 0.2 1120:46 neutron-server:
21197 42435 20 0 260912 243640 8656 S 0.0 0.2 1164:24 neutron-server:
21206 42435 20 0 260912 243404 8656 S 0.3 0.2 1148:17 neutron-server:
21205 42435 20 0 259632 242204 8656 S 0.3 0.2 1171:04 neutron-server:
21204 42435 20 0 258864 241592 8656 S 0.3 0.2 1104:23 neutron-server:
21212 42435 20 0 245040 224860 5892 S 0.0 0.2 36:55.16 neutron-server:
21214 42435 20 0 244784 224552 5892 S 0.0 0.2 37:33.18 neutron-server:
21208 42435 20 0 243504 223348 5892 S 0.0 0.2 38:20.55 neutron-server:
21215 42435 20 0 243248 223200 5892 S 0.0 0.2 37:58.07 neutron-server:
21213 42435 20 0 243248 223012 5892 S 0.0 0.2 37:20.38 neutron-server:
21209 42435 20 0 242992 222916 5892 S 0.0 0.2 36:52.13 neutron-server:
21211 42435 20 0 242992 222708 5892 S 0.0 0.2 37:47.21 neutron-server:
21210 42435 20 0 242480 222268 5892 S 0.3 0.2 35:42.35 neutron-server:
21217 42435 20 0 241968 221380 5536 S 0.0 0.2 56:46.38 neutron-server:
944 root 20 0 359592 217956 214820 S 0.3 0.2 371:32.55 systemd-journal
9427 42405 20 0 5038696 209536 9068 S 15.9 0.2 2982:50 ceilometer-agen
10115 42415 20 0 1155664 192808 24044 S 0.0 0.1 113:50.46 glance-api
8024 42435 20 0 197940 188288 16464 S 0.0 0.1 18:23.58 /usr/bin/python
21218 42435 20 0 208432 187540 5040 S 0.0 0.1 10:31.49 neutron-server:
21216 42435 20 0 206388 185536 5176 S 0.0 0.1 5:01.68 neutron-server:
3144 root rt 0 577200 183908 67572 S 1.0 0.1 349:39.09 corosync
7459 42407 20 0 731992 179312 31236 S 0.0 0.1 115:34.63 httpd
7460 42407 20 0 734040 178804 31300 S 0.0 0.1 117:26.49 httpd
7452 42407 20 0 731992 176852 31228 S 0.3 0.1 117:10.88 httpd
7455 42407 20 0 732248 176768 31236 S 0.0 0.1 117:07.42 httpd
7456 42407 20 0 731992 176272 31236 S 0.0 0.1 115:59.28 httpd
7453 42407 20 0 734296 173372 31228 S 0.3 0.1 117:53.31 httpd
7451 42407 20 0 731736 172840 31236 S 0.0 0.1 116:18.79 httpd
7458 42407 20 0 732248 171904 31228 S 0.0 0.1 117:14.47 httpd
8292 42436 20 0 281984 164336 16512 S 1.0 0.1 506:30.43 httpd
8339 42436 20 0 282088 164132 16388 S 3.2 0.1 511:50.09 httpd
8340 42436 20 0 282496 162528 16392 S 0.3 0.1 512:04.63 httpd
8290 42436 20 0 281064 162520 16396 S 2.6 0.1 508:41.25 httpd
8293 42436 20 0 280808 162464 16396 S 0.3 0.1 513:27.04 httpd
8337 42436 20 0 281216 161788 16520 S 0.0 0.1 507:58.67 httpd
I have 128GB of RAM on these controllers and it takes roughly 1 month for them to run out of it, but I don't see any single process to blame.
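When no single resident set accounts for the loss, it can help to compare the sum of all process RSS against what the kernel reports as used, and to check kernel-side consumers such as slab, shmem/tmpfs and page tables (a sketch using standard tools):
[root@leaf1-controller-2 ~]# ps -eo rss= | awk '{sum+=$1} END {printf "total process RSS: %.1f GiB\n", sum/1024/1024}'
[root@leaf1-controller-2 ~]# grep -E 'MemAvailable|Slab|SReclaimable|SUnreclaim|Shmem|PageTables' /proc/meminfo
[root@leaf1-controller-2 ~]# df -h -t tmpfs
A large gap between the summed RSS and the used memory reported by top would point at slab, shmem/tmpfs or page tables rather than at any one daemon.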
[root@leaf1-controller-2 ~]# cat /var/lib/config-data/puppet-generated/collectd/etc/collectd.d/10-amqp1.conf
# Generated by Puppet
<LoadPlugin amqp1>
Globals false
Interval 5
</LoadPlugin>
<Plugin amqp1>
<Transport "metrics">
Host "172.20.12.137"
Port "5666"
User "guest"
Password "guest"
Address "collectd"
RetryDelay 1
SendQueueLimit 5000
<Instance "leaf1-notify">
Format "JSON"
Notify true
PreSettle false
</Instance>
<Instance "leaf1-telemetry">
Format "JSON"
PreSettle false
</Instance>
</Transport>
</Plugin>
Please update to 17.1 where sensubility has been updated. |
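For anyone wanting to confirm the behaviour before and after the 17.1 update, the growth of the sensubility process can be tracked over time with something as simple as (a sketch; the log path and interval are arbitrary):
[root@leaf1-controller-2 ~]# while sleep 3600; do date; ps -eo pid,rss,etime,cmd | grep '[c]ollectd-sensub'; done >> /var/log/sensub_rss.log &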