Created attachment 1485083 [details]
Octavia worker logs

Description of problem:

On all OpenShift on OpenStack deployments, after some time the controller node runs out of RAM and the OpenShift cluster becomes unmanageable. At that point I've noticed that restarting the Octavia health manager container releases gigabytes of memory and makes the OpenShift cluster manageable again.

Version-Release number of selected component (if applicable):
OSP 13 puddle 2018-09-13.1
rhosp13/openstack-octavia-health-manager:2018-09-13.1
openstack-octavia-common-2.0.1-6.d137eaagit.el7ost.noarch
openstack-octavia-health-manager-2.0.1-6.d137eaagit.el7ost.noarch
puppet-octavia-12.4.0-2.el7ost.noarch
python-octavia-2.0.1-6.d137eaagit.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install OSP 13 (with Octavia)
2. Install OCP (3.11 in this case, also seen with 3.10)
3. Monitor the free memory on the controller node

Actual results:

[heat-admin@controller-0 ~]$ date; free -h
mié sep 19 16:12:57 UTC 2018
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        3,8G         69M        3,3G        6,2G
Swap:            0B          0B          0B

(shiftstack) [cloud-user@ansible-host-0 ~]$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id                                   | name                                           | project_id                       | vip_address    | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| cc4a377e-c4dd-4010-8974-5e999fd595a6 | openshift-cluster-router_lb-ca37ddhjjrvx       | 3a88df4f83fa44ffada1c3b03e17327b | 192.168.99.12  | ACTIVE              | octavia  |
| e77f6b07-2c8e-4562-a383-898da945c9b7 | openshift-ansible-openshift.example.com-api-lb | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.0.1     | ACTIVE              | octavia  |
| 986a3e27-1e6a-4d32-8776-9cbe3935a278 | default/router                                 | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.7.21    | ACTIVE              | octavia  |
| 05c98423-fa11-489e-8760-6d6e7b18c77a | default/docker-registry                        | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.103.68  | ACTIVE              | octavia  |
| 6f34d87b-966a-4bee-bc05-9e7953d50250 | default/registry-console                       | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.86.138  | ACTIVE              | octavia  |
| c1a75615-86d7-43d4-9f38-2f24df9857a8 | openshift-web-console/webconsole               | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.211.2   | ACTIVE              | octavia  |
| 065a9324-042a-48a3-bf3b-658efe0c7785 | openshift-console/console                      | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.203.144 | ACTIVE              | octavia  |
| d7ff2739-d11a-4bc1-a3b5-524f8375f262 | openshift-monitoring/grafana                   | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.25.98   | ACTIVE              | octavia  |
| aee71255-c8cd-4761-9686-0ab5f6261c83 | openshift-monitoring/prometheus-k8s            | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.204.230 | ACTIVE              | octavia  |
| 073a1b83-49ba-49f0-a1ed-fc32627b0094 | openshift-monitoring/alertmanager-main         | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.222.240 | ACTIVE              | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+

2.5 hours later...
[heat-admin@controller-0 ~]$ date; free -h
mié sep 19 18:55:49 UTC 2018
              total        used        free      shared  buff/cache   available
Mem:            27G         24G        281M         69M        2,6G        1,9G
Swap:            0B          0B          0B

   PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
928393 42437    20   0   15,8g 395200  2800 S  46,7  1,4  51:34.89 octavia-health-
928401 42437    20   0   14,8g 355512  2792 S  46,7  1,2  45:56.55 octavia-health-
928395 42437    20   0   14,4g 337816  2796 S  46,1  1,2  41:41.35 octavia-health-
928391 42437    20   0   14,0g 323420  2756 S  45,4  1,1  35:10.66 octavia-health-
928394 42437    20   0   15,1g 367564  2760 S  42,8  1,3  51:27.73 octavia-health-
928397 42437    20   0   14,5g 342540  2760 S  42,8  1,2  43:09.24 octavia-health-
928396 42437    20   0   13,7g 311404  2756 S  37,2  1,1  34:35.23 octavia-health-
928403 42437    20   0   14,6g 349512  2748 S  36,2  1,2  45:46.38 octavia-health-
928398 42437    20   0   14,4g 342840  2760 S  35,9  1,2  43:41.51 octavia-health-
928400 42437    20   0   14,3g 337036  2760 S  35,2  1,2  42:10.24 octavia-health-
928402 42437    20   0   14,1g 329712  2760 S  35,2  1,2  37:37.48 octavia-health-
928404 42437    20   0   13,5g 304624  2752 S  35,2  1,1  30:42.82 octavia-health-
928405 42437    20   0   14,4g 341424  2792 S  34,9  1,2  42:48.73 octavia-health-
928406 42437    20   0   14,1g 327092  2760 S  32,9  1,1  38:02.96 octavia-health-
928390 42437    20   0   13,6g 308660  2756 S  30,6  1,1  33:47.69 octavia-health-
928399 42437    20   0   13,2g 294656  2788 S  28,6  1,0  29:15.33 octavia-health-
111032 42425    20   0  760196 103812  7032 S  19,4  0,4  11:01.42 httpd
 44920 42439    20   0   44,8g   1,6g  4032 S  18,1  5,8 419:46.94 beam.smp
112802 42435    20   0  386044 151400  2268 S  13,2  0,5   9:06.11 neutron-server

[heat-admin@controller-0 ~]$ cat /var/log/containers/octavia/health-manager.log
2018-09-19 14:42:18.662 1 INFO octavia.cmd.health_manager [-] Health Manager exiting due to signal
2018-09-19 14:42:19.514 23 INFO octavia.cmd.health_manager [-] Waiting for executor to shutdown...
2018-09-19 14:42:19.834 23 INFO octavia.cmd.health_manager [-] Executor shutdown finished.
2018-09-19 14:42:21.950 1 INFO octavia.common.config [-] Logging enabled!
2018-09-19 14:42:21.951 1 INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 2.0.2.dev42
2018-09-19 14:42:21.953 1 INFO octavia.cmd.health_manager [-] Health Manager listener process starts:
2018-09-19 14:42:21.957 1 INFO octavia.cmd.health_manager [-] Health manager check process starts:
2018-09-19 14:42:21.959 23 INFO octavia.amphorae.drivers.health.heartbeat_udp [-] attempting to listen on 172.24.0.12 port 5555
2018-09-19 15:03:23.329 30 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Listener be227f7b-505f-4beb-bbfd-4017f5f26350 reported status of DOWN
2018-09-19 15:38:52.833 38 WARNING octavia.common.stats [-] Listener Statistics for Listener 7f2a94d9-9eba-422a-ac33-02024e9c6922 was not found
2018-09-19 15:39:02.773 32 WARNING octavia.common.stats [-] Listener Statistics for Listener a6c968c0-3176-4949-918d-5694fadce870 was not found
2018-09-19 15:39:02.886 32 WARNING octavia.common.stats [-] Listener Statistics for Listener a6c968c0-3176-4949-918d-5694fadce870 was not found
2018-09-19 15:52:07.487 42 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Listener c3d50df0-79b9-4867-8144-b9e8c2fe7d2e reported status of DOWN

Find attached the Octavia worker logs.
[heat-admin@controller-0 ~]$ sudo docker ps | grep octavia
c810cc8b73d7  192.168.24.1:8787/rhosp13/openstack-octavia-health-manager:2018-09-13.1  "kolla_start"  2 days ago  Up 18 hours (healthy)    octavia_health_manager
9b188e45f5a5  192.168.24.1:8787/rhosp13/openstack-octavia-api:2018-09-13.1             "kolla_start"  2 days ago  Up 2 days (healthy)      octavia_api
db0bccae40b6  192.168.24.1:8787/rhosp13/openstack-octavia-housekeeping:2018-09-13.1    "kolla_start"  2 days ago  Up 2 days (healthy)      octavia_housekeeping
29ad208ab150  192.168.24.1:8787/rhosp13/openstack-octavia-worker:2018-09-13.1          "kolla_start"  2 days ago  Up 18 hours (unhealthy)  octavia_worker

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         24G        476M         69M        2,3G        1,8G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker restart c810cc8b73d7
c810cc8b73d7

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         17G        8,0G         70M        2,1G        9,3G
Swap:            0B          0B          0B

Expected results:
The free memory on the controller should remain stable, or at least not decrease this much.
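For quick triage, the checks above boil down to a few commands on the controller node. This is only a sketch; the container ID is deployment-specific and must be taken from the docker ps output:

  free -h                                       # overall controller memory
  sudo docker ps | grep octavia                 # locate the octavia_health_manager container
  sudo docker stats --no-stream <container-id>  # memory usage of that container
  sudo docker restart <container-id>            # temporary workaround: releases the leaked memory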
1) How many load balancers do you create?

2) Is it the case that you simply create X load balancers and then wait some time? How much time? During that time, can you confirm that you do *not* invoke the Octavia API?
(In reply to Assaf Muller from comment #1)
> 1) How many load balancers do you create?

In this specific case openshift-ansible 3.11 deployed 10 load balancers (see the 'openstack loadbalancer list' output in the BZ description). It depends on the configuration and the OCP version, but I've also seen the issue with only one load balancer, always deploying with the openshift-ansible playbooks for OpenStack.

> 2) Is it the case that you simply create X load balancers and then wait some
> time? How much time? During that time can you confirm that you do *not*
> invoke the Octavia API?

The openshift-ansible installer itself deploys all the resources for the OpenShift cluster, including VMs, networks, routers, etc., plus one load balancer per OpenShift service. In this specific case it took 2.5 hours to leave the controller without memory; I guess it was that quick due to the high number of load balancers. With fewer load balancers I usually notice the performance degradation the day after the installation. During that time I usually run 'openstack loadbalancer list' several times, but I don't know how many times the Octavia API is invoked internally.
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772 ?

Could you please set the [health_manager]/event_streamer_driver option to noop_event_streamer and retry? Thanks.
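For reference, a minimal sketch of that change in the health manager's octavia.conf (the exact path of the rendered config inside/under the octavia_health_manager container depends on the deployment):

  [health_manager]
  # The no-op driver disables the event streamer entirely
  event_streamer_driver = noop_event_streamer

followed by a restart of the octavia_health_manager container so the new setting is picked up.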
(In reply to Carlos Goncalves from comment #3)
> Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772 ?
>
> Could you please set [health_manager]/event_streamer_driver option to
> noop_event_streamer and retry? Thanks.

The OpenShift cluster has been running for 20 hours with event_streamer_driver set to noop_event_streamer, and the free memory in the controller has decreased "only" 1.2 GB, which is not much compared to the previous results. I'm pretty sure that with the previous event_streamer_driver value the controller would have run out of memory by now.

Let me know what additional info I can add for debugging the issue, or if you would like to connect to the cluster.
(In reply to Jon Uriarte from comment #4)
> The Openshift cluster has been running for 20 hours, with the
> event_streamer_driver option set to noop_event_streamer, and the memory in
> the controller has decreased "only" 1.2GB, which is not too much compared
> with the previous results.
>
> Let me know which additional info can I add for debugging the issue, or if
> you would like to connect to the cluster.

Hey,

Does the memory consumption keep growing with the proposal Carlos provided?

Nir
(In reply to Nir Magnezi from comment #5)
> Does the memory consumption keep growing with the proposal Carlos provided?

Hi Nir,

The free memory in the controller has decreased 2.7 GB since the OpenShift cluster was deployed 44 hours ago. It's actually an improvement compared to the consumption with the previous event_streamer_driver value. I think it's still consuming more memory each day, but I would need to see whether it stabilizes or keeps increasing; the memory usage increase could now be coming from a different component.

Right after the deployment:

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        4,9G         69M        2,1G        5,9G
Swap:            0B          0B          0B

20 hours later (1.2 GB less):

Mem:            27G         20G        3,7G         69M        3,0G        5,9G
Swap:            0B          0B          0B

44 hours later (2.7 GB less):

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,2G         69M        4,4G        5,9G
Swap:            0B          0B          0B

docker stats for the health-manager container says it's consuming 1.087 GiB:

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4
CONTAINER      CPU %    MEM USAGE / LIMIT       MEM %    NET I/O     BLOCK I/O   PIDS
3ca9ad6b14b4   21.12%   1.087 GiB / 27.32 GiB   3.98%    0 B / 0 B   0 B / 0 B   38

I hadn't checked these stats before, but I will add them in the next updates.

At this point, restarting the health-manager container releases 0.6 GB in the controller:

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,8G         69M        4,5G        6,5G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4
CONTAINER      CPU %    MEM USAGE / LIMIT       MEM %    NET I/O     BLOCK I/O   PIDS
3ca9ad6b14b4   10.06%   494.3 MiB / 27.32 GiB   1.77%    0 B / 0 B   0 B / 0 B   38

After 10 minutes it goes back to the previous values:

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,2G         69M        4,5G        5,9G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4
CONTAINER      CPU %    MEM USAGE / LIMIT       MEM %    NET I/O     BLOCK I/O   PIDS
3ca9ad6b14b4   6.41%    1.032 GiB / 27.32 GiB   3.78%    0 B / 0 B   0 B / 0 B   38

Is this memory usage the expected one? I would need to check the consumption over the coming days to confirm that it doesn't keep growing.
Adding more info: I checked the openstack-octavia-health-manager container memory usage with different event_streamer_driver values.

=======================================
Check time: 3 hours
event_streamer_driver: queue_event_streamer

TIME       CPU %*     MEM USAGE*   FREE MEM**
--------   -------    ----------   ----------
11:48:06    38.64%      686 MiB    4,0G
12:20:36    73.44%    1.597 GiB    3,0G
12:49:59   148.79%    2.079 GiB    2,2G
13:20:24   219.19%    2.538 GiB    1,5G
13:50:15   333.64%    2.991 GiB    847M
14:24:51   394.27%    3.511 GiB    316M
14:40:16   406.43%    3.755 GiB    304M

The container memory consumption increases until it leaves the controller without memory. If the container is restarted it releases almost 4 GB of memory.

=======================================
Check time: 2.5 days
event_streamer_driver: noop_event_streamer

TIME       CPU %*     MEM USAGE*   FREE MEM**
--------   -------    ----------   ----------
day 0       21.40%    771.6 MiB    4,6G
day 0.5      0.50%    1.093 GiB    3,8G
day 1       12.66%    1.095 GiB    3,3G
day 1.5     16.52%    1.098 GiB    2,9G
day 2       47.48%    1.096 GiB    4,0G
day 2.5     12.19%    1.096 GiB    3,5G

The container memory consumption no longer increases; it remains constant at around 1 GB. The free memory in the controller still decreases, though more slowly than before, so it seems a different component is now taking that memory. During the test some memory was released in the controller (the free memory increases on day 2), but that is unrelated to this test.

=======================================
* CPU % and MEM USAGE are taken with the 'docker stats --no-stream <container>' command
** FREE MEM is taken with the 'free -h' command
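A minimal sketch of how such sampling can be scripted from the two commands above (hypothetical helper; the container name/ID and sampling interval are deployment-specific):

  #!/bin/bash
  # Sample controller free memory and health manager container usage every 30 minutes.
  CONTAINER=octavia_health_manager   # or the container ID from 'sudo docker ps'
  while true; do
      date
      free -h
      sudo docker stats --no-stream "$CONTAINER"
      sleep 1800
  done >> /tmp/octavia-hm-usage.log 2>&1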
Jon, thanks a lot for investigating and sharing concise results!

So, given that memory usage remains constant without the event streamer, which we knew to be problematic (CPU and now memory), are you satisfied with the results after setting the event streamer to noop? If so, we can close this RHBZ as a duplicate of rhbz #1618772.
(In reply to Carlos Goncalves from comment #9)
> So, given that memory usage remains constant without the event streamer
> which we knew to be problematic (CPU and now memory), are you satisfied with
> the results after setting event streamer to noop? If so, we can close this
> rhbz as duplicate of rhbz #1618772.

It's fine with me, as long as the event streamer is not necessary. If you think it doesn't affect anything, it's OK to remove it.

By the way, is this parameter configurable via TripleO, or are you going to change the default value? If so, this BZ could be renamed and used for that.
It is configurable now via THT, yes.

OSP 13: https://bugzilla.redhat.com/show_bug.cgi?id=1624037
OSP 14: https://bugzilla.redhat.com/show_bug.cgi?id=1618772
Upstream: https://review.openstack.org/#/q/e066722d27672c6d781c71b6a07acef50b9821a6

*** This bug has been marked as a duplicate of bug 1618772 ***
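For completeness, a sketch of how this could look in a custom director environment file. The parameter name OctaviaEventStreamerDriver is an assumption here; check the tripleo-heat-templates version delivered by the BZs above for the exact name:

  # event-streamer.yaml (hypothetical custom environment file)
  parameter_defaults:
    # Assumed parameter name -- verify against your tripleo-heat-templates version
    OctaviaEventStreamerDriver: noop_event_streamer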