Bug 1631234 - Octavia health manager container high memory consumption
Summary: Octavia health manager container high memory consumption
Keywords:
Status: CLOSED DUPLICATE of bug 1618772
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Assaf Muller
QA Contact: Alexander Stafeyev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-20 09:25 UTC by Jon Uriarte
Modified: 2025-01-27 13:46 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-08 16:10:45 UTC
Target Upstream Version:
Embargoed:


Attachments
Octavia worker logs (30.91 KB, text/plain)
2018-09-20 09:25 UTC, Jon Uriarte


Links
Red Hat Issue Tracker OSP-13660 (last updated 2022-03-13 15:45:15 UTC)

Description Jon Uriarte 2018-09-20 09:25:46 UTC
Created attachment 1485083 [details]
Octavia worker logs

Description of problem:

On all OpenShift on OpenStack deployments, after some time the controller node runs out of RAM and the OpenShift cluster becomes unmanageable.
At that point I've noticed that restarting the Octavia health manager container releases several gigabytes of memory and makes the OpenShift cluster manageable again.


Version-Release number of selected component (if applicable):

OSP 13 puddle 2018-09-13.1

rhosp13/openstack-octavia-health-manager:2018-09-13.1

openstack-octavia-common-2.0.1-6.d137eaagit.el7ost.noarch
openstack-octavia-health-manager-2.0.1-6.d137eaagit.el7ost.noarch
puppet-octavia-12.4.0-2.el7ost.noarch
python-octavia-2.0.1-6.d137eaagit.el7ost.noarch

How reproducible: Always


Steps to Reproduce:
1. Install OSP 13 (with Octavia)
2. Install OCP (3.11 in this case, also seen with 3.10)
3. Monitor free memory on the controller node (see the sketch below)
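
For reference, a minimal monitoring sketch (it assumes the health manager container is named octavia_health_manager, as shown in the 'docker ps' output further down; the 10-minute interval is arbitrary):

while true; do
    date
    free -h
    sudo docker stats --no-stream octavia_health_manager
    sleep 600
done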

Actual results:

[heat-admin@controller-0 ~]$ date; free -h
mié sep 19 16:12:57 UTC 2018
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        3,8G         69M        3,3G        6,2G
Swap:            0B          0B          0B

(shiftstack) [cloud-user@ansible-host-0 ~]$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id                                   | name                                           | project_id                       | vip_address    | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| cc4a377e-c4dd-4010-8974-5e999fd595a6 | openshift-cluster-router_lb-ca37ddhjjrvx       | 3a88df4f83fa44ffada1c3b03e17327b | 192.168.99.12  | ACTIVE              | octavia  |
| e77f6b07-2c8e-4562-a383-898da945c9b7 | openshift-ansible-openshift.example.com-api-lb | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.0.1     | ACTIVE              | octavia  |
| 986a3e27-1e6a-4d32-8776-9cbe3935a278 | default/router                                 | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.7.21    | ACTIVE              | octavia  |
| 05c98423-fa11-489e-8760-6d6e7b18c77a | default/docker-registry                        | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.103.68  | ACTIVE              | octavia  |
| 6f34d87b-966a-4bee-bc05-9e7953d50250 | default/registry-console                       | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.86.138  | ACTIVE              | octavia  |
| c1a75615-86d7-43d4-9f38-2f24df9857a8 | openshift-web-console/webconsole               | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.211.2   | ACTIVE              | octavia  |
| 065a9324-042a-48a3-bf3b-658efe0c7785 | openshift-console/console                      | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.203.144 | ACTIVE              | octavia  |
| d7ff2739-d11a-4bc1-a3b5-524f8375f262 | openshift-monitoring/grafana                   | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.25.98   | ACTIVE              | octavia  |
| aee71255-c8cd-4761-9686-0ab5f6261c83 | openshift-monitoring/prometheus-k8s            | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.204.230 | ACTIVE              | octavia  |
| 073a1b83-49ba-49f0-a1ed-fc32627b0094 | openshift-monitoring/alertmanager-main         | 3a88df4f83fa44ffada1c3b03e17327b | 172.30.222.240 | ACTIVE              | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+

2.5 hours later...

[heat-admin@controller-0 ~]$ date; free -h
mié sep 19 18:55:49 UTC 2018
              total        used        free      shared  buff/cache   available
Mem:            27G         24G        281M         69M        2,6G        1,9G
Swap:            0B          0B          0B


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                     
 928393 42437     20   0   15,8g 395200   2800 S  46,7  1,4  51:34.89 octavia-health-                                                                                                                             
 928401 42437     20   0   14,8g 355512   2792 S  46,7  1,2  45:56.55 octavia-health-                                                                                                                             
 928395 42437     20   0   14,4g 337816   2796 S  46,1  1,2  41:41.35 octavia-health-                                                                                                                             
 928391 42437     20   0   14,0g 323420   2756 S  45,4  1,1  35:10.66 octavia-health-                                                                                                                             
 928394 42437     20   0   15,1g 367564   2760 S  42,8  1,3  51:27.73 octavia-health-                                                                                                                             
 928397 42437     20   0   14,5g 342540   2760 S  42,8  1,2  43:09.24 octavia-health-                                                                                                                             
 928396 42437     20   0   13,7g 311404   2756 S  37,2  1,1  34:35.23 octavia-health-                                                                                                                             
 928403 42437     20   0   14,6g 349512   2748 S  36,2  1,2  45:46.38 octavia-health-                                                                                                                             
 928398 42437     20   0   14,4g 342840   2760 S  35,9  1,2  43:41.51 octavia-health-                                                                                                                             
 928400 42437     20   0   14,3g 337036   2760 S  35,2  1,2  42:10.24 octavia-health-                                                                                                                             
 928402 42437     20   0   14,1g 329712   2760 S  35,2  1,2  37:37.48 octavia-health-                                                                                                                             
 928404 42437     20   0   13,5g 304624   2752 S  35,2  1,1  30:42.82 octavia-health-                                                                                                                             
 928405 42437     20   0   14,4g 341424   2792 S  34,9  1,2  42:48.73 octavia-health-                                                                                                                             
 928406 42437     20   0   14,1g 327092   2760 S  32,9  1,1  38:02.96 octavia-health-                                                                                                                             
 928390 42437     20   0   13,6g 308660   2756 S  30,6  1,1  33:47.69 octavia-health-                                                                                                                             
 928399 42437     20   0   13,2g 294656   2788 S  28,6  1,0  29:15.33 octavia-health-                                                                                                                             
 111032 42425     20   0  760196 103812   7032 S  19,4  0,4  11:01.42 httpd                                                                                                                                       
  44920 42439     20   0   44,8g   1,6g   4032 S  18,1  5,8 419:46.94 beam.smp                                                                                                                                    
 112802 42435     20   0  386044 151400   2268 S  13,2  0,5   9:06.11 neutron-server                  


[heat-admin@controller-0 ~]$ cat /var/log/containers/octavia/health-manager.log
2018-09-19 14:42:18.662 1 INFO octavia.cmd.health_manager [-] Health Manager exiting due to signal
2018-09-19 14:42:19.514 23 INFO octavia.cmd.health_manager [-] Waiting for executor to shutdown...
2018-09-19 14:42:19.834 23 INFO octavia.cmd.health_manager [-] Executor shutdown finished.
2018-09-19 14:42:21.950 1 INFO octavia.common.config [-] Logging enabled!
2018-09-19 14:42:21.951 1 INFO octavia.common.config [-] /usr/bin/octavia-health-manager version 2.0.2.dev42
2018-09-19 14:42:21.953 1 INFO octavia.cmd.health_manager [-] Health Manager listener process starts:
2018-09-19 14:42:21.957 1 INFO octavia.cmd.health_manager [-] Health manager check process starts:
2018-09-19 14:42:21.959 23 INFO octavia.amphorae.drivers.health.heartbeat_udp [-] attempting to listen on 172.24.0.12 port 5555
2018-09-19 15:03:23.329 30 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Listener be227f7b-505f-4beb-bbfd-4017f5f26350 reported status of DOWN
2018-09-19 15:38:52.833 38 WARNING octavia.common.stats [-] Listener Statistics for Listener 7f2a94d9-9eba-422a-ac33-02024e9c6922 was not found
2018-09-19 15:39:02.773 32 WARNING octavia.common.stats [-] Listener Statistics for Listener a6c968c0-3176-4949-918d-5694fadce870 was not found
2018-09-19 15:39:02.886 32 WARNING octavia.common.stats [-] Listener Statistics for Listener a6c968c0-3176-4949-918d-5694fadce870 was not found
2018-09-19 15:52:07.487 42 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Listener c3d50df0-79b9-4867-8144-b9e8c2fe7d2e reported status of DOWN

The Octavia worker logs are attached.

[heat-admin@controller-0 ~]$ sudo docker ps | grep octavia                                                                                                                                                         
c810cc8b73d7        192.168.24.1:8787/rhosp13/openstack-octavia-health-manager:2018-09-13.1      "kolla_start"            2 days ago          Up 18 hours (healthy)                         octavia_health_manager
9b188e45f5a5        192.168.24.1:8787/rhosp13/openstack-octavia-api:2018-09-13.1                 "kolla_start"            2 days ago          Up 2 days (healthy)                           octavia_api
db0bccae40b6        192.168.24.1:8787/rhosp13/openstack-octavia-housekeeping:2018-09-13.1        "kolla_start"            2 days ago          Up 2 days (healthy)                           octavia_housekeeping
29ad208ab150        192.168.24.1:8787/rhosp13/openstack-octavia-worker:2018-09-13.1              "kolla_start"            2 days ago          Up 18 hours (unhealthy)                       octavia_worker

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         24G        476M         69M        2,3G        1,8G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker restart c810cc8b73d7                                                                                                                                                      
c810cc8b73d7

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         17G        8,0G         70M        2,1G        9,3G
Swap:            0B          0B          0B



Expected results: Free memory on the controller should remain stable rather than decrease continuously.

Comment 1 Assaf Muller 2018-09-20 11:36:08 UTC
1) How many load balancers do you create?
2) Is it the case that you simply create X load balancers and then wait some time? How much time? During that time can you confirm that you do *not* invoke the Octavia API?

Comment 2 Jon Uriarte 2018-09-20 14:09:39 UTC
(In reply to Assaf Muller from comment #1)
> 1) How many load balancers do you create?

In this specific case openshift-ansible 3.11 deployed 10 load balancers (see the 'openstack loadbalancer list' output in the BZ description). It depends on the configuration and OCP version, but I've also seen the issue with only one load balancer, always using the openshift-ansible playbooks for OpenStack.

> 2) Is it the case that you simply create X load balancers and then wait some
> time? How much time? During that time can you confirm that you do *not*
> invoke the Octavia API?

The openshift-ansible installer itself deploys all the resources for the OpenShift cluster, including VMs, networks, routers, etc., and one load balancer per OpenShift service.

In this specific case it took 2.5 hours to exhaust the controller's memory; I guess it was so quick due to the high number of load balancers. With fewer load balancers I usually notice the performance degradation the day after the installation.

During that time I usually run the 'openstack loadbalancer list' command several times, but I don't know how many times the Octavia API is invoked internally.

Comment 3 Carlos Goncalves 2018-10-01 10:58:32 UTC
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772  ?

Could you please set the [health_manager]/event_streamer_driver option to noop_event_streamer and retry? Thanks.
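
For reference, a minimal sketch of how this could be applied on an OSP 13 controller (the config path assumes the standard containerized layout; crudini can be replaced by editing the [health_manager] section by hand, and the change will be overwritten on the next overcloud deploy):

sudo crudini --set \
    /var/lib/config-data/puppet-generated/octavia/etc/octavia/octavia.conf \
    health_manager event_streamer_driver noop_event_streamer
sudo docker restart octavia_health_manager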

Comment 4 Jon Uriarte 2018-10-02 10:58:00 UTC
(In reply to Carlos Goncalves from comment #3)
> Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772  ?
> 
> Could you please set [health_manager]/event_streamer_driver option to
> noop_event_streamer and retry? Thanks.

The OpenShift cluster has been running for 20 hours with the event_streamer_driver option set to noop_event_streamer, and free memory on the controller has decreased "only" 1.2 GB, which is not much compared with the previous results.
I'm pretty sure that with the previous event_streamer_driver value the controller would have run out of memory by now.

Let me know what additional info I can provide for debugging the issue, or if you would like to connect to the cluster.

Comment 5 Nir Magnezi 2018-10-03 13:55:27 UTC
(In reply to Jon Uriarte from comment #4)
> (In reply to Carlos Goncalves from comment #3)
> > Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772  ?
> > 
> > Could you please set [health_manager]/event_streamer_driver option to
> > noop_event_streamer and retry? Thanks.
> 
> The Openshift cluster has been running for 20 hours, with the
> event_streamer_driver option set to noop_event_streamer, and the memory in
> the controller has decreased "only" 1.2GB, which is not too much compared
> with the previous results.
> I'm pretty sure that with the previous event_streamer_driver value the
> controller would probably have run out of memory.
> 
> Let me know which additional info can I add for debugging the issue, or if
> you would like to connect to the cluster.

Hey,

Does the memory consumption keep growing with the proposal Carlos provided?

Nir

Comment 6 Jon Uriarte 2018-10-03 15:16:33 UTC
(In reply to Nir Magnezi from comment #5)
> (In reply to Jon Uriarte from comment #4)
> > (In reply to Carlos Goncalves from comment #3)
> > > Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1618772  ?
> > > 
> > > Could you please set [health_manager]/event_streamer_driver option to
> > > noop_event_streamer and retry? Thanks.
> > 
> > The Openshift cluster has been running for 20 hours, with the
> > event_streamer_driver option set to noop_event_streamer, and the memory in
> > the controller has decreased "only" 1.2GB, which is not too much compared
> > with the previous results.
> > I'm pretty sure that with the previous event_streamer_driver value the
> > controller would probably have run out of memory.
> > 
> > Let me know which additional info can I add for debugging the issue, or if
> > you would like to connect to the cluster.
> 
> Hey,
> 
> Does the memory consumption keep growing with the proposal Carlos provided?
> 
> Nir

Hi Nir,

Free memory on the controller has decreased by 2.7 GB since the OpenShift cluster was deployed 44 hours ago.
That is actually an improvement compared to the consumption with the previous event_streamer_driver value.
I think memory consumption is still growing each day, but I would need to see whether it stabilizes or keeps
increasing; the increase could now be coming from a different component.

Right after the deployment:
[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        4,9G         69M        2,1G        5,9G
Swap:            0B          0B          0B

20 hours later (1,2G less):
Mem:            27G         20G        3,7G         69M        3,0G        5,9G
Swap:            0B          0B          0B

44 hours later (2,7G less):
[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,2G         69M        4,4G        5,9G
Swap:            0B          0B          0B

'docker stats' for the health-manager container says it is consuming 1.087 GiB:

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4                                                                                                                                            
CONTAINER           CPU %               MEM USAGE / LIMIT       MEM %               NET I/O             BLOCK I/O           PIDS
3ca9ad6b14b4        21.12%              1.087 GiB / 27.32 GiB   3.98%               0 B / 0 B           0 B / 0 B           38

I hadn't checked these stats before, but I will include them in the next updates.

At this point, restarting the health-manager container releases 0.6 GB on the controller:

[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,8G         69M        4,5G        6,5G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
3ca9ad6b14b4        10.06%              494.3 MiB / 27.32 GiB   1.77%               0 B / 0 B           0 B / 0 B           38

After 10 minutes it returns to the previous values:
[heat-admin@controller-0 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            27G         20G        2,2G         69M        4,5G        5,9G
Swap:            0B          0B          0B

[heat-admin@controller-0 ~]$ sudo docker stats --no-stream 3ca9ad6b14b4
CONTAINER           CPU %               MEM USAGE / LIMIT       MEM %               NET I/O             BLOCK I/O           PIDS
3ca9ad6b14b4        6.41%               1.032 GiB / 27.32 GiB   3.78%               0 B / 0 B           0 B / 0 B           38

Is this the expected memory usage?

I would need to check the consumption during the coming days to confirm that it doesn't keep growing.

Comment 8 Jon Uriarte 2018-10-08 10:13:25 UTC
Adding more info:

I checked the openstack-octavia-health-manager container memory usage with different event_streamer_driver values.

=======================================

Check time: 3 hours
event_streamer_driver: queue_event_streamer

TIME      CPU %*   MEM USAGE*  FREE MEM**
--------  -------  --------    --------
11:48:06  38.64%   686 MiB     4,0G
12:20:36  73.44%   1.597 GiB   3,0G
12:49:59  148.79%  2.079 GiB   2,2G
13:20:24  219.19%  2.538 GiB   1,5G
13:50:15  333.64%  2.991 GiB   847M
14:24:51  394.27%  3.511 GiB   316M
14:40:16  406.43%  3.755 GiB   304M

The container's memory consumption keeps increasing until the controller runs out of memory.
Restarting the container releases almost 4 GB of memory.

=======================================

Check time: 2.5 days
event_streamer_driver: noop_event_streamer

TIME      CPU %*   MEM USAGE*  FREE MEM**
--------  -------  --------    --------
day 0     21.40%   771.6 MiB   4,6G
day 0.5   0.50%    1.093 GiB   3,8G
day 1     12.66%   1.095 GiB   3,3G
day 1.5   16.52%   1.098 GiB   2,9G
day 2     47.48%   1.096 GiB   4,0G
day 2.5   12.19%   1.096 GiB   3,5G

The container's memory consumption no longer increases; it remains constant at around 1 GB. Free memory on the controller still decreases, although at a lower rate than before, so it seems a different component is taking that memory. During the test some memory was released on the controller (free memory increases on day 2), but that is unrelated to this test.

=======================================

* CPU % and MEM USAGE are taken with the 'docker stats --no-stream <container>' command
** FREE MEM is taken with the 'free -h' command
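
For reference, a sketch of how rows like the above could be sampled in one loop (the container ID and the 30-minute interval are examples only):

CID=3ca9ad6b14b4
while true; do
    STATS=$(sudo docker stats --no-stream --format '{{.CPUPerc}}  {{.MemUsage}}' $CID)
    FREEMEM=$(free -h | awk '/^Mem:/ {print $4}')
    echo "$(date +%H:%M:%S)  $STATS  $FREEMEM"
    sleep 1800
done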

Comment 9 Carlos Goncalves 2018-10-08 10:26:12 UTC
Jon, thanks a lot for investigating and sharing concise results!

So, given that memory usage remains constant without the event streamer, which we already knew to be problematic (first CPU and now memory), are you satisfied with the results after setting the event streamer to noop? If so, we can close this RHBZ as a duplicate of RHBZ #1618772.

Comment 10 Jon Uriarte 2018-10-08 16:00:09 UTC
(In reply to Carlos Goncalves from comment #9)
> Jon, thanks a lot for investigating and sharing concise results!
> 
> So, given that memory usage remains constant without the event streamer
> which we knew to be problematic (CPU and now memory), are you satisfied with
> the results after setting event streamer to noop? If so, we can close this
> rhbz as duplicate of rhbz #1618772.

It's fine with me as long as the event streamer is not necessary. If you think disabling it has no impact, it's OK to remove it.
By the way, is this parameter configurable via TripleO, or are you going to change the default value?

If so, this BZ could be re-named and used for that.

Comment 11 Carlos Goncalves 2018-10-08 16:10:45 UTC
It is configurable now via THT, yes.

OSP 13: https://bugzilla.redhat.com/show_bug.cgi?id=1624037
OSP 14: https://bugzilla.redhat.com/show_bug.cgi?id=1618772
Upstream: https://review.openstack.org/#/q/e066722d27672c6d781c71b6a07acef50b9821a6
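
For reference, a hypothetical sketch of wiring this through a TripleO environment file; the octavia::config hiera key below is an assumption, not taken from the linked changes, so verify it against the THT/puppet-octavia code before use:

cat > octavia-noop-event-streamer.yaml <<'EOF'
parameter_defaults:
  ControllerExtraConfig:
    octavia::config::octavia_config:
      health_manager/event_streamer_driver:
        value: noop_event_streamer
EOF
# then: openstack overcloud deploy ... -e octavia-noop-event-streamer.yaml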

*** This bug has been marked as a duplicate of bug 1618772 ***

