Bug 1371985 - Possible memory leak in openshift master process during reliability long run
Summary: Possible memory leak in openshift master process during reliability long run
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Paul Morie
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-31 15:29 UTC by Vikas Laad
Modified: 2017-06-05 16:41 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-26 18:01:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Memory Utilization Graph (523.61 KB, image/png) - 2016-08-31 15:29 UTC, Vikas Laad
pprof heap (2.79 KB, text/plain) - 2016-08-31 15:30 UTC, Vikas Laad
pprof heap for another time (2.87 KB, text/plain) - 2016-08-31 15:30 UTC, Vikas Laad
mem growth (29.50 KB, image/png) - 2016-09-08 13:45 UTC, Vikas Laad
Openshift master memory (30.35 KB, image/png) - 2016-10-03 13:55 UTC, Vikas Laad


Links
Red Hat Bugzilla 1458238 (low, CLOSED): OCP Master APIs are using an excessive amount of memory in containerized env (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1458238

Description Vikas Laad 2016-08-31 15:29:11 UTC
Created attachment 1196444 [details]
Memory Utilization Graph

Description of problem:
Memory consumption on the master goes up to 3G and is still increasing after a few days of running reliability tests. Please look at the following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1323733#c37

Even after setting the following, memory consumption is still increasing:
    deserialization-cache-size:
    - "1000"


Version-Release number of selected component (if applicable):
openshift v3.3.0.22
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

How reproducible:
After running reliability tests for a few days; in this case we ran them for 10 days.

Steps to Reproduce:
1. Create an OpenShift cluster in AWS
2. Start tests which create/build/redeploy/delete projects and apps
3. Let them run for a few days and watch memory consumption

Actual results:
Memory consumption goes up and keeps growing.

Expected results:
After some time it should stop growing.

Additional info:
Please find attached the graph from CloudWatch; the Y axis is % of total RAM. Total RAM on the master was 16G. The sudden drop in memory shows when deserialization-cache-size was added to the master config and the master process was restarted.

Also find attached the pprof heap profiles taken last night and this morning.
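
A rough sketch of how such heap profiles can be captured, assuming the master was started with OPENSHIFT_PROFILE=web so that net/http/pprof is served on 127.0.0.1:6060, and that the binary is /usr/bin/openshift (both are assumptions, not taken from this environment):

    # dump a text heap profile from the running master; the port, endpoint,
    # and binary path are assumptions for this sketch
    go tool pprof -text /usr/bin/openshift http://127.0.0.1:6060/debug/pprof/heap > heap-$(date +%F-%H%M).txt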

Comment 1 Vikas Laad 2016-08-31 15:30:13 UTC
Created attachment 1196445 [details]
pprof heap

Comment 2 Vikas Laad 2016-08-31 15:30:45 UTC
Created attachment 1196446 [details]
pprof heap for another time

Comment 3 Jeremy Eder 2016-08-31 16:23:02 UTC
Vikas -- if you use stress to allocate a bunch of RAM and put the system under (gentle) memory pressure, does the RSS of the master go down?

https://copr-be.cloud.fedoraproject.org/results/ndokos/pbench/epel-7-x86_64/00182790-pbench-stress/
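
For instance, something along these lines; the worker count, allocation size, and process name below are illustrative assumptions:

    # create ~8G of (gentle) memory pressure for 5 minutes,
    # then check whether the master's RSS shrank
    stress --vm 4 --vm-bytes 2G --timeout 300s
    ps -o pid,rss,comm -C openshift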

Comment 6 Derek Carr 2016-09-01 18:55:23 UTC
After further analysis, we are not seeing continued RSS growth. RSS usage stabilized, and in the future it can be tuned further with etcd caching for Origin-specific objects when we revisit https://github.com/openshift/origin/pull/10719.

Comment 7 Vikas Laad 2016-09-08 13:45:14 UTC
Created attachment 1199116 [details]
mem growth

I am still seeing the memory growth; the attached graph shows memory consumed by only the openshift-master process. Samples were taken every 15 minutes over the last few days.
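
A loop along these lines is enough to reproduce this kind of sampling (a sketch only; the process name and log file are assumptions, not the exact tooling used here):

    # record the openshift master's RSS (in KB) every 15 minutes
    while true; do
        echo "$(date -u +%FT%TZ) $(ps -o rss= -C openshift)" >> master-rss.log
        sleep 900
    done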

Comment 8 Paul Morie 2016-09-08 20:36:21 UTC
This master is running with etcd embedded -- can we do this with an external etcd?  It will be far, far easier to diagnose that way.
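
In master-config.yaml terms that roughly means removing the embedded etcdConfig stanza and pointing etcdClientInfo at the external cluster; a sketch with placeholder hostnames and certificate paths:

    etcdClientInfo:
      ca: ca.crt
      certFile: master.etcd-client.crt
      keyFile: master.etcd-client.key
      urls:
      - https://etcd.example.com:2379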

Comment 9 Vikas Laad 2016-09-09 15:25:08 UTC
Started another run with external etcd; I will update the bug again with some data.

Comment 11 Vikas Laad 2016-09-20 15:44:37 UTC
I had another cluster with the same issue; after stopping the tests and deleting all the projects, memory usage on the nodes came down. But the master node and master process were still taking up the same amount of memory (on this cluster it was at 2.3G). Then I ran the pbench stress mentioned in comment #3; that did not make any difference. Even forcing OOM on the master using stress did not make the RSS of the master go down.

Comment 12 Vikas Laad 2016-09-20 21:21:20 UTC
Another note: after restarting the master process, memory consumption came down to 285MB.

Comment 13 Vikas Laad 2016-10-03 13:55:25 UTC
Created attachment 1206900 [details]
Openshift master memory

Attached is a graph showing the memory growth after running the tests for 22 days; the growth is still there, but it is slow right now.

