Bug 1371985

Summary: Possible memory leak in openshift master process during reliability long run
Product: OpenShift Container Platform
Component: Node
Version: 3.3.0
Reporter: Vikas Laad <vlaad>
Assignee: Paul Morie <pmorie>
QA Contact: DeShuai Ma <dma>
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
CC: agoldste, aos-bugs, decarr, jeder, jokerman, mifiedle, mmccomas, pep, tschan+redhat, tstclair
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-10-26 18:01:59 UTC
Type: Bug
Attachments:
- Memory Utilization Graph
- pprof heap
- pprof heap for another time
- mem growth
- Openshift master memory

Description Vikas Laad 2016-08-31 15:29:11 UTC
Created attachment 1196444 [details]
Memory Utilization Graph

Description of problem:
Memory consumption on the master grows to 3G and is still increasing after a few days of running reliability tests. For background, see the following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1323733#c37

Even after setting the following, memory consumption is still increasing:
    deserialization-cache-size:
    - "1000"


Version-Release number of selected component (if applicable):
openshift v3.3.0.22
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

How reproducible:
Reproducible after running the reliability tests for a few days; in this case we ran them for 10 days.

Steps to Reproduce:
1. Create an OpenShift cluster in AWS
2. Start the tests, which create/build/redeploy/delete projects and apps
3. Let them run for a few days and watch memory consumption

Actual results:
Memory consumption keeps going up and does not level off.

Expected results:
Memory consumption should stabilize after some time.

Additional info:
Please find attached the graph from CloudWatch; the Y axis is percent of total RAM, and total RAM on the master was 16G. The sudden drop in memory marks the point where deserialization-cache-size was added to the master config and the master process was restarted.

Also find attached pprof heap profiles taken last night and this morning.
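
For anyone reproducing this, a heap profile like the attached ones can be captured roughly as follows (a sketch only; the port, cert paths, and binary path are assumptions for a default 3.3 master, and the pprof endpoint requires cluster-admin client certs):

    # Grab a heap profile from the running master and inspect it offline.
    curl -sk --cert /etc/origin/master/admin.crt --key /etc/origin/master/admin.key \
      https://localhost:8443/debug/pprof/heap > heap.pprof
    go tool pprof /usr/bin/openshift heap.pprof   # then 'top' or 'tree' at the pprof prompt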

Comment 1 Vikas Laad 2016-08-31 15:30:13 UTC
Created attachment 1196445 [details]
pprof heap

Comment 2 Vikas Laad 2016-08-31 15:30:45 UTC
Created attachment 1196446 [details]
pprof heap for another time

Comment 3 Jeremy Eder 2016-08-31 16:23:02 UTC
Vikas -- if you use stress to allocate a bunch of RAM and put the system under (gentle) memory pressure, does the RSS of the master go down?

https://copr-be.cloud.fedoraproject.org/results/ndokos/pbench/epel-7-x86_64/00182790-pbench-stress/
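
Something along these lines would show it (the stress invocation and process name are just an example, not from this bug; adjust sizes to the host):

    # Record the master's RSS, apply memory pressure for 5 minutes, then re-check.
    ps -o rss=,args= -C openshift
    stress --vm 4 --vm-bytes 2G --vm-keep --timeout 300
    ps -o rss=,args= -C openshift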

Comment 6 Derek Carr 2016-09-01 18:55:23 UTC
After further analysis, we are not seeing continued RSS growth. RSS usage stabilized, and it can be tuned further in the future with etcd caching for Origin-specific objects when we revisit https://github.com/openshift/origin/pull/10719

Comment 7 Vikas Laad 2016-09-08 13:45:14 UTC
Created attachment 1199116 [details]
mem growth

I am still seeing the memory growth; the attached graph shows memory consumed by the openshift-master process only. Samples were taken every 15 minutes over the last few days.
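
A loop along these lines produces this kind of series (a sketch only; the actual collection method is not stated in this bug, and the process name is an assumption):

    # Append a timestamped RSS (KiB) sample for the master process every 15 minutes.
    while true; do
        echo "$(date -u +%FT%TZ) $(ps -o rss= -C openshift | head -1)" >> master-rss.log
        sleep 900
    done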

Comment 8 Paul Morie 2016-09-08 20:36:21 UTC
This master is running with etcd embedded -- can we do this with an external etcd?  It will be far, far easier to diagnose that way.

Comment 9 Vikas Laad 2016-09-09 15:25:08 UTC
Started another run with external etcd, will update the bug again with some data.

Comment 11 Vikas Laad 2016-09-20 15:44:37 UTC
I had another cluster with the same issue. After stopping the tests and deleting all the projects, memory usage on the nodes came down, but the master process was still holding the same amount of memory (on this cluster it was at 2.3G). I then ran the pbench stress mentioned in comment 3, and that made no difference; even forcing an OOM on the master with stress did not make the master's RSS go down.

Comment 12 Vikas Laad 2016-09-20 21:21:20 UTC
Another note: after restarting the master process, memory consumption came down to 285MB.

Comment 13 Vikas Laad 2016-10-03 13:55:25 UTC
Created attachment 1206900 [details]
Openshift master memory

Attached is a graph showing the memory growth after running the tests for 22 days; growth is still present, but it is slow at this point.