Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1613008

Summary: [3.10,3.11] Memory leak on master node
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Master
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DEFERRED
QA Contact: Xingxing Xia <xxia>
Severity: low
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jeder, jokerman, mifiedle, mmccomas, schituku, vlaad
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-20 19:09:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Embargoed:
Attachments (description, flags):
  memory usage (none)
  memory on master (none)
  few graphs from prometheus (none)
  api-memory.png (none)
  controllers-1-memory.png (none)
  controller-2-memory.png (none)
  api-2-memory.png (none)
  api-3-memory.png (none)
  3.11 Memory graph on master (none)
  3.11 pods memory graph on master (none)
  3.11 Memory graph on master based on docker (none)
  3.11 pods memory graph on master based on docker (none)

Description Vikas Laad 2018-08-06 17:52:39 UTC
Description of problem:
I have a cluster with OpenShift 3.10 installed, and I am running reliability tests on it.

https://github.com/openshift/svt/tree/master/reliability

The reliability tests create a bunch of quickstart applications on the cluster and keep them running. They also frequently access, scale, build, delete, and re-create applications over time. This cluster has the following nodes:
1 master
1 etcd (separate)
1 infra
2 compute

In previous releases we saw memory growth on the master for about a week, after which usage stayed constant. On this 3.10 cluster we still see it growing after running the tests for 4 weeks. I am attaching a graph of total memory used on the system; I collected that data using CloudWatch.

Version-Release number of selected component (if applicable):
openshift v3.10.12

Steps to Reproduce:
1. create openshift 3.10 cluster
2. run reliability tests using https://github.com/openshift/svt/tree/master/reliability
3. watch memory on master

Actual results:
Memory usage keeps growing on master

Expected results:
After some time memory growth should stop.

Additional info:
CloudWatch reports memory usage data from the /proc/meminfo file; please see https://github.com/vikaslaad/aws-scripts-mon/blob/master/mon-put-instance-data.pl for more details.
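For reference, the "used memory" figure derived from /proc/meminfo can be sketched as below. This is a hedged approximation, not the actual mon-put-instance-data.pl logic; the field names are the standard Linux kernel ones, and the `read_meminfo`/`used_kb` helpers are hypothetical names introduced here for illustration.

```python
# Hedged sketch (not the actual CloudWatch script): reproduce the usual
# "used memory" arithmetic a monitoring script derives from /proc/meminfo.
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of field name -> value in kB."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.strip().split()[0])  # first token is the numeric value
    return fields

def used_kb(info):
    """Used memory = total - free - buffers - page cache (all in kB)."""
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]

if __name__ == "__main__":
    info = read_meminfo()
    print(f"used: {used_kb(info)} kB of {info['MemTotal']} kB total")
```

Sampling this value periodically is what produces the steadily climbing line seen in the attached graph.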

Comment 1 Vikas Laad 2018-08-06 17:53:13 UTC
Created attachment 1473702 [details]
memory usage

Comment 2 Vikas Laad 2018-08-06 17:55:01 UTC
I still have the cluster around; please let me know if you want to look at it. The blue line on the graph is memory usage on the master.

Comment 3 Michal Fojtik 2018-08-07 10:34:28 UTC
Can we get Prometheus metrics from this cluster to see which process is causing the memory to grow?

Also, some object counts (how many images, daemonsets, secrets, etc.) would help.

Comment 4 Vikas Laad 2018-08-07 15:47:54 UTC
Created attachment 1474034 [details]
memory on master

Please see the attached Prometheus data covering a few minutes; we were trying to configure Prometheus for a longer duration and lost the earlier data. I will update this BZ again when we have more data. Please let me know if you need anything else.

root@ip-172-31-13-187: ~ # oc get project | wc -l
35

root@ip-172-31-13-187: ~ # oc get images | wc -l
219

root@ip-172-31-13-187: ~ # oc get ds --all-namespaces | wc -l
9

root@ip-172-31-13-187: ~ # oc get secrets --all-namespaces  | wc -l
537

Comment 5 Vikas Laad 2018-08-09 13:40:12 UTC
Created attachment 1474681 [details]
few graphs from prometheus

Comment 6 Jordan Liggitt 2018-08-16 15:17:57 UTC
Do we know which process is using the memory? (apiserver, controllers, etc.)
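One quick way to answer this on the master, sketched below under the assumption of a standard procps `ps`: snapshot per-process resident memory and repeat periodically, so the component whose RSS keeps climbing (apiserver, controllers, etcd, ...) stands out.

```shell
# Hedged sketch: list the top processes by resident set size (RSS, in kB).
# Diffing repeated snapshots attributes the growth to a specific component.
ps -eo rss,comm --sort=-rss | head -n 10
```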

Comment 8 Jordan Liggitt 2018-08-17 14:15:42 UTC
Created attachment 1476633 [details]
api-memory.png

Comment 9 Jordan Liggitt 2018-08-17 14:16:13 UTC
Created attachment 1476634 [details]
controllers-1-memory.png

Comment 10 Jordan Liggitt 2018-08-17 14:18:17 UTC
Created attachment 1476635 [details]
controller-2-memory.png

Comment 11 Jordan Liggitt 2018-08-17 14:18:46 UTC
Created attachment 1476636 [details]
api-2-memory.png

Comment 12 Jordan Liggitt 2018-08-17 14:19:14 UTC
Created attachment 1476637 [details]
api-3-memory.png

Comment 13 Jordan Liggitt 2018-08-17 14:20:22 UTC
I'm seeing memory growth of ~10MB per day in some components. That amount of growth doesn't seem concerning to me. Are we seeing faster growth under certain tests/workloads?
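A per-day growth figure like this can be read straight out of Prometheus. The expression below is a hypothetical sketch, assuming the cluster's Prometheus scrapes the control-plane processes with the standard `process_resident_memory_bytes` gauge and that the jobs carry labels like `apiserver`/`controllers` (both assumptions, not confirmed by this bug):

```
# Approximate RSS growth rate (bytes/second) per control-plane job,
# fitted over a 1-day window
avg by (job) (deriv(process_resident_memory_bytes{job=~"apiserver|controllers"}[1d]))
```

Multiplying the result by 86400 gives bytes per day, directly comparable to the ~10MB/day observed here.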

Comment 15 Siva Reddy 2018-09-14 15:16:54 UTC
Created attachment 1483355 [details]
3.11 Memory graph on master

This graph shows 9 days of memory usage.

Comment 16 Siva Reddy 2018-09-14 15:17:37 UTC
Created attachment 1483356 [details]
3.11 pods memory graph on master

Comment 18 Siva Reddy 2018-09-25 18:09:59 UTC
Created attachment 1486854 [details]
3.11 Memory graph on master based on docker

3.11 docker environment memory consumption graph on master node

Comment 19 Siva Reddy 2018-09-25 18:11:16 UTC
Created attachment 1486856 [details]
3.11 pods memory graph on master based on docker

Memory graphs, from the 3.11 docker environment, of the pods that show the memory leak. The graph covers 11 days.

Comment 20 Stephen Cuppett 2019-11-20 19:09:44 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift