Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1613008

Summary: [3.10,3.11] Memory leak on master node
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Master
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DEFERRED
QA Contact: Xingxing Xia <xxia>
Severity: low
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, jeder, jokerman, mifiedle, mmccomas, schituku, vlaad
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-20 19:09:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Embargoed:
Attachments (description, flags):
  memory usage (none)
  memory on master (none)
  few graphs from prometheus (none)
  api-memory.png (none)
  controllers-1-memory.png (none)
  controller-2-memory.png (none)
  api-2-memory.png (none)
  api-3-memory.png (none)
  3.11 Memory graph on master (none)
  3.11 pods memory graph on master (none)
  3.11 Memory graph on master based on docker (none)
  3.11 pods memory graph on master based on docker (none)

Description Vikas Laad 2018-08-06 17:52:39 UTC
Description of problem:
I have a cluster with OpenShift 3.10 installed, and I am running reliability tests on it.

https://github.com/openshift/svt/tree/master/reliability

The reliability tests create a bunch of quickstart applications on the cluster and keep them running. They also frequently access, scale, build, delete, and re-create applications over time. This cluster has the following nodes:
1 master
1 etcd (separate)
1 infra
2 compute

In previous releases we saw memory growth on the master for about a week, after which usage stayed constant. On this 3.10 cluster we still see it growing after running the tests for 4 weeks. I am attaching a graph of total memory used on the system; I collected that data using CloudWatch.

Version-Release number of selected component (if applicable):
openshift v3.10.12

Steps to Reproduce:
1. create openshift 3.10 cluster
2. run reliability tests using https://github.com/openshift/svt/tree/master/reliability
3. watch memory on master

Actual results:
Memory usage keeps growing on master

Expected results:
After some time memory growth should stop.

Additional info:
CloudWatch reports memory usage data from the /proc/meminfo file; please see https://github.com/vikaslaad/aws-scripts-mon/blob/master/mon-put-instance-data.pl for more details.
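For reference, the "used memory" figure derived from /proc/meminfo can be sketched as below. This is a hedged approximation, not the actual mon-put-instance-data.pl logic; the field names are the standard Linux kernel ones, and the `read_meminfo`/`used_kb` helpers are hypothetical names introduced here for illustration.

```python
# Hedged sketch (not the actual CloudWatch script): reproduce the usual
# "used memory" arithmetic a monitoring script derives from /proc/meminfo.
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of field name -> value in kB."""
    fields = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.strip().split()[0])  # first token is the numeric value
    return fields

def used_kb(info):
    """Used memory = total - free - buffers - page cache (all in kB)."""
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]

if __name__ == "__main__":
    info = read_meminfo()
    print(f"used: {used_kb(info)} kB of {info['MemTotal']} kB total")
```

Sampling this value periodically is what produces the steadily climbing line seen in the attached graph.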

Comment 1 Vikas Laad 2018-08-06 17:53:13 UTC
Created attachment 1473702 [details]
memory usage

Comment 2 Vikas Laad 2018-08-06 17:55:01 UTC
I still have the cluster around; please let me know if you want to look at it. The blue line on the graph is memory usage on the master.

Comment 3 Michal Fojtik 2018-08-07 10:34:28 UTC
Can we get Prometheus metrics from this cluster to see which process is causing the memory to grow?

Also, some object counts (how many images, daemonsets, secrets, etc.) would help.

Comment 4 Vikas Laad 2018-08-07 15:47:54 UTC
Created attachment 1474034 [details]
memory on master

Please see the attached Prometheus data covering a few minutes; we were trying to configure Prometheus for a longer duration and lost the earlier data. I will update this BZ again when we have more data. Please let me know if you need anything else.

root@ip-172-31-13-187: ~ # oc get project | wc -l
35

root@ip-172-31-13-187: ~ # oc get images | wc -l
219

root@ip-172-31-13-187: ~ # oc get ds --all-namespaces | wc -l
9

root@ip-172-31-13-187: ~ # oc get secrets --all-namespaces  | wc -l
537

Comment 5 Vikas Laad 2018-08-09 13:40:12 UTC
Created attachment 1474681 [details]
few graphs from prometheus

Comment 6 Jordan Liggitt 2018-08-16 15:17:57 UTC
Do we know which process is using the memory? (apiserver, controllers, etc.)
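One quick way to answer this on the master, sketched below under the assumption of a standard procps `ps`: snapshot per-process resident memory and repeat periodically, so the component whose RSS keeps climbing (apiserver, controllers, etcd, ...) stands out.

```shell
# Hedged sketch: list the top processes by resident set size (RSS, in kB).
# Diffing repeated snapshots attributes the growth to a specific component.
ps -eo rss,comm --sort=-rss | head -n 10
```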

Comment 8 Jordan Liggitt 2018-08-17 14:15:42 UTC
Created attachment 1476633 [details]
api-memory.png

Comment 9 Jordan Liggitt 2018-08-17 14:16:13 UTC
Created attachment 1476634 [details]
controllers-1-memory.png

Comment 10 Jordan Liggitt 2018-08-17 14:18:17 UTC
Created attachment 1476635 [details]
controller-2-memory.png

Comment 11 Jordan Liggitt 2018-08-17 14:18:46 UTC
Created attachment 1476636 [details]
api-2-memory.png

Comment 12 Jordan Liggitt 2018-08-17 14:19:14 UTC
Created attachment 1476637 [details]
api-3-memory.png

Comment 13 Jordan Liggitt 2018-08-17 14:20:22 UTC
I'm seeing memory growth of ~10MB per day in some components. That amount of growth doesn't seem concerning to me. Are we seeing faster growth under certain tests/workloads?
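A per-day growth figure like this can be read straight out of Prometheus. The expression below is a hypothetical sketch, assuming the cluster's Prometheus scrapes the control-plane processes with the standard `process_resident_memory_bytes` gauge and that the jobs carry labels like `apiserver`/`controllers` (both assumptions, not confirmed by this bug):

```
# Approximate RSS growth rate (bytes/second) per control-plane job,
# fitted over a 1-day window
avg by (job) (deriv(process_resident_memory_bytes{job=~"apiserver|controllers"}[1d]))
```

Multiplying the result by 86400 gives bytes per day, directly comparable to the ~10MB/day observed here.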

Comment 15 Siva Reddy 2018-09-14 15:16:54 UTC
Created attachment 1483355 [details]
3.11 Memory graph on master

This graph shows 9 days of memory usage.

Comment 16 Siva Reddy 2018-09-14 15:17:37 UTC
Created attachment 1483356 [details]
3.11 pods memory graph on master

Comment 18 Siva Reddy 2018-09-25 18:09:59 UTC
Created attachment 1486854 [details]
3.11 Memory graph on master based on docker

3.11 docker environment memory consumption graph on master node

Comment 19 Siva Reddy 2018-09-25 18:11:16 UTC
Created attachment 1486856 [details]
3.11 pods memory graph on master based on docker

Memory graphs, from the 3.11 docker environment, of the pods that show the memory leak. The graph covers 11 days.

Comment 20 Stephen Cuppett 2019-11-20 19:09:44 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift