Description of problem:
We have seen that the masters in our environments are using a large amount of memory as part of the API process. Usage has been observed around ~4GB, at which point the node is within ~200MB of running out of memory. Restarting the master API service seems to resolve the issue, but presumably it will return.

Tried:

  kubernetesMasterConfig:
    apiServerArguments:
      deserialization-cache-size:
      - "1000"

But it doesn't make a difference and the process shows the same memory usage as before:

[cloud-user@ip-10-98-30-41 ~]$ ps aux | grep -e "openshift start master api" -e USER
USER       PID %CPU %MEM     VSZ     RSS TTY STAT START   TIME COMMAND
root    117321 34.1 53.9 4644864 4176824 ?   Ssl  11:45 202:48 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8443 --master=https://master.example.com:8443

Our environment consists of:
- 3 masters
- 45 nodes, three of which are non-schedulable
- 554 pods
- 21 namespaces (including the default ones)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The API process memory keeps increasing.

Expected results:
The API process should not consume this much memory.

Additional info:
As this is a containerized environment we are not able to collect profile data.
For reference, over what time period are you seeing memory grow? After a restart, how much memory is used, and what is the slope of memory growth over an operational period?
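To help answer the slope question, a small sampler like the following can log the master API RSS at a fixed interval; this is a sketch, and the process pattern, interval, and script layout are assumptions to adjust for your environment:

```shell
#!/bin/sh
# Log one timestamped RSS sample (in KiB) for the oldest process whose
# command line matches the given pattern. Returns non-zero when no
# process matches, which also ends the sampling loop below.
sample_rss() {
    pid=$(pgrep -f -o "$1") || return 1
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) pid=$pid rss_kb=$rss_kb"
}

# Run with "loop" as the first argument to sample every 5 minutes;
# the resulting log gives the growth slope over an operational period.
case "${1:-}" in
loop)
    while sample_rss "openshift start master api"; do
        sleep 300
    done
    ;;
esac
```

Redirecting the loop's output to a file and plotting rss_kb against time would show both the post-restart baseline and the growth rate.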
You can get a heap profile from a containerized OpenShift master:

1. Edit /etc/sysconfig/origin-master (or possibly atomic-openshift-master).
2. Add "-e OPENSHIFT_PROFILE=web" to OPTIONS.
3. systemctl restart atomic-openshift-master.service
4. Because the containerized master runs in the host network namespace, you can then run:

   curl -s http://127.0.0.1:6060/debug/pprof/heap > heap.profile

Please attach the heap profile along with the output of "oc version". Unfortunately, this does involve restarting the master, and thus waiting for the issue to reproduce.
Created attachment 1285079 [details]
Heap profile
Output from 'oc version':

[root@ip-10-98-10-244 ~]# /var/usrlocal/bin/oc version
oc v3.4.1.7
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
The total size of the process as reported in the heap profile is only 283MB, so I assume you are not yet observing the reported issue. As of now, the top memory user is the etcd storage code, with ~150MB between decodeNodeList() and decodeObject() results being added to the deserialization cache via addToCache(). Can you post a new heap profile when the issue occurs?
Sure, that profile was provided by the customer. I'll ask them to provide a new one once this happens again. I'm only onsite today and tomorrow; if this does not happen again by tomorrow, I will ask the customer to upload it (to the case) when the issue occurs.
We are also noticing significant RAM usage by the origin master process:

===
# oc version
oc v1.5.1
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.1.17:8443
openshift v1.5.1
kubernetes v1.5.2+43a9be4

# ps aux | grep -e "openshift start master" -e USER
USER       PID %CPU %MEM      VSZ     RSS TTY STAT START   TIME COMMAND
root     14240  3.7 74.0 19072488 5935176 ?   Ssl  Aug15 935:21 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=2
===

The master process is eating up all the available RAM (8GB) until the OOM killer starts killing containers. We only have two namespaces, and this is a standalone registry deployment. We are not running any applications; we have 3 pods in total: router, registry, and registry-console. The issue seems to be exacerbated when pushing a batch of new container images to the registry that is exposed through the router.

I will enable the OPENSHIFT_PROFILE=web option and report back with a heap dump once we eventually run out of RAM again.
Created attachment 1320975 [details]
openshift master RAM usage screenshot
Created attachment 1321012 [details]
Heap after restarting origin-master

I was advised to post a heap profile after restarting origin-master and to take another dump once we notice the RAM issue. Here's the first dump; I will attach a new one once we hit critical RAM thresholds -- that might take a few days.
Created attachment 1321049 [details]
Heap after origin-master is taking up RAM

It turns out it didn't take a few days for the master process to eat up all the RAM again -- although we are ramping up our usage, so that might be related. Here's the heap dump from the master consuming almost all the available RAM, with the machine starting to swap.
I just noticed the bug mentions a *containerized* master taking too much RAM. Our master implementation was deployed with openshift-ansible and is not containerized. Should I open a new bug?
David - please open a new bug for your issue. Let's keep this bug focused on the containerized use case. Thanks!
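For anyone else triaging which bug their report belongs on: one quick heuristic for whether a given master process is containerized is to look for container-runtime scopes in its cgroup file. This is only a sketch, and the pattern list is an assumption that may need adjusting for your runtime:

```shell
#!/bin/sh
# Print "containerized" if the process's cgroup paths mention a
# container runtime scope, "host" otherwise. With no argument it
# inspects the calling process; a missing PID is reported as "host".
in_container() {
    if grep -qE 'docker|kubepods|machine\.slice|containerd|crio' \
        "/proc/${1:-self}/cgroup" 2>/dev/null; then
        echo containerized
    else
        echo host
    fi
}
```

For example, in_container "$(pgrep -f -o 'openshift start master')" would classify the running master process.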