Bug 1458238

Summary: OCP Master APIs are using an excessive amount of memory in containerized env
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED NOTABUG
QA Contact: Chuan Yu <chuyu>
Severity: low
Docs Contact:
Priority: low
Version: 3.5.0
CC: aos-bugs, decarr, dmsimard, jokerman, mak, mmccomas, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-07 14:13:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Heap profile (flags: none)
  openshift master RAM usage screenshot (flags: none)
  Heap after restarting origin-master (flags: none)
  Heap after origin-master is taking up RAM (flags: none)

Description Jaspreet Kaur 2017-06-02 11:22:33 UTC
Description of problem: We have seen that the masters in our environments are using a large amount of memory in their API process. Usage has been observed at around ~4GB, at which point the node is within ~200MB of running out of memory.

Restarting the Master API service seems to resolve the issue, but presumably it will return.

Tried setting the following in master-config.yaml:

kubernetesMasterConfig:
  apiServerArguments:
    deserialization-cache-size:
   - "1000"


But it doesn't make a difference; it shows the same memory usage as before:


[cloud-user@ip-10-98-30-41 ~]$ ps aux | grep -e "openshift start master api" -e USER
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     117321 34.1 53.9 4644864 4176824 ?     Ssl  11:45 202:48 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8443 --master=https://master.example.com:8443
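
One quick sanity check before ruling the setting out (a rough sketch; it only confirms the override actually made it into the config file referenced in the command above):

grep -A 3 apiServerArguments /etc/origin/master/master-config.yaml
# expected to print the deserialization-cache-size block added above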


Our environment consists of:
- 3 masters
- 45 nodes, three of which are non-schedulable nodes
- 554 pods
- 21 namespaces (including the default ones)


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The API process memory keeps increasing over time.


Expected results: The API process should not consume this much memory.


Additional info: As this is a containerized environment, we are not able to collect profile data.

Comment 1 Derek Carr 2017-06-02 14:44:00 UTC
For reference, over what time period are you seeing memory grow?  After a restart, how much memory is used, and what is the slope of memory growth over an operational period?
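
If it helps, memory growth can be sampled over time with something like the following (a rough sketch only; the log path is arbitrary and the pgrep pattern matches the API process from the description):

while true; do
  # append a timestamped RSS sample (in KiB) for the master API process
  echo "$(date -u +%FT%TZ) $(ps -o rss= -p "$(pgrep -f 'openshift start master api')")" >> /tmp/master-rss.log
  sleep 300
done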

Comment 3 Seth Jennings 2017-06-03 20:06:05 UTC
You can get a heap profile from a containerized OpenShift master.

Edit /etc/sysconfig/origin-master (or maybe atomic-openshift-master)
Add "-e OPENSHIFT_PROFILE=web" to OPTIONS (see the example below)
systemctl restart atomic-openshift-master.service
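
The resulting OPTIONS line might look like this (a sketch only; keep whatever flags are already in OPTIONS and just append the -e option, which presumably ends up on the container's docker run command line):

OPTIONS="--loglevel=2 -e OPENSHIFT_PROFILE=web"
# hypothetical: --loglevel=2 stands in for whatever is already in your OPTIONS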

Because the containerized master runs in the host network namespace, you can:

curl -s http://127.0.0.1:6060/debug/pprof/heap > heap.profile

Please attach the heap profile along with the output of "oc version"

Unfortunately, this does involve restarting the master, so you will have to wait for the issue to recur before a useful profile can be captured.

Comment 4 Marcos Entenza 2017-06-05 15:51:40 UTC
Created attachment 1285079 [details]
Heap profile

Comment 5 Marcos Entenza 2017-06-05 15:52:35 UTC
Output from 'oc version':

[root@ip-10-98-10-244 ~]# /var/usrlocal/bin/oc version
oc v3.4.1.7
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Comment 6 Seth Jennings 2017-06-05 16:41:29 UTC
The total size of the process as reported in the heap profile is only 283MB.  I would assume you are not observing the reported issue yet.

As of now, the top memory user is the etcd storage code, with ~150MB of decodeNodeList() and decodeObject() results being added to the deserialization cache via addToCache().

Can you post a new heap profile when the issue occurs?
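
For reference, this is roughly how the attached profile can be read (a sketch; pprof from this Go release generally wants the matching binary next to the profile, and /usr/bin/openshift is the path from the ps output in the description):

go tool pprof -top -inuse_space /usr/bin/openshift heap.profile
# prints the functions holding the most live heap, e.g. addToCache / decodeNodeList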

Comment 7 Marcos Entenza 2017-06-05 16:44:49 UTC
Sure, that one was provided by the Customer. I'll ask them to provide a new one once this happens again. I'm onsite only today and tomorrow; if it does not happen again by tomorrow, I will ask the Customer to upload it (to the case) when the issue occurs.

Comment 9 David Moreau Simard 2017-09-01 14:00:01 UTC
We are also noticing significant RAM usage of the origin master process:

===
# oc version
oc v1.5.1
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.1.17:8443
openshift v1.5.1
kubernetes v1.5.2+43a9be4

# ps aux | grep -e "openshift start master" -e USER
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      14240  3.7 74.0 19072488 5935176 ?    Ssl  Aug15 935:21 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=2
===

The master process is eating up all the available RAM (8GB) until OOM killer starts killing containers.

We only have two namespaces and this is a standalone registry deployment. We are not using any applications.
We have 3 pods in total: router, registry and registry-console.

The issue seems to be exacerbated when pushing a batch of new container images to the registry that is exposed through the router.

I will enable the OPENSHIFT_PROFILE=web option and report back with a heap dump once we eventually run out of RAM again.
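
For our install, the plan (a sketch; it assumes the origin-master unit reads /etc/sysconfig/origin-master as an environment file) is to set the variable directly in the sysconfig file and restart:

echo 'OPENSHIFT_PROFILE=web' >> /etc/sysconfig/origin-master
systemctl restart origin-master.service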

Comment 10 David Moreau Simard 2017-09-01 14:01:30 UTC
Created attachment 1320975 [details]
openshift master RAM usage screenshot

Comment 11 David Moreau Simard 2017-09-01 15:10:52 UTC
Created attachment 1321012 [details]
Heap after restarting origin-master

I was advised to post a heap profile after restarting origin-master and to take another dump once we notice the RAM issue. Here's the first dump; I will attach a new one once we hit critical RAM thresholds -- might take a few days.

Comment 12 David Moreau Simard 2017-09-01 17:28:06 UTC
Created attachment 1321049 [details]
Heap after origin-master is taking up RAM

It turns out it didn't take a few days for the master process to eat up all the RAM again, though we're ramping up our usage, so that might be related... Here's the heap dump from the master consuming almost all the available RAM, with the machine starting to swap.

Comment 13 David Moreau Simard 2017-09-01 17:31:53 UTC
I just noticed the bug mentions a *containerized* master taking too much RAM. Our master was deployed with openshift-ansible and is not containerized. Should I open a new bug?

Comment 14 Derek Carr 2017-09-07 14:09:10 UTC
David - please open a new bug for your issue. Let's keep this bug focused on the containerized use case. Thanks!