Description of problem:
We have seen that the masters in our environments are using a large amount of memory as part of the API process. Usage has been observed around ~4GB, at which point the node is within ~200MB of running out of memory. Restarting the master API service seems to resolve the issue, but presumably it will return.

Tried:

  kubernetesMasterConfig:
    apiServerArguments:
      deserialization-cache-size:
      - "1000"

But it doesn't make a difference and the process shows the same memory usage as before:

[cloud-user@ip-10-98-30-41 ~]$ ps aux | grep -e "openshift start master api" -e USER
USER       PID %CPU %MEM     VSZ     RSS TTY STAT START   TIME COMMAND
root    117321 34.1 53.9 4644864 4176824 ?   Ssl  11:45 202:48 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8443 --master=https://master.example.com:8443

Our environment consists of:
- 3 masters
- 45 nodes, three of which are non-schedulable
- 554 pods
- 21 namespaces (including the default ones)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The API process memory keeps increasing.

Expected results:
The API process should not consume this much memory.

Additional info:
As this is a containerized environment we are not able to collect profile data.
For reference, over what time period are you seeing memory grow? After a restart, how much memory is used, and what is the slope of memory growth over an operational period?
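To help answer the slope question, a small sampler like the following can log the master API RSS at a fixed interval; this is a sketch, and the process pattern, interval, and script layout are assumptions to adjust for your environment:

```shell
#!/bin/sh
# Log one timestamped RSS sample (in KiB) for the oldest process whose
# command line matches the given pattern. Returns non-zero when no
# process matches, which also ends the sampling loop below.
sample_rss() {
    pid=$(pgrep -f -o "$1") || return 1
    rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) pid=$pid rss_kb=$rss_kb"
}

# Run with "loop" as the first argument to sample every 5 minutes;
# the resulting log gives the growth slope over an operational period.
case "${1:-}" in
loop)
    while sample_rss "openshift start master api"; do
        sleep 300
    done
    ;;
esac
```

Redirecting the loop's output to a file and plotting rss_kb against time would show both the post-restart baseline and the growth rate.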
You can get a heap profile from a containerized OpenShift master:

1. Edit /etc/sysconfig/origin-master (or possibly atomic-openshift-master).
2. Add "-e OPENSHIFT_PROFILE=web" to OPTIONS.
3. systemctl restart atomic-openshift-master.service
4. Because the containerized master runs in the host network namespace, you can then run:

   curl -s http://127.0.0.1:6060/debug/pprof/heap > heap.profile

Please attach the heap profile along with the output of "oc version". Unfortunately, this does involve restarting the master, and thus waiting for the issue to reproduce.
Created attachment 1285079 [details]
Heap profile
Output from 'oc version':

[root@ip-10-98-10-244 ~]# /var/usrlocal/bin/oc version
oc v3.4.1.7
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
The total size of the process as reported in the heap profile is only 283MB, so I assume you are not yet observing the reported issue. As of now, the top memory user is the etcd storage code, with ~150MB between decodeNodeList() and decodeObject() results being added to the deserialization cache via addToCache(). Can you post a new heap profile when the issue occurs?
Sure, that profile was provided by the customer. I'll ask them to provide a new one once this happens again. I'm only onsite today and tomorrow; if this does not happen again by tomorrow, I will ask the customer to upload it (to the case) when the issue occurs.
We are also noticing significant RAM usage by the origin master process:

===
# oc version
oc v1.5.1
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.1.17:8443
openshift v1.5.1
kubernetes v1.5.2+43a9be4

# ps aux | grep -e "openshift start master" -e USER
USER       PID %CPU %MEM      VSZ     RSS TTY STAT START   TIME COMMAND
root     14240  3.7 74.0 19072488 5935176 ?   Ssl  Aug15 935:21 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=2
===

The master process is eating up all the available RAM (8GB) until the OOM killer starts killing containers. We only have two namespaces, and this is a standalone registry deployment. We are not running any applications; we have 3 pods in total: router, registry, and registry-console. The issue seems to be exacerbated when pushing a batch of new container images to the registry that is exposed through the router.

I will enable the OPENSHIFT_PROFILE=web option and report back with a heap dump once we eventually run out of RAM again.
Created attachment 1320975 [details]
openshift master RAM usage screenshot
Created attachment 1321012 [details]
Heap after restarting origin-master

I was advised to post a heap profile after restarting origin-master and to take another dump once we notice the RAM issue. Here's the first dump; I will attach a new one once we hit critical RAM thresholds -- that might take a few days.
Created attachment 1321049 [details]
Heap after origin-master is taking up RAM

It turns out it didn't take a few days for the master process to eat up all the RAM again -- although we are ramping up our usage, so that might be related. Here's the heap dump from the master consuming almost all the available RAM, with the machine starting to swap.
I just noticed the bug mentions a *containerized* master taking too much RAM. Our master implementation was deployed with openshift-ansible and is not containerized. Should I open a new bug?
David - please open a new bug for your issue. Let's keep this bug focused on the containerized use case. Thanks!
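For anyone else triaging which bug their report belongs on: one quick heuristic for whether a given master process is containerized is to look for container-runtime scopes in its cgroup file. This is only a sketch, and the pattern list is an assumption that may need adjusting for your runtime:

```shell
#!/bin/sh
# Print "containerized" if the process's cgroup paths mention a
# container runtime scope, "host" otherwise. With no argument it
# inspects the calling process; a missing PID is reported as "host".
in_container() {
    if grep -qE 'docker|kubepods|machine\.slice|containerd|crio' \
        "/proc/${1:-self}/cgroup" 2>/dev/null; then
        echo containerized
    else
        echo host
    fi
}
```

For example, in_container "$(pgrep -f -o 'openshift start master')" would classify the running master process.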