Bug 1489501 - Memory usage for origin-master is abnormally high
Summary: Memory usage for origin-master is abnormally high
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.2.0
Assignee: Stefan Schimanski
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-07 14:46 UTC by David Moreau Simard
Modified: 2019-08-02 09:09 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-02 09:09:56 UTC
Target Upstream Version:
Embargoed:


Attachments
Heap shortly after restarting origin-master (306.94 KB, text/plain) - 2017-09-07 14:48 UTC, David Moreau Simard
Heap once origin-master is using most of the available RAM (1.15 MB, text/plain) - 2017-09-07 14:48 UTC, David Moreau Simard
origin-master RAM usage screenshot (188.40 KB, image/png) - 2017-09-07 14:49 UTC, David Moreau Simard
origin-master logs (4.25 MB, application/x-gzip) - 2017-09-07 14:50 UTC, David Moreau Simard

Description David Moreau Simard 2017-09-07 14:46:33 UTC
Also see my comments in what is perhaps a related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1458238#c9

Our all-in-one (no HA, no cluster, local storage) implementation of OpenShift Standalone Registry is seeing what seems like abnormally high RAM usage of the origin-master process.

===
# oc version
oc v1.5.1
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.1.17:8443
openshift v1.5.1
kubernetes v1.5.2+43a9be4

# ps aux | grep -e "openshift start master" -e USER
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      14240  3.7 74.0 19072488 5935176 ?    Ssl  Aug15 935:21 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=2

# docker info
Containers: 12
 Running: 6
 Paused: 0
 Stopped: 6
Images: 9
Server Version: 1.12.6
Storage Driver: overlay2
 Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: host bridge null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.21.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 7.639 GiB
Name: registry.rdoproject.org.rdocloud
ID: HBS7:WFL6:5QFN:KM7M:TTSD:I557:J565:TVEJ:C6CE:ZEL2:GJCY:FPZQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 172.30.0.0/16
 127.0.0.0/8
Registries: docker.io (secure)
===


It is easily consuming over 6 GB of RAM by itself, pushing an 8 GB RAM node well into 1.5 GB of swap.

Restarting origin-master frees up the memory, but usage eventually climbs back up. Output from our monitoring this morning (yes, within 15 minutes):

09:15:28 CheckRAM OK: 41.23% RAM free
09:31:22 CheckRAM WARNING: 9.4% RAM free

We used openshift-ansible to deploy this all-in-one node; the deployment code is available here [1] and the variables passed to the OpenShift roles are here [2].

If I had to guess, this memory usage might be caused by a significant number of consecutive "docker push" operations (~100 images) after building a new batch of images.

[1]: https://github.com/rdo-infra/rdo-container-registry
[2]: https://github.com/rdo-infra/rdo-container-registry/blob/master/group_vars/OSEv3.yml
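
For reference, a minimal sketch of how heap profiles like the ones attached below can be captured, assuming profiling has been enabled on the master (for example by starting it with OPENSHIFT_PROFILE=web, if this version supports it) and that pprof is served on localhost:6060; the port and endpoint are assumptions, not taken from this report:

===
# capture a text heap profile from the master's pprof endpoint (path assumed)
curl -s http://127.0.0.1:6060/debug/pprof/heap?debug=1 > heap-profile.txt
===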

Comment 1 David Moreau Simard 2017-09-07 14:48:04 UTC
Created attachment 1323237 [details]
Heap shortly after restarting origin-master

Comment 2 David Moreau Simard 2017-09-07 14:48:50 UTC
Created attachment 1323238 [details]
Heap once origin-master is using most of the available RAM

Comment 3 David Moreau Simard 2017-09-07 14:49:33 UTC
Created attachment 1323239 [details]
origin-master RAM usage screenshot

Comment 4 David Moreau Simard 2017-09-07 14:50:24 UTC
Created attachment 1323240 [details]
origin-master logs

Comment 5 Michal Fojtik 2017-09-11 10:04:38 UTC
From the master log, it seems like you are getting a LOT of images. Can you please check how many images you have via `oc get images | wc -l` (as admin)?

If there are many images, it seems like you need to run pruning.

Comment 6 Michal Fojtik 2017-09-11 10:13:42 UTC
How often do you run the 100 builds? If you never prune, OpenShift keeps a record of every image you have ever built. There is a default "re-list" interval (~15 minutes) that loads all images into a cache for the API server. That might explain why the memory usage went up after 15 minutes.

What I can recommend is setting up really aggressive pruning so you get rid of images that are not referenced by any image stream. That should free up the cache and cause the memory usage to go down.
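
A minimal sketch of such a pruning run, assuming a cluster-admin (or image-pruner) login and the `oadm` tooling shipped with this OpenShift version; the thresholds are illustrative, not a recommendation specific to this deployment:

===
# dry run first to see what would be removed (no --confirm)
oadm prune images --keep-tag-revisions=3 --keep-younger-than=60m
# add --confirm to actually delete the pruned images and layers
oadm prune images --keep-tag-revisions=3 --keep-younger-than=60m --confirm
===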

Comment 7 David Moreau Simard 2017-09-11 14:22:36 UTC
As discussed, a namespaced 'oc get images -n <project> | wc -l' returned 39469. We build those images several times a day as part of a continuous testing and integration pipeline.

I'll read the documentation on pruning and see if it helps.
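
Since the images come from a pipeline that builds several times a day, a recurring prune would likely be needed. A hypothetical cron entry, assuming a token bound to the system:image-pruner cluster role; the token placeholder, schedule, and thresholds are illustrative only:

===
# /etc/cron.d/openshift-image-prune (hypothetical)
0 */6 * * * root oadm prune images --token=<pruner-token> --keep-tag-revisions=3 --keep-younger-than=60m --confirm
===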

Comment 8 David Moreau Simard 2017-11-28 16:31:54 UTC
FWIW memory usage on OpenShift 3.7 seems much, much better.

Some stats from a quick glance at this new deployment so far:
- 10619 images
- 8203 tags
- 248 image streams
- 341GB disk space used

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7822        3026         390           1        4405        4425
Swap:             0           0           0

Comment 9 Stefan Schimanski 2019-08-02 09:09:56 UTC
Seems to be fine since 3.7. Closing.

