Bug 1323733 - etcd cache in kube api server with large fixed size causing high memory usage
Summary: etcd cache in kube api server with large fixed size causing high memory usage
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 3.3.1
Assignee: Seth Jennings
QA Contact: Vikas Laad
: 1357057 (view as bug list)
Depends On:
Blocks: 1435695
Reported: 2016-04-04 14:27 UTC by Vikas Laad
Modified: 2017-05-24 08:03 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1435695 (view as bug list)
Last Closed: 2016-09-27 09:37:27 UTC
Target Upstream Version:

Attachments (Terms of Use)
pidstat (144.44 KB, text/plain)
2016-04-04 14:27 UTC, Vikas Laad
no flags Details
Memory Utilization Graph (269.72 KB, image/jpeg)
2016-04-04 14:29 UTC, Vikas Laad
no flags Details
memory-profile.svg (90.71 KB, image/svg+xml)
2016-04-06 16:08 UTC, Seth Jennings
no flags Details
memory-profile-2.svg (88.23 KB, image/svg+xml)
2016-04-07 14:21 UTC, Seth Jennings
no flags Details
memory-profile-3.svg (99.04 KB, image/svg+xml)
2016-04-07 18:43 UTC, Seth Jennings
no flags Details

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC

Description Vikas Laad 2016-04-04 14:27:33 UTC
Created attachment 1143369 [details]

Description of problem:
Memory utilization by the OpenShift master process increases continuously when running Application Reliability tests. These tests create a bunch of applications (10 in this case) at the beginning and keep scaling/building/accessing them over a period of time. The applications are created using templates (eap64-mysql-s2i, dancer-mysql-example, cakephp-mysql). The tests were started on 3/29/2016 and are still running; please see the attached graph and the pidstat output collected by pbench.

Here is the cluster information
1 Master and 2 Nodes created in AWS m4.xlarge

Version-Release number of selected component (if applicable):

How reproducible:
Run reliability tests for a few days.

Steps to Reproduce:
1. Create a cluster in AWS 1 Master, 2 Nodes
2. Create 10 applications using different templates
3. Keep scaling up/down, building, and accessing them for a few days; watch memory utilization on the Master node.

Actual results:
Increasing memory consumption by openshift master process

Expected results:
Memory consumption should stabilize after a few hours

Additional info:

Comment 1 Vikas Laad 2016-04-04 14:29:10 UTC
Created attachment 1143370 [details]
Memory Utilization Graph

Comment 2 Mike Fiedler 2016-04-04 14:33:34 UTC
As of today (4 April), there is a running environment in this state if someone wants to access it. Contact vlaad directly for details.

Comment 3 Seth Jennings 2016-04-06 16:08:56 UTC
Created attachment 1144269 [details]

After enabling profiling and letting it run for 18 hours or so, I got this memory profile (see attachment).

reflect.Value.Call along the etcdWatcher path is the part that is growing over time. I'm not sure whether this is expected (i.e., a cache) or not (i.e., a leak).

Comment 4 Andy Goldstein 2016-04-07 10:40:23 UTC
Jan, could you please take a look?

Comment 5 Jan Chaloupka 2016-04-07 13:12:28 UTC
Sure, I've been looking at it for some time now.

Comment 6 Seth Jennings 2016-04-07 14:21:21 UTC
Created attachment 1144755 [details]

New memory profile after about 40 hours.

(pprof) top10
1870.05MB of 2096.66MB total (89.19%)
Dropped 2413 nodes (cum <= 10.48MB)
Showing top 10 nodes out of 142 (cum >= 26.51MB)
      flat  flat%   sum%        cum   cum%
 1012.05MB 48.27% 48.27%  1012.05MB 48.27%  reflect.Value.call
  385.22MB 18.37% 66.64%   385.22MB 18.37%  github.com/ugorji/go/codec.(*jsonDecDriver).DecodeString

reflect.Value.call continues to be the primary reason for the increased size; DecodeString to a lesser degree.

Comment 7 Jan Chaloupka 2016-04-07 14:42:41 UTC
From my interpretation of the memory profiling graph, the 560.31MB is accumulated inside the reflect.Value.call function [1], starting at line 314.

Compared to the latest commit in github.com/golang/go, the function has been updated since go-1.4, which is what we currently have in RHEL. I'm not sure whether that has any effect on the memory leak (if it is a leak). Most of the function consists of checks.

The call function allocates memory for the args variable on line 394 [2]. args is used to store the parameters for the call. Once the call is done, a loop copies the results of the call into the ret slice. Here is the code:

        off = retOffset
        for i := 0; i < nout; i++ {
                tv := t.Out(i)
                a := uintptr(tv.Align())
                off = (off + a - 1) &^ (a - 1)
                fl := flagIndir | flag(tv.Kind())
                ret[i] = Value{tv.common(), unsafe.Pointer(uintptr(args) + off), fl}
                off += tv.Size()
        }

        return ret

If the value of retOffset is non-zero, the memory in [uintptr(args), uintptr(args) + retOffset) is never freed.

Seth, can you take a look at it with a second pair of eyes? I am not very familiar with reflect. Just to check that my suggestion is not completely off base.

[1] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go
[2] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go#L394

Comment 8 Seth Jennings 2016-04-07 16:06:46 UTC
Found this:

Apparently this is a bug in the pprof tool: it can't flow past reflect.Value.Call. The reflect call itself is not responsible.

So, this is probably a leak, but memory profiling isn't going to show us where it is :-/
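For anyone reproducing this analysis: heap profiles like the ones in this bug are typically collected from a Go server's net/http/pprof endpoint. A minimal, self-contained sketch follows; the address, port, and helper name are my own for illustration, not the master's actual configuration:

```go
package main

import (
	"fmt"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"time"
)

// heapProfileStatus starts a profiling listener and fetches the heap
// profile the same way `go tool pprof <binary> http://ADDR/debug/pprof/heap`
// would. It returns the HTTP status code of the fetch.
func heapProfileStatus(addr string) (int, error) {
	go http.ListenAndServe(addr, nil) // serves the default mux, pprof handlers included
	var resp *http.Response
	var err error
	// Retry briefly while the listener comes up.
	for i := 0; i < 50; i++ {
		resp, err = http.Get("http://" + addr + "/debug/pprof/heap?debug=1")
		if err == nil {
			break
		}
		time.Sleep(20 * time.Millisecond)
	}
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := heapProfileStatus("localhost:6060")
	fmt.Println(status, err)
}
```

With the endpoint up, `go tool pprof` commands such as `top10` (seen in the comments below and above) operate on the fetched profile.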

Comment 9 Seth Jennings 2016-04-07 18:43:45 UTC
Created attachment 1144841 [details]

Ok, so Andy gave me an idea: install golang 1.6 with support for the -runtime flag and see if I couldn't get better profiling data. It worked, surprisingly.

(pprof) top10
2.01GB of 2.11GB total (95.28%)
Dropped 2925 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 192 (cum >= 0.02GB)
      flat  flat%   sum%        cum   cum%
    1.17GB 55.45% 55.45%     1.17GB 55.45%  runtime.makeslice
    0.42GB 19.89% 75.34%     0.42GB 19.89%  runtime.slicebytetostring
    0.23GB 10.77% 86.11%     0.24GB 11.22%  runtime.mapassign1

Attached is the graph.

The largest offender calling runtime.makeslice is api.DeepCopy_api_NodeStatus at >800MB. It looks like a leak in the DeepCopy code, specifically the code for NodeStatus.

Comment 10 Clayton Coleman 2016-04-07 18:46:26 UTC
Can you attach the pprof output and the binary used in a zip so we can examine it?  Thanks

Comment 11 Clayton Coleman 2016-04-07 18:56:59 UTC
If deep copy is leaking, you should be able to easily reproduce this with a hot loop that does nothing except copy a Node object (look at pkg/api/serialization_test.go for some benchmarks that do something similar).

You should also look at --inuse_objects and count the number of *api.Node references being held. If you don't see a lot, then it's likely the number you see here is garbage-related (creating transient nodes), which would point towards a hot loop of some kind.
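A hot-loop reproduction along these lines could be sketched as follows. The types here are simplified stand-ins for api.NodeStatus and api.ContainerImage, not the real API types, and the copy function only mirrors the shape of the generated DeepCopy code quoted later in this bug:

```go
package main

import "fmt"

// containerImage and nodeStatus are hypothetical stand-ins; the real
// api.NodeStatus carries the Images slice that dominated the profile.
type containerImage struct {
	Names     []string
	SizeBytes int64
}

type nodeStatus struct {
	Images []containerImage
}

// deepCopyNodeStatus mirrors the shape of the generated
// DeepCopy_api_NodeStatus: allocate a fresh slice, then copy each element.
func deepCopyNodeStatus(in nodeStatus) nodeStatus {
	out := nodeStatus{}
	if in.Images != nil {
		out.Images = make([]containerImage, len(in.Images))
		for i := range in.Images {
			out.Images[i] = containerImage{
				Names:     append([]string(nil), in.Images[i].Names...),
				SizeBytes: in.Images[i].SizeBytes,
			}
		}
	}
	return out
}

func main() {
	in := nodeStatus{Images: make([]containerImage, 100)}
	// Hot loop: if copies leaked, a heap profile of this loop would show
	// steadily growing inuse space under deepCopyNodeStatus. With a
	// healthy GC, each iteration's copy becomes garbage immediately.
	var out nodeStatus
	for i := 0; i < 10000; i++ {
		out = deepCopyNodeStatus(in)
	}
	fmt.Println(len(out.Images))
}
```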

Comment 14 Seth Jennings 2016-04-11 16:09:18 UTC
My current working theory is that the watch cache is not freeing old Nodes when they are updated.

This is the call stack of concern:


Note the call to sendModify() and the subsequent addToCache(), where etcd updated the Node and the watch cache caches that Node before returning the response to the watcher.

ROUTINE ========================
         0     1.17GB (flat, cum) 54.74% of Total
         .          .    370:		return
         .          .    371:	}
         .          .    372:	if w.include != nil && !w.include(res.Node.Key) {
         .          .    373:		return
         .          .    374:	}
         .     1.17GB    375:	curObj, err := w.decodeObject(res.Node)
         .          .    376:	if err != nil {

An exacerbating factor is the number of images on the nodes which creates large Node structures:

ROUTINE ========================

         .          .   1567:	if in.Images != nil {
         .          .   1568:		in, out := in.Images, &out.Images
         .   631.81MB   1569:		*out = make([]ContainerImage, len(in))
         .          .   1570:		for i := range in {
         .      153MB   1571:			if err := DeepCopy_api_ContainerImage(in[i], &(*out)[i], c); err != nil {
         .          .   1572:				return err
         .          .   1573:			}
         .          .   1574:		}

Comment 15 Timothy St. Clair 2016-04-12 13:29:32 UTC
I don't see how this is possible, as the watch caching was only enabled 11 hours ago - https://github.com/openshift/origin/pull/8395 .

Comment 16 Andy Goldstein 2016-04-12 13:32:09 UTC
This isn't related to the watch cache. This is a separate internal cache that etcdHelper uses when decoding list responses

Comment 17 Timothy St. Clair 2016-04-12 14:46:37 UTC
The objects in the decode cache [(*etcdHelper).addToCache] go back into util/cache.go, which is statically configured on startup. Objects that are no longer referenced "should be cleaned" by GC, unless there is some dangling reference outside of the cache itself.

Comment 18 Seth Jennings 2016-04-12 19:20:27 UTC
I'm starting to think this is not a memory leak, just a very large cache. The etcd cache limit is 50k items across all resource types. For debugging, I reduced the limit to 100 items so I could fill the cache in a reasonable time, and the number of inuse_objects allocated on the addToCache() path does stabilize, refuting the memory leak theory. The overall RSS for openshift does continue to grow, but mostly in the etcd server code (I assume it has its own cache that is growing).

In other news, that cache is pretty poorly designed. It is mostly full of obsolete versions of resources, and the eviction policy is not LRU. In some cases, I saw items added and then immediately evicted, i.e. addToCache() would return success yet the item would not be in the cache.

At the very least, we should do the eviction before the add, so that when addToCache() returns, the item is guaranteed to be in the cache.
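The evict-before-add idea can be illustrated with a small, hypothetical bounded cache. This is not the actual util/cache.go implementation, and it uses simple insertion-order eviction rather than whatever policy the real cache used:

```go
package main

import "fmt"

// boundedCache is a hypothetical sketch of a fixed-size decode cache.
// Evicting *before* inserting guarantees the just-added key survives
// the call, so Add never returns with its own entry already evicted.
type boundedCache struct {
	limit int
	data  map[string]string
	order []string // insertion order; oldest entry is the eviction victim
}

func newBoundedCache(limit int) *boundedCache {
	return &boundedCache{limit: limit, data: make(map[string]string)}
}

func (c *boundedCache) Add(key, val string) {
	if _, ok := c.data[key]; !ok {
		// Evict first, so the new entry cannot be the eviction victim.
		for len(c.data) >= c.limit {
			victim := c.order[0]
			c.order = c.order[1:]
			delete(c.data, victim)
		}
		c.order = append(c.order, key)
	}
	c.data[key] = val
}

func (c *boundedCache) Get(key string) (string, bool) {
	v, ok := c.data[key]
	return v, ok
}

func main() {
	c := newBoundedCache(2)
	c.Add("a", "1")
	c.Add("b", "2")
	c.Add("c", "3") // evicts "a" first; "c" is guaranteed present on return
	_, ok := c.Get("c")
	fmt.Println(ok)
}
```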

Comment 19 Timothy St. Clair 2016-04-12 20:06:42 UTC
> In other news, that cache is pretty poorly designed.  It is mostly full of 
> obsolete versions of resources and the eviction policy is not LRU.  In some 
> cases, I saw items added and then immediately evicted i.e. addToCache() would 
> return success and the item would not be in the cache.

Feel free to open an upstream item there.

Comment 20 Seth Jennings 2016-04-12 22:17:44 UTC
I've prepped a test machine with the debug build that reduces the cache size from 50k to 500, and vlaad is going to run the tests again tomorrow. The expected result is that we will see a drop-off in the RSS growth rate once the cache fills to capacity.

I opened this upstream to fix the glaring issue of immediate eviction:

I am also going to open another issue for improving the cache effectiveness, which personally I think is very low right now.

I think the cache would be just as effective at 1% of the size in the typical case (500 instead of 50k). I assume reasoning like "1000 nodes with 50 pods each" was used to determine this value. However, in the case of smaller setups, it just wastes a lot of memory caching obsolete resources.
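The presumed sizing rationale works out to exactly the default; a trivial sanity check (the function name is my own):

```go
package main

import "fmt"

// defaultCacheSize computes the back-of-the-envelope figure behind the
// presumed "1000 nodes with 50 pods each" rationale for the 50k default.
func defaultCacheSize(nodes, podsPerNode int) int {
	return nodes * podsPerNode
}

func main() {
	fmt.Println(defaultCacheSize(1000, 50)) // 50000 — matches the 50k default
}
```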

Comment 21 Seth Jennings 2016-04-13 21:01:03 UTC
The test is looking good.  openshift master is ~400MB after 6hrs and doesn't seem to be growing like it did before.

I'm going to work on a PR for upstream kube that makes the cache size user-configurable and a PR for origin that reduces the default size of the cache.

Comment 22 Seth Jennings 2016-04-13 21:08:17 UTC
Additional information: the addToCache() paths that accounted for over 1GB of allocations before currently account for only 3MB of usage in 8348 objects.

Comment 24 Seth Jennings 2016-04-14 17:46:15 UTC
Upstream PR for user configurable etcd cache size

Comment 25 Andy Goldstein 2016-04-15 01:23:16 UTC
Looks like someone beat Seth to the punch - 24261 was closed as https://github.com/kubernetes/kubernetes/pull/23914 was opened first (still awaiting merge upstream).

Comment 26 Andy Goldstein 2016-04-18 18:56:42 UTC
Too risky for 3.2. Will pick up in a future Kube rebase.

Comment 27 Seth Jennings 2016-06-09 15:39:32 UTC
https://github.com/kubernetes/kubernetes/pull/23914 was merged upstream to address this issue.  Should be picked up in the 3.3 rebase.

Comment 28 Andy Goldstein 2016-06-09 16:23:34 UTC
Moving back to POST until https://github.com/openshift/origin/pull/8856 is merged. Once it's merged, please move to ON_QA.

Comment 29 Andy Goldstein 2016-06-09 16:24:01 UTC
Oh wait, this is for OSE. Please move to MODIFIED once it's merged.

Comment 30 Seth Jennings 2016-06-16 21:58:44 UTC
Origin rebase is complete.

Comment 31 Troy Dawson 2016-07-20 22:11:57 UTC
This has been merged and is in OSE v3.3.0.8 or newer.

Comment 32 Vikas Laad 2016-07-22 14:58:24 UTC
I have started another run of reliability tests on v3.3.0.8, will update the bug with my findings.

Comment 33 Andy Goldstein 2016-07-22 14:59:49 UTC
We actually haven't changed anything about the cache size in OpenShift. All we have is the *ability* to change it, but we haven't done so.

Comment 34 Andy Goldstein 2016-07-22 15:01:50 UTC
*** Bug 1357057 has been marked as a duplicate of this bug. ***

Comment 35 Timothy St. Clair 2016-07-22 15:30:10 UTC
Appropriate sizing should also be noted: https://docs.openshift.org/latest/install_config/install/prerequisites.html

Comment 36 Andy Goldstein 2016-07-22 18:45:36 UTC
Docs PR: https://github.com/openshift/openshift-docs/pull/2542.

I'm going to leave this ON_QA since the cache size can be changed as per the above PR.

Comment 37 Vikas Laad 2016-08-29 18:11:42 UTC
3.3 Run update :

- Default cache size is 50000; after running the tests for 10 days, memory consumption by the openshift master process increased to 4.6G.
- Added the following setting in master-config, set to 1000:
    - /etc/origin/cloudprovider/aws.conf
    - aws
    - "1000"

- Also enabled pprof for master process on the cluster, will monitor for few more days.
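For reference, a hedged sketch of what that master-config fragment may have looked like. The key names are my reconstruction: the cloud-provider entries are assumed from the values listed above, and deserialization-cache-size is the apiServerArguments option corresponding to the flag introduced by the upstream PR:

```yaml
# Hypothetical /etc/origin/master/master-config.yaml fragment
kubernetesMasterConfig:
  apiServerArguments:
    cloud-config:
    - /etc/origin/cloudprovider/aws.conf
    cloud-provider:
    - aws
    deserialization-cache-size:
    - "1000"
```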

Andy, should we create a bug to reduce the default cache size to a lower number, or is the default of 50K the right number?

Comment 38 Andy Goldstein 2016-08-29 18:15:27 UTC
Let's create a new bug to lower the default value, and we can argue in that bug what the value should actually be :-)

Comment 39 Vikas Laad 2016-09-02 13:36:07 UTC
Created 2 different bugs:
1. Changing the default number to something lower than 5K
2. Memory usage still increasing; created another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1371985

Marking this bug verified, as the number of objects can be configured in master-config.

Comment 41 errata-xmlrpc 2016-09-27 09:37:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

