1323733 – etcd cache in kube api server with large fixed size causing high memory usage

Bug 1323733 - etcd cache in kube api server with large fixed size causing high memory usage

Summary: etcd cache in kube api server with large fixed size causing high memory usage

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.3.1
Assignee:	Seth Jennings
QA Contact:	Vikas Laad
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1357057 (view as bug list)
Depends On:
Blocks:	1435695
TreeView+	depends on / blocked

Reported:	2016-04-04 14:27 UTC by Vikas Laad
Modified:	2020-07-16 08:44 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1435695 (view as bug list)
Environment:
Last Closed:	2016-09-27 09:37:27 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
pidstat (144.44 KB, text/plain) 2016-04-04 14:27 UTC, Vikas Laad	no flags	Details
Memory Utilization Graph (269.72 KB, image/jpeg) 2016-04-04 14:29 UTC, Vikas Laad	no flags	Details
memory-profile.svg (90.71 KB, image/svg+xml) 2016-04-06 16:08 UTC, Seth Jennings	no flags	Details
memory-profile-2.svg (88.23 KB, image/svg+xml) 2016-04-07 14:21 UTC, Seth Jennings	no flags	Details
memory-profile-3.svg (99.04 KB, image/svg+xml) 2016-04-07 18:43 UTC, Seth Jennings	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1933	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.3 Release Advisory	2016-09-27 13:24:36 UTC

Description Vikas Laad 2016-04-04 14:27:33 UTC

Created attachment 1143369 [details]
pidstat

Description of problem:
Memory utilization by Openshift master process increasing continously when running Application Reliability tests. These tests create a bunch of applications (10 in this case) in the beginning and keep scaling/building/accessing them over the period of time. These applications are created using templates (eap64-mysql-s2i, dancer-mysql-example, cakephp-mysql). These tests were started on 3/29/2016 and they are still running, please see the graph attached, pidstat by pbench.

Here is the cluster information
1 Master and 2 Nodes created in AWS m4.xlarge

Version-Release number of selected component (if applicable):


How reproducible:
Running reliability tests for few days.

Steps to Reproduce:
1. Create a cluster in AWS 1 Master, 2 Nodes
2. Create 10 applications using different templates
3. Keep scaling up/down, building, accessing them for few days, watch memory utilization on Master node.

Actual results:
Increasing memory consumption by openshift master process

Expected results:
Memory consumption should be stable after few hours

Additional info:

Comment 1 Vikas Laad 2016-04-04 14:29:10 UTC

Created attachment 1143370 [details]
Memory Utilization Graph

Comment 2 Mike Fiedler 2016-04-04 14:33:34 UTC

As of today (4 April), there is a running environment in this state if someone wants to access it.   Contact vlaad directly for details.

Comment 3 Seth Jennings 2016-04-06 16:08:56 UTC

Created attachment 1144269 [details]
memory-profile.svg

After enabling profiling and letting it run for 18 hours or so, I got this memory profile (see attachment).

reflect.Value.Call along the etcdWatcher path is the part that is growing over time.  I'm not sure if this is expected (i.e. a cache) or not (i.e. leak).

Comment 4 Andy Goldstein 2016-04-07 10:40:23 UTC

Jan, could you please take a look?

Comment 5 Jan Chaloupka 2016-04-07 13:12:28 UTC

Sure, looking on it for some time now.

Comment 6 Seth Jennings 2016-04-07 14:21:21 UTC

Created attachment 1144755 [details]
memory-profile-2.svg

New memory profile after about 40 hours.

(pprof) top10
1870.05MB of 2096.66MB total (89.19%)
Dropped 2413 nodes (cum <= 10.48MB)
Showing top 10 nodes out of 142 (cum >= 26.51MB)
      flat  flat%   sum%        cum   cum%
 1012.05MB 48.27% 48.27%  1012.05MB 48.27%  reflect.Value.call
  385.22MB 18.37% 66.64%   385.22MB 18.37%  github.com/ugorji/go/codec.(*jsonDecDriver).DecodeString

reflect.Value.call continues to be the primary reason for the increased size; DecodeString to a lesser degree.

Comment 7 Jan Chaloupka 2016-04-07 14:42:41 UTC

From my interpretation of the memory profiling graph that the 560.31MB are accumulated inside of reflect.Value.call function [1] starting at line 314.

Compared to the latest commit in github.com/golang/go the function has been updated since go-1.4 which we have currently in RHEL. Not sure if it has any effect on the memory leak (if it is a leak). The most part of the funtion are checks.

The call function allocates memory for args variable on line 394 [2]. The args is used to store parameters for the call. Once the call is done, there is a loop that copies results of the call to ret slice. Here the code:


        off = retOffset
        for i := 0; i < nout; i++ {
                tv := t.Out(i)
                a := uintptr(tv.Align())
                off = (off + a - 1) &^ (a - 1)
                fl := flagIndir | flag(tv.Kind())
                ret[i] = Value{tv.common(), unsafe.Pointer(uintptr(args) + off), fl}
                off += tv.Size()
        }

        return ret

If the value of retOffset is non-zero, the memory in [uintptr(args), uintptr(args) + retOffset) is never freed.

Seth, can you look at it with the second pair of eyes? I am not familiar with the reflect much. Just to check if my suggestion is not completely out of bottle.

[1] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go
[2] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go#L394

Comment 8 Seth Jennings 2016-04-07 16:06:46 UTC

Found this:
https://github.com/golang/go/issues/11786

Apparently this is a bug in the pprof tool, that is can't flow past the reflect.Value.Call.  The reflect call itself is not responsible.

So, this is probably a leak but memory profiling isn't going to show us where it is :-/

Comment 9 Seth Jennings 2016-04-07 18:43:45 UTC

Created attachment 1144841 [details]
memory-profile-3.svg

Ok, so Andy gave me an idea to install golang 1.6 with support for the -runtime flag and see if I couldn't get better profiling data.  It worked, surprisingly.

(pprof) top10
2.01GB of 2.11GB total (95.28%)
Dropped 2925 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 192 (cum >= 0.02GB)
      flat  flat%   sum%        cum   cum%
    1.17GB 55.45% 55.45%     1.17GB 55.45%  runtime.makeslice
    0.42GB 19.89% 75.34%     0.42GB 19.89%  runtime.slicebytetostring
    0.23GB 10.77% 86.11%     0.24GB 11.22%  runtime.mapassign1

Attached is the graph.

The largest offender calling runtime.makeslice is api.DeepCopy_api_NodeStatus at >800MB.  Looks to be a leak in the DeepCopy code, specifically that for the NodeStatus.

Comment 10 Clayton Coleman 2016-04-07 18:46:26 UTC

Can you attach the pprof output and the binary used in a zip so we can examine it?  Thanks

Comment 11 Clayton Coleman 2016-04-07 18:56:59 UTC

If deep copy is leaking you should be able to easily reproduce this with a hot loop that does nothing except copy a Node object (look at pkg/api/serialization_test.go for some benchmarks that do similar).

You should also look at --inuse_objects and count the number of *api.Node that references are being held to.  If you don't see a lot, then it's likely the number you see here is garbage related (creating transient nodes) which would point towards a hotloop of some kind.

Comment 14 Seth Jennings 2016-04-11 16:09:18 UTC

My current working theory is a that the watch cache is not freeing old Nodes when they are updated.

This is the call stack of concern:

k8s.io/kubernetes/pkg/api.DeepCopy_api_Node+0x359
runtime.call1024+0x5c
reflect.Value.call+0x10e5
reflect.Value.Call+0xbc
k8s.io/kubernetes/pkg/conversion.(*Cloner).customDeepCopy+0x279
k8s.io/kubernetes/pkg/conversion.(*Cloner).deepCopy+0x25b
k8s.io/kubernetes/pkg/conversion.(*Cloner).defaultDeepCopy+0xb95
k8s.io/kubernetes/pkg/conversion.(*Cloner).deepCopy+0x2bc
k8s.io/kubernetes/pkg/conversion.(*Cloner).DeepCopy+0xef
k8s.io/kubernetes/pkg/runtime.(*Scheme).DeepCopy+0x64
k8s.io/kubernetes/pkg/runtime.(*Scheme).Copy+0x81
k8s.io/kubernetes/pkg/storage/etcd.(*etcdHelper).addToCache+0xd6
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).decodeObject+0x69c
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendModify+0x171
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendResult+0xde
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).translate+0x5d0

Note the call to sendModify() and subsequent addToCache() where etcd updated the Node and the watch cache is caching that Node before returning the response to the watcher.

ROUTINE ========================
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendModify
         0     1.17GB (flat, cum) 54.74% of Total
         .          .    370:		return
         .          .    371:	}
         .          .    372:	if w.include != nil && !w.include(res.Node.Key) {
         .          .    373:		return
         .          .    374:	}
         .     1.17GB    375:	curObj, err := w.decodeObject(res.Node)
         .          .    376:	if err != nil {

An exacerbating factor is the number of images on the nodes which creates large Node structures:

ROUTINE ========================
k8s.io/kubernetes/pkg/api.DeepCopy_api_NodeStatus

         .          .   1567:	if in.Images != nil {
         .          .   1568:		in, out := in.Images, &out.Images
         .   631.81MB   1569:		*out = make([]ContainerImage, len(in))
         .          .   1570:		for i := range in {
         .      153MB   1571:			if err := DeepCopy_api_ContainerImage(in[i], &(*out)[i], c); err != nil {
         .          .   1572:				return err
         .          .   1573:			}
         .          .   1574:		}

Comment 15 Timothy St. Clair 2016-04-12 13:29:32 UTC

I don't see how this is possible, as the watch caching was only enabled 11 hours ago - https://github.com/openshift/origin/pull/8395 .

Comment 16 Andy Goldstein 2016-04-12 13:32:09 UTC

This isn't related to the watch cache. This is a separate internal cache that etcdHelper uses when decoding list responses

Comment 17 Timothy St. Clair 2016-04-12 14:46:37 UTC

The objects in the decode cache [(*etcdHelper).addToCache] back into util/cache.go which is statically configured on startup, objects that are no longer referenced "should be cleaned" by GC, unless there is some dangling reference outside of the cache itself.

Comment 18 Seth Jennings 2016-04-12 19:20:27 UTC

I'm starting to think this is not a memory leak; just a very large cache.  The etcd cache limit is 50k items across all resource types.  For debugging, I reduced the limit to 100 items, so I could fill the cache in a reasonable time, and the number of inuse_objects allocated on the addToCache() path does stabilize, refuting the memory leak theory.  The overall RSS for openshift does continue to grow, but mostly in the etcd server code (I assume it has its own cache that is growing)

In other news, that cache is pretty poorly designed.  It is mostly full of obsolete versions of resources and the eviction policy is not LRU.  In some cases, I saw items added and then immediately evicted i.e. addToCache() would return success and the item would not be in the cache.

At the very least, we should do the eviction before the add so that with addToCache() returns, the item is guaranteed to be in the cache.

Comment 19 Timothy St. Clair 2016-04-12 20:06:42 UTC

> In other news, that cache is pretty poorly designed.  It is mostly full of 
> obsolete versions of resources and the eviction policy is not LRU.  In some 
> cases, I saw items added and then immediately evicted i.e. addToCache() would 
> return success and the item would not be in the cache.

Feel free to open an upstream item there.

Comment 20 Seth Jennings 2016-04-12 22:17:44 UTC

I've prepped a test machine with the debug build that reduces the cache size from 50k to 500 and vlaad is going to run the tests again tomorrow.  The expected result is that we will see a drop off in the RSS growth rate beyond when the cache fills to capacity.

I opened this upstream to fix the glaring issue of immediate eviction:
https://github.com/kubernetes/kubernetes/pull/24164

I am also going to open another issue for improving the cache effectiveness, which personally I think is very low right now.

I think the cache would be just as effective if it were 1% the size in the typical case (500 instead of 50k).  I assume reasoning was "1000 nodes with 50 pods each" was used to determine this value.  However, in the case of smaller setups, it just wastes a lot of memory caching obsolete resources.

Comment 21 Seth Jennings 2016-04-13 21:01:03 UTC

The test is looking good.  openshift master is ~400MB after 6hrs and doesn't seem to be growing like it did before.

I'm going to work on PR for upstream kube that makes the cache size user configurable and a PR for origin that reduces the default size of the cache.

Comment 22 Seth Jennings 2016-04-13 21:08:17 UTC

Addition information, the addToCache() paths that accounted for over 1GB of allocations before currently only account for 3MB of usage in 8348 objects.

Comment 24 Seth Jennings 2016-04-14 17:46:15 UTC

Upstream PR for user configurable etcd cache size
https://github.com/kubernetes/kubernetes/pull/24261

Comment 25 Andy Goldstein 2016-04-15 01:23:16 UTC

Looks like someone beat Seth to the punch - 24261 was closed as https://github.com/kubernetes/kubernetes/pull/23914 was opened first (still awaiting merge upstream).

Comment 26 Andy Goldstein 2016-04-18 18:56:42 UTC

Too risky for 3.2. Will pick up in a future Kube rebase.

Comment 27 Seth Jennings 2016-06-09 15:39:32 UTC

https://github.com/kubernetes/kubernetes/pull/23914 was merged upstream to address this issue.  Should be picked up in the 3.3 rebase.

Comment 28 Andy Goldstein 2016-06-09 16:23:34 UTC

Moving back to POST until https://github.com/openshift/origin/pull/8856 is merged. Once it's merged, please move to ON_QA.

Comment 29 Andy Goldstein 2016-06-09 16:24:01 UTC

Oh wait, this is for OSE. Please move to MODIFIED once it's merged.

Comment 30 Seth Jennings 2016-06-16 21:58:44 UTC

Origin rebase is complete.

Comment 31 Troy Dawson 2016-07-20 22:11:57 UTC

This has been merged and is in OSE v3.3.0.8 or newer.

Comment 32 Vikas Laad 2016-07-22 14:58:24 UTC

I have started another run of reliability tests on v3.3.0.8, will update the bug with my findings.

Comment 33 Andy Goldstein 2016-07-22 14:59:49 UTC

We actually haven't changed anything about the cache size in OpenShift. All we have is the *ability* to change it, but we haven't done so.

Comment 34 Andy Goldstein 2016-07-22 15:01:50 UTC

*** Bug 1357057 has been marked as a duplicate of this bug. ***

Comment 35 Timothy St. Clair 2016-07-22 15:30:10 UTC

Appropriate sizing should also be noted: https://docs.openshift.org/latest/install_config/install/prerequisites.html

Comment 36 Andy Goldstein 2016-07-22 18:45:36 UTC

Docs PR: https://github.com/openshift/openshift-docs/pull/2542.

I'm going to leave this ON_QA since the cache size can be changed as per the above PR.

Comment 37 Vikas Laad 2016-08-29 18:11:42 UTC

3.3 Run update :

- Default cache size is 50000, after running the tests for 10 days memory consumption increased to 4.6G by openshift master process.
- Added following setting in master-config to 1000
        
kubernetesMasterConfig:
  admissionConfig:
    pluginConfig:
      {}
  apiServerArguments:
    cloud-config:
    - /etc/origin/cloudprovider/aws.conf
    cloud-provider:
    - aws
    deserialization-cache-size:
    - "1000"

- Also enabled pprof for master process on the cluster, will monitor for few more days.

Andy, Should we create a bug to reduce default cache size to lower number or default 50K is right number?

Comment 38 Andy Goldstein 2016-08-29 18:15:27 UTC

Let's create a new bug to lower the default value, and we can argue in that bug what the value should actually be :-)

Comment 39 Vikas Laad 2016-09-02 13:36:07 UTC

Created 2 different bugs for 
1. Changing default number to something lower then 5K
2. Still memory usage found and created another bug https://bugzilla.redhat.com/show_bug.cgi?id=1371985

Making this bug verified as the the number of objects can be configured in master-config.

Comment 41 errata-xmlrpc 2016-09-27 09:37:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

Note You need to log in before you can comment on or make changes to this bug.