Bug 1323733
Summary: etcd cache in kube api server with large fixed size causing high memory usage

Product: OpenShift Container Platform
Component: Node
Version: 3.2.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Vikas Laad <vlaad>
Assignee: Seth Jennings <sjenning>
QA Contact: Vikas Laad <vlaad>
CC: agoldste, aos-bugs, ccoleman, jkaur, jokerman, mifiedle, mmccomas, pep, sjenning, stwalter, tdawson, tstclair, xtian
Target Milestone: ---
Target Release: 3.3.1
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Clones: 1435695 (view as bug list)
Bug Blocks: 1435695
Type: Bug
Last Closed: 2016-09-27 09:37:27 UTC
Created attachment 1143370 [details]
Memory Utilization Graph
As of today (4 April), there is a running environment in this state if someone wants to access it. Contact vlaad directly for details.

Created attachment 1144269 [details]
memory-profile.svg
After enabling profiling and letting it run for 18 hours or so, I got this memory profile (see attachment).
reflect.Value.Call along the etcdWatcher path is the part that is growing over time. I'm not sure if this is expected (i.e. a cache) or not (i.e. leak).
Jan, could you please take a look?

Sure, I have been looking at it for some time now.

Created attachment 1144755 [details]
memory-profile-2.svg
New memory profile after about 40 hours.
```
(pprof) top10
1870.05MB of 2096.66MB total (89.19%)
Dropped 2413 nodes (cum <= 10.48MB)
Showing top 10 nodes out of 142 (cum >= 26.51MB)
      flat  flat%   sum%        cum   cum%
 1012.05MB 48.27% 48.27%  1012.05MB 48.27%  reflect.Value.call
  385.22MB 18.37% 66.64%   385.22MB 18.37%  github.com/ugorji/go/codec.(*jsonDecDriver).DecodeString
```
reflect.Value.call continues to be the primary reason for the increased size; DecodeString to a lesser degree.
From my interpretation of the memory profiling graph, the 560.31MB are accumulated inside the reflect.Value.call function [1], starting at line 314. Compared to the latest commit in github.com/golang/go, the function has been updated since go-1.4, which is what we currently have in RHEL. I am not sure whether that has any effect on the memory leak (if it is a leak). Most of the function consists of checks. The call function allocates memory for the args variable on line 394 [2]; args is used to store the parameters for the call. Once the call is done, a loop copies the results of the call into the ret slice. Here is the code:

```go
off = retOffset
for i := 0; i < nout; i++ {
	tv := t.Out(i)
	a := uintptr(tv.Align())
	off = (off + a - 1) &^ (a - 1)
	fl := flagIndir | flag(tv.Kind())
	ret[i] = Value{tv.common(), unsafe.Pointer(uintptr(args) + off), fl}
	off += tv.Size()
}
return ret
```

If the value of retOffset is non-zero, the memory in [uintptr(args), uintptr(args) + retOffset) is never freed. Seth, can you look at it with a second pair of eyes? I am not very familiar with reflect; I just want to check that my suggestion is not completely off base.

[1] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go
[2] https://github.com/golang/go/blob/883bc6ed0ea815293fe6309d66f967ea60630e87/src/reflect/value.go#L394

Found this: https://github.com/golang/go/issues/11786

Apparently this is a bug in the pprof tool: it can't follow the call flow past reflect.Value.Call, so the reflect call itself is not responsible. So this is probably a leak, but memory profiling isn't going to show us where it is :-/

Created attachment 1144841 [details]
memory-profile-3.svg
OK, so Andy gave me the idea to install golang 1.6, which supports the -runtime flag, and see if I could get better profiling data. Surprisingly, it worked.
```
(pprof) top10
2.01GB of 2.11GB total (95.28%)
Dropped 2925 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 192 (cum >= 0.02GB)
      flat  flat%   sum%        cum   cum%
    1.17GB 55.45% 55.45%     1.17GB 55.45%  runtime.makeslice
    0.42GB 19.89% 75.34%     0.42GB 19.89%  runtime.slicebytetostring
    0.23GB 10.77% 86.11%     0.24GB 11.22%  runtime.mapassign1
```
Attached is the graph.
The largest offender calling runtime.makeslice is api.DeepCopy_api_NodeStatus at >800MB. This looks to be a leak in the DeepCopy code, specifically the copy code for NodeStatus.
Can you attach the pprof output and the binary used in a zip so we can examine it? Thanks.

If deep copy is leaking, you should be able to easily reproduce this with a hot loop that does nothing except copy a Node object (look at pkg/api/serialization_test.go for some benchmarks that do something similar). You should also look at --inuse_objects and count the number of *api.Node references that are being held. If you don't see a lot, then the number you see here is likely garbage related (creating transient nodes), which would point towards a hot loop of some kind.

My current working theory is that the watch cache is not freeing old Nodes when they are updated. This is the call stack of concern:

```
k8s.io/kubernetes/pkg/api.DeepCopy_api_Node+0x359
runtime.call1024+0x5c
reflect.Value.call+0x10e5
reflect.Value.Call+0xbc
k8s.io/kubernetes/pkg/conversion.(*Cloner).customDeepCopy+0x279
k8s.io/kubernetes/pkg/conversion.(*Cloner).deepCopy+0x25b
k8s.io/kubernetes/pkg/conversion.(*Cloner).defaultDeepCopy+0xb95
k8s.io/kubernetes/pkg/conversion.(*Cloner).deepCopy+0x2bc
k8s.io/kubernetes/pkg/conversion.(*Cloner).DeepCopy+0xef
k8s.io/kubernetes/pkg/runtime.(*Scheme).DeepCopy+0x64
k8s.io/kubernetes/pkg/runtime.(*Scheme).Copy+0x81
k8s.io/kubernetes/pkg/storage/etcd.(*etcdHelper).addToCache+0xd6
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).decodeObject+0x69c
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendModify+0x171
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendResult+0xde
k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).translate+0x5d0
```

Note the call to sendModify() and the subsequent addToCache(), where etcd updated the Node and the watch cache is caching that Node before returning the response to the watcher.

```
ROUTINE ======================== k8s.io/kubernetes/pkg/storage/etcd.(*etcdWatcher).sendModify
         0     1.17GB (flat, cum) 54.74% of Total
         .          .    370:		return
         .          .    371:	}
         .          .    372:	if w.include != nil && !w.include(res.Node.Key) {
         .          .    373:		return
         .          .    374:	}
         .     1.17GB    375:	curObj, err := w.decodeObject(res.Node)
         .          .    376:	if err != nil {
```

An exacerbating factor is the number of images on the nodes, which creates large Node structures:

```
ROUTINE ======================== k8s.io/kubernetes/pkg/api.DeepCopy_api_NodeStatus
         .          .   1567:	if in.Images != nil {
         .          .   1568:		in, out := in.Images, &out.Images
         .   631.81MB   1569:		*out = make([]ContainerImage, len(in))
         .          .   1570:		for i := range in {
         .      153MB   1571:			if err := DeepCopy_api_ContainerImage(in[i], &(*out)[i], c); err != nil {
         .          .   1572:				return err
         .          .   1573:			}
         .          .   1574:		}
```

I don't see how this is possible, as watch caching was only enabled 11 hours ago: https://github.com/openshift/origin/pull/8395

This isn't related to the watch cache. This is a separate internal cache that etcdHelper uses when decoding list responses.

The objects on the decode cache path [(*etcdHelper).addToCache] go back into util/cache.go, which is statically configured on startup; objects that are no longer referenced "should be cleaned" by GC, unless there is some dangling reference outside of the cache itself.

I'm starting to think this is not a memory leak, just a very large cache. The etcd cache limit is 50k items across all resource types. For debugging, I reduced the limit to 100 items so I could fill the cache in a reasonable time, and the number of inuse_objects allocated on the addToCache() path does stabilize, refuting the memory-leak theory. The overall RSS for openshift does continue to grow, but mostly in the etcd server code (I assume it has its own cache that is growing).

In other news, that cache is pretty poorly designed. It is mostly full of obsolete versions of resources, and the eviction policy is not LRU. In some cases, I saw items added and then immediately evicted, i.e. addToCache() would return success and the item would not be in the cache. At the very least, we should do the eviction before the add, so that when addToCache() returns, the item is guaranteed to be in the cache.
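The hot-loop reproduction suggested above can be sketched as follows. NodeStatus and copyNodeStatus here are hypothetical stand-ins for the real api.NodeStatus and the generated DeepCopy_api_NodeStatus code, shaped to mirror the allocation pattern in the profile (a fresh slice per copy, dominated by the Images field):

```go
package main

import "fmt"

// Hypothetical stand-ins for api.ContainerImage and api.NodeStatus.
// The real NodeStatus carries a large Images slice, which is what
// DeepCopy_api_NodeStatus allocates in the profile above.
type ContainerImage struct {
	Names []string
}

type NodeStatus struct {
	Images []ContainerImage
}

// copyNodeStatus mimics the shape of the generated deep-copy code:
// allocate a fresh slice, then copy element by element.
func copyNodeStatus(in NodeStatus) NodeStatus {
	out := NodeStatus{}
	if in.Images != nil {
		out.Images = make([]ContainerImage, len(in.Images))
		for i := range in.Images {
			out.Images[i].Names = append([]string(nil), in.Images[i].Names...)
		}
	}
	return out
}

func main() {
	status := NodeStatus{Images: make([]ContainerImage, 100)}
	// Hot loop: every iteration allocates a fresh copy. If references
	// to the copies are retained (as the decode cache did), RSS grows;
	// if not, the copies are garbage and the GC reclaims them.
	for i := 0; i < 10000; i++ {
		_ = copyNodeStatus(status)
	}
	fmt.Println("done")
}
```

Watching --inuse_objects while a loop like this runs is the quickest way to tell retained objects apart from transient garbage, as suggested in the comment above.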
> In other news, that cache is pretty poorly designed. It is mostly full of
> obsolete versions of resources and the eviction policy is not LRU. In some
> cases, I saw items added and then immediately evicted i.e. addToCache() would
> return success and the item would not be in the cache.
Feel free to open an upstream item there.
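The evict-before-add ordering discussed above can be sketched with a minimal bounded cache. This is an illustrative toy, not the actual util/cache.go code: it uses FIFO eviction for brevity, and the point is only the ordering guarantee that an item just added cannot be the one evicted:

```go
package main

import "fmt"

// boundedCache is a hypothetical sketch of a size-limited cache.
type boundedCache struct {
	maxSize int
	keys    []string          // insertion order; FIFO eviction for simplicity
	items   map[string]string
}

func newBoundedCache(maxSize int) *boundedCache {
	return &boundedCache{maxSize: maxSize, items: make(map[string]string)}
}

// add evicts BEFORE inserting. The broken pattern described above
// evicted after the insert, which could immediately remove the item
// that was just added while still reporting success.
func (c *boundedCache) add(key, value string) {
	if _, ok := c.items[key]; !ok {
		for len(c.items) >= c.maxSize {
			oldest := c.keys[0]
			c.keys = c.keys[1:]
			delete(c.items, oldest)
		}
		c.keys = append(c.keys, key)
	}
	c.items[key] = value
}

func (c *boundedCache) get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newBoundedCache(2)
	c.add("a", "1")
	c.add("b", "2")
	c.add("c", "3") // evicts "a" first, then inserts "c"
	_, gotA := c.get("a")
	_, gotC := c.get("c")
	fmt.Println(gotA, gotC)
}
```

With this ordering, when add() returns the new item is guaranteed to be present, which is the invariant the upstream fix restores.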
I've prepped a test machine with the debug build that reduces the cache size from 50k to 500, and vlaad is going to run the tests again tomorrow. The expected result is that we will see a drop-off in the RSS growth rate once the cache fills to capacity.

I opened this upstream to fix the glaring issue of immediate eviction: https://github.com/kubernetes/kubernetes/pull/24164

I am also going to open another issue for improving the cache effectiveness, which I personally think is very low right now. I think the cache would be just as effective at 1% of its current size in the typical case (500 instead of 50k). I assume reasoning like "1000 nodes with 50 pods each" was used to determine this value. In smaller setups, however, it just wastes a lot of memory caching obsolete resources.

The test is looking good. The openshift master is at ~400MB after 6 hours and doesn't seem to be growing like it did before. I'm going to work on a PR for upstream kube that makes the cache size user configurable and a PR for origin that reduces the default size of the cache.

Additional information: the addToCache() paths that accounted for over 1GB of allocations before currently account for only 3MB of usage across 8348 objects.

Upstream PR for user-configurable etcd cache size: https://github.com/kubernetes/kubernetes/pull/24261

Looks like someone beat Seth to the punch - 24261 was closed because https://github.com/kubernetes/kubernetes/pull/23914 was opened first (still awaiting merge upstream).

Too risky for 3.2. Will pick up in a future Kube rebase.

https://github.com/kubernetes/kubernetes/pull/23914 was merged upstream to address this issue. It should be picked up in the 3.3 rebase.

Moving back to POST until https://github.com/openshift/origin/pull/8856 is merged. Once it's merged, please move to ON_QA. Oh wait, this is for OSE. Please move to MODIFIED once it's merged.

Origin rebase is complete. This has been merged and is in OSE v3.3.0.8 or newer.
I have started another run of reliability tests on v3.3.0.8, will update the bug with my findings.

We actually haven't changed anything about the cache size in OpenShift. All we have is the *ability* to change it, but we haven't done so.

*** Bug 1357057 has been marked as a duplicate of this bug. ***

Appropriate sizing should also be noted in the docs: https://docs.openshift.org/latest/install_config/install/prerequisites.html

Docs PR: https://github.com/openshift/openshift-docs/pull/2542. I'm going to leave this ON_QA since the cache size can be changed as per the above PR.

3.3 run update:

- The default cache size is 50000; after running the tests for 10 days, memory consumption by the openshift master process increased to 4.6G.
- Added the following setting to master-config to reduce the cache size to 1000:

```yaml
kubernetesMasterConfig:
  admissionConfig:
    pluginConfig: {}
  apiServerArguments:
    cloud-config:
    - /etc/origin/cloudprovider/aws.conf
    cloud-provider:
    - aws
    deserialization-cache-size:
    - "1000"
```

- Also enabled pprof for the master process on the cluster; will monitor for a few more days.

Andy, should we create a bug to reduce the default cache size to a lower number, or is the default of 50K the right number?

Let's create a new bug to lower the default value, and we can argue in that bug about what the value should actually be :-)

Created 2 different bugs:

1. Changing the default number to something lower than 5K
2. Memory usage is still seen; created another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1371985

Marking this bug verified, as the number of objects can be configured in master-config.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933
Created attachment 1143369 [details]
pidstat

Description of problem:
Memory utilization by the openshift master process increases continuously when running Application Reliability tests. These tests create a number of applications (10 in this case) at the beginning and keep scaling/building/accessing them over a period of time. The applications are created using templates (eap64-mysql-s2i, dancer-mysql-example, cakephp-mysql). The tests were started on 3/29/2016 and are still running; please see the attached graph (pidstat by pbench).

Here is the cluster information: 1 master and 2 nodes, created in AWS (m4.xlarge).

Version-Release number of selected component (if applicable):

How reproducible:
Run the reliability tests for a few days.

Steps to Reproduce:
1. Create a cluster in AWS with 1 master and 2 nodes
2. Create 10 applications using different templates
3. Keep scaling up/down, building, and accessing them for a few days; watch memory utilization on the master node

Actual results:
Increasing memory consumption by the openshift master process

Expected results:
Memory consumption should be stable after a few hours

Additional info: