
Bug 1588244

Summary: Slow memory leak in controllers process
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED CURRENTRELEASE
QA Contact: zhou ying <yinzhou>
Severity: high
Priority: urgent
Version: 3.10.0
CC: aos-bugs, ccoleman, jokerman, mifiedle, mmccomas, scuppett, ssorce, wsun
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-12-20 21:37:06 UTC
Type: Bug

Description Clayton Coleman 2018-06-07 00:27:10 UTC
https://www.dropbox.com/s/cvvrwafm106wu1v/Screenshot%202018-06-06%2020.26.11.png?dl=0

The ca-central-1 controllers show a slow, non-terminating growth in memory use. Needs investigation.

Must be triaged for 3.10.

Comment 5 Simo Sorce 2018-06-07 11:56:08 UTC
Does this cluster have very many rolebindings?
Looking at the code you pointed out, I think you are simply seeing the AuthorizationCache.
I am not aware of any changes in this area since 3.9...

Comment 6 Michal Fojtik 2018-06-07 13:18:04 UTC
Forget about the metrics I posted; they were captured from the apiserver, not the controllers (I just learned I used the wrong entrypoint...).

Comment 8 Michal Fojtik 2018-06-08 06:25:49 UTC
Clayton, I can still see a little growth in the controllers; however, when I took the heap dump this morning, fmt.Sprintf moved down:

3154342134 13.69% 13.69% 3154342134 13.69%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).GetOwnerReferences
2798705553 12.14% 25.83% 5952621675 25.83%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.GetControllerOf
2727453329 11.83% 37.66% 3631254381 15.75%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/util/operationexecutor.generateVolumeMsgDetailed
2112263143  9.16% 46.82% 2121889210  9.21%  fmt.Sprintf
1810230245  7.85% 54.68% 6342761169 27.52%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*VolumeToAttach).GenerateMsgDetailed
 904239872  3.92% 58.60%  904239872  3.92%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/attachdetach/cache.getPodsFromMap
 808289462  3.51% 62.11%  808289462  3.51%  reflect.unsafe_New

I guess we need a couple more hours to tell the difference in allocated objects.

Moving this out of the 3.10 blocker list. I will file a kube issue about statefulset handling of sets with 200000 replicas.
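
The allocation profile above looks like go tool pprof output (flat and cumulative allocated bytes per function). For reference, a minimal Go sketch of one way to expose and dump such a heap profile from a long-running process is below; the port, file name, and placement are illustrative, not how the controllers binary is actually instrumented.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"os"
	"runtime/pprof"
)

// dumpHeap writes the current heap profile to path so it can be inspected
// later with go tool pprof.
func dumpHeap(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.WriteHeapProfile(f)
}

func main() {
	// Serve live profiles over HTTP; a profile can then be fetched with
	// go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... run the workload here, then snapshot the heap on demand:
	if err := dumpHeap("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}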

Comment 10 Michal Fojtik 2018-06-20 14:03:59 UTC
Setting this back to blocker, as the memory growth is faster than we thought and we need to figure out where we are leaking.

Comment 20 Jordan Liggitt 2018-06-22 18:20:45 UTC
fixed in 3.10.6-1


Test scenario:
Rapidly create and delete pods containing tolerations. For example, repeatedly create 1000 of these in a namespace, then delete all pods in the namespace:

apiVersion: v1
kind: Pod
metadata:
  generateName: test-pod-
spec:
  containers:
  - image: busybox
    name: busybox
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists

Prior to https://github.com/openshift/origin/pull/20071, memory of the controllers process grows steadily (local tests showed ~250MB growth in 20 minutes).
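
A minimal sketch of that create/delete loop using client-go is below, assuming a kubeconfig with access to the cluster. The namespace name, loop counts, and the context-taking method signatures of recent client-go releases are illustrative; the same loop can of course be driven with oc create / oc delete instead.

package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ns := "leak-test" // illustrative namespace, assumed to exist
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "test-pod-"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "busybox", Image: "busybox"}},
			Tolerations: []corev1.Toleration{
				{Effect: corev1.TaintEffectNoExecute, Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists},
				{Effect: corev1.TaintEffectNoExecute, Key: "node.kubernetes.io/unreachable", Operator: corev1.TolerationOpExists},
				{Effect: corev1.TaintEffectNoSchedule, Key: "node.kubernetes.io/disk-pressure", Operator: corev1.TolerationOpExists},
				{Effect: corev1.TaintEffectNoSchedule, Key: "node.kubernetes.io/memory-pressure", Operator: corev1.TolerationOpExists},
			},
		},
	}

	ctx := context.Background()
	for round := 0; round < 10; round++ {
		// Create 1000 pods with tolerations, then delete them all.
		for i := 0; i < 1000; i++ {
			if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
				log.Println("create:", err)
			}
		}
		if err := client.CoreV1().Pods(ns).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
			log.Println("delete:", err)
		}
	}
}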

Comment 21 zhou ying 2018-06-25 08:21:47 UTC
Hi Mike Fiedler:

Tested with openshift v3.10.7, checking whether the memory of the controllers pod grows steadily.

Steps:
1) Create more than 1000 pods in the namespace, using this file:
apiVersion: v1
kind: Pod
metadata:
  generateName: test-pod-
  labels:
    name: hello-openshift
spec:
  containers:
  - image: aosqe/hello-openshift
    imagePullPolicy: IfNotPresent
    name: hello-openshift
  tolerations:
  - effect: NoSchedule
    key: key1
    operator: Equal
    value: value1
  - effect: NoSchedule
    key: key2
    operator: Equal
    value: value2
2) Inspect the memory of the controllers pod;
3) Delete all the pods in the namespace;
4) Recreate the 1000 pods;
5) Check the memory of the controllers pod again (one way to watch it is sketched after this comment).

Result:
   The controllers pod 'MEM USAGE' goes back down once all the pods are deleted. With 1000 pods created, the largest 'MEM USAGE' growth is about 80MB.


Note:
    In my env, only about 250 pods were Running on one node; the others were Pending. But I think this is enough for the controller test. What do you think, @mifiedle?
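
For steps 2 and 5, one way to watch the memory of the controllers process over time is to poll its RSS from /proc on the master. This is only a sketch: it assumes a Linux host and that the PID of the controllers process is passed as the first argument; the QA run above may simply have read 'MEM USAGE' from the container runtime's stats.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssKB returns the resident set size of the given process, as reported by
// the VmRSS line of /proc/<pid>/status (Linux only).
func rssKB(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %s", pid)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: memwatch <controllers-pid>")
		os.Exit(1)
	}
	pid := os.Args[1]
	// Print the RSS every 30 seconds so growth over time is visible.
	for {
		rss, err := rssKB(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s RSS: %s\n", time.Now().Format(time.RFC3339), rss)
		time.Sleep(30 * time.Second)
	}
}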


Comment 23 zhou ying 2018-07-03 01:07:03 UTC
Couldn't reproduce it in our long-running environment, so marking this as verified.