Bug 1588244
| Summary: | Slow memory leak in controllers process | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.10.0 | CC: | aos-bugs, ccoleman, jokerman, mifiedle, mmccomas, scuppett, ssorce, wsun |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-20 21:37:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2018-06-07 00:27:10 UTC
Does this cluster have very many rolebindings? Looking at the code you pointed out, I think you are simply looking at the AuthorizationCache. I am not aware of any changes in this area from 3.9...

Forget about the metrics I posted; they were captured from the apiserver, not the controllers (I just learned I used the wrong entrypoint...).

Clayton, I can still see a little growth in the controllers; however, when I took the heap dump this morning, the Sprintf moved down:

3154342134 13.69% 13.69% 3154342134 13.69% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).GetOwnerReferences
2798705553 12.14% 25.83% 5952621675 25.83% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.GetControllerOf
2727453329 11.83% 37.66% 3631254381 15.75% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/util/operationexecutor.generateVolumeMsgDetailed
2112263143 9.16% 46.82% 2121889210 9.21% fmt.Sprintf
1810230245 7.85% 54.68% 6342761169 27.52% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*VolumeToAttach).GenerateMsgDetailed
904239872 3.92% 58.60% 904239872 3.92% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/attachdetach/cache.getPodsFromMap
808289462 3.51% 62.11% 808289462 3.51% reflect.unsafe_New

I guess we need a couple more hours to tell the difference in allocated objects.

Moving this out of the 3.10 blocker list. I will file a kube issue about statefulset handling of sets with 200000 replicas.

Setting this back to blocker, as the memory growth is faster than we thought and we need to figure out where we are leaking.

Fixed in 3.10.6-1.
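The listing above is the flat/cumulative allocation output that `go tool pprof` prints for a heap profile. The report does not say exactly how the dump was captured; as a minimal sketch, assuming access to the process (or an instrumented build), an equivalent profile can be written with runtime/pprof and then inspected with `go tool pprof`:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes the current heap profile (which carries the
// alloc_space data shown in the listing above) to the given path.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // flush recent allocations into up-to-date heap statistics
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
	// Inspect with:  go tool pprof -sample_index=alloc_space heap.pprof
	// Running `top` at the prompt produces flat/cum output like the one above.
}
```

When the controllers process exposes the standard net/http/pprof handlers, the same profile can instead be pulled remotely by pointing `go tool pprof` at its /debug/pprof/heap endpoint.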
Test scenario:
Rapidly create and delete pods containing tolerations. For example, repeatedly create 1000 of these in a namespace, then delete all pods in the namespace:
apiVersion: v1
kind: Pod
metadata:
  generateName: test-pod-
spec:
  containers:
  - image: busybox
    name: busybox
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
Prior to https://github.com/openshift/origin/pull/20071, the memory of the controllers process grows steadily (local tests showed ~250MB of growth in 20 minutes).
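The report does not include the patched code itself, but the reproducer above points at the usual shape of this kind of controller leak: per-pod bookkeeping added on every create and never removed on delete, so rapid churn of uniquely named pods grows a map without bound. The following is an illustrative sketch only, with a hypothetical `tracker` type; it is not the code changed by openshift/origin#20071:

```go
package main

import "fmt"

// podState stands in for whatever per-pod data a controller caches
// (tolerations, volumes to attach, generated log messages, ...).
type podState struct {
	tolerations []string
}

// tracker keeps per-pod state keyed by namespace/name. If OnPodDelete is
// never wired up to the informer's delete handler, every generateName'd
// pod leaves a permanent entry behind and memory grows with churn.
type tracker struct {
	byPod map[string]podState
}

func (t *tracker) OnPodAdd(key string, s podState) {
	t.byPod[key] = s
}

func (t *tracker) OnPodDelete(key string) {
	delete(t.byPod, key)
}

func main() {
	t := &tracker{byPod: map[string]podState{}}
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("test/test-pod-%d", i)
		t.OnPodAdd(key, podState{tolerations: []string{"node.kubernetes.io/not-ready"}})
		t.OnPodDelete(key) // comment this out and the map retains all 1000 entries
	}
	fmt.Println("entries retained:", len(t.byPod))
}
```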
Hi Mike Fiedler:
Tested with openshift v3.10.7, checking whether the memory of the controllers pod still grows steadily.
Steps:
1) Create more than 1000 pods in the namespace with this file:
apiVersion: v1
kind: Pod
metadata:
  annotations:
  generateName: test-pod-
  labels:
    name: hello-openshift
spec:
  containers:
  - image: aosqe/hello-openshift
    imagePullPolicy: IfNotPresent
    name: hello-openshift  # container name is required; assumed from the pod label
  tolerations:
  - effect: NoSchedule
    key: key1
    operator: Equal
    value: value1
  - effect: NoSchedule
    key: key2
    operator: Equal
    value: value2
2) Inspect the memory of the controllers pod;
3) Delete all the pods in the namespace;
4) Recreate the 1000 pods;
5) Check the memory of the controllers pod (a client-go sketch of this create/delete loop follows the note below);
Result:
The controllers pod's 'MEM USAGE' comes back down once all pods are deleted. With 1000 pods created, the largest 'MEM USAGE' growth is about 80MB.
Note:
In my environment only about 250 pods were Running on one node; the others were Pending. But I think this is enough for the controller test. What do you think, @mifiedle?
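For reference, the create/delete churn from the steps above can also be driven programmatically. This is a minimal sketch assuming client-go (v0.18+ API style), a placeholder namespace `leak-test`, and the same image and tolerations as the YAML above; it is not part of the original verification, which used a YAML file:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns := "leak-test" // placeholder namespace, assumed to exist
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "test-pod-",
			Labels:       map[string]string{"name": "hello-openshift"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:            "hello-openshift",
				Image:           "aosqe/hello-openshift",
				ImagePullPolicy: corev1.PullIfNotPresent,
			}},
			Tolerations: []corev1.Toleration{
				{Key: "key1", Operator: corev1.TolerationOpEqual, Value: "value1", Effect: corev1.TaintEffectNoSchedule},
				{Key: "key2", Operator: corev1.TolerationOpEqual, Value: "value2", Effect: corev1.TaintEffectNoSchedule},
			},
		},
	}

	ctx := context.Background()

	// Steps 1 and 4: create ~1000 pods with tolerations.
	for i := 0; i < 1000; i++ {
		if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			log.Fatal(err)
		}
	}

	// Step 3: delete every pod in the namespace. Rerun the loop above while
	// watching the controllers pod's memory (steps 2 and 5).
	if err := client.CoreV1().Pods(ns).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{}); err != nil {
		log.Fatal(err)
	}
}
```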
Couldn't reproduce it in our long-running environment, so marking this verified.