Bug 1986452
| Summary: | Increase in RSS memory in CRI-O | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Raul Sevilla <rsevilla> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | MinLi <minmli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, kgordeev, minmli, nagrawal, rphillips, rsandu, rsevilla, schoudha, ssonigra, wking |
| Version: | 4.9 | Keywords: | Performance, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-49 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:42:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2002805 | | |
Description
Raul Sevilla
2021-07-27 15:03:45 UTC
I have some suspicions of where this regression came from, but it would be helpful to confirm. Can you try this test with the latest 4.8 nightly (4.8.2 should work as well)?

Hey Peter, find the 4.8.0-0.nightly-2021-06-02-025513 data here: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1622640729984&to=1622644648232&var-Datasource=ripsaw-kube-burner-public-production&var-sdn=openshift-sdn&var-job=node-density-heavy&var-uuid=8b2f7c71-e1c9-4114-a5a9-c72d4707258e&var-master=All&var-worker=ip-10-0-129-99.us-west-2.compute.internal&var-infra=ip-10-0-152-100.us-west-2.compute.internal&var-namespace=All
Memory usage is high as well. The 4.8.0-0.nightly-2021-05-10-092939 results do not show such a big RSS memory usage: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1621009275453&to=1621013141814&var-Datasource=ripsaw-kube-burner-public-production&var-sdn=openshift-sdn&var-job=node-density-heavy&var-uuid=7ccb7d75-7a94-44e7-a1eb-32e994bf2426&var-master=All&var-worker=ip-10-0-130-169.us-west-2.compute.internal&var-infra=ip-10-0-137-125.us-west-2.compute.internal&var-namespace=All

I believe the CRI-O portion of this bug will be addressed with the attached PR. I will need help verifying that, though.

Hi Raul Sevilla, can you help check the data of a 4.9 nightly build from the last two days, such as 4.9.0-0.nightly-2021-09-01-193941 or 4.9.0-0.nightly-2021-08-31-171539? Thanks.

I did a comparison between the 4.8 and 4.9 nightlies. Memory usage is in a similar range.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-09-01-001821 True False 63m Cluster version is 4.8.0-0.nightly-2021-09-01-001821
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-99.ap-south-1.compute.internal Ready worker 86m v1.21.1+9807387
ip-10-0-135-192.ap-south-1.compute.internal Ready master 96m v1.21.1+9807387
ip-10-0-170-110.ap-south-1.compute.internal Ready master 96m v1.21.1+9807387
ip-10-0-180-143.ap-south-1.compute.internal Ready worker 87m v1.21.1+9807387
ip-10-0-218-23.ap-south-1.compute.internal Ready master 96m v1.21.1+9807387
ip-10-0-219-43.ap-south-1.compute.internal Ready worker 87m v1.21.1+9807387
$ oc debug node/ip-10-0-130-99.ap-south-1.compute.internal
Starting pod/ip-10-0-130-99ap-south-1computeinternal-debug ...
...
sh-4.4# ps -p 1317 -o pid,rss,vsz,cmd
PID RSS VSZ CMD
1317 110240 2023968 /usr/bin/crio
sh-4.4# ps -p 1354 -o pid,rss,vsz,cmd
PID RSS VSZ CMD
1354 160648 1940516 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --co
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2021-09-01-193941 True False 125m Cluster version is 4.9.0-0.nightly-2021-09-01-193941
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-134-113.ap-south-1.compute.internal Ready worker 157m v1.22.0-rc.0+bbcc9ae
ip-10-0-136-105.ap-south-1.compute.internal Ready master 168m v1.22.0-rc.0+bbcc9ae
ip-10-0-170-97.ap-south-1.compute.internal Ready master 168m v1.22.0-rc.0+bbcc9ae
ip-10-0-183-52.ap-south-1.compute.internal Ready worker 157m v1.22.0-rc.0+bbcc9ae
ip-10-0-193-186.ap-south-1.compute.internal Ready worker 156m v1.22.0-rc.0+bbcc9ae
ip-10-0-201-41.ap-south-1.compute.internal Ready master 168m v1.22.0-rc.0+bbcc9ae
$ oc debug node/ip-10-0-134-113.ap-south-1.compute.internal
Starting pod/ip-10-0-134-113ap-south-1computeinternal-debug ...
...
sh-4.4# ps -p 1295 -o pid,rss,vsz,cmd
PID RSS VSZ CMD
1295 111392 2028600 /usr/bin/crio
sh-4.4# ps -p 1324 -o pid,rss,vsz,cmd
PID RSS VSZ CMD
1324 210332 2057152 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfi
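
For repeated comparisons like the one above, the per-node checks can be scripted. The following is a minimal sketch, not taken from the bug report, assuming `oc debug` access to the nodes and that `crio` and `kubelet` run as host processes (ps reports RSS and VSZ in KiB):

```sh
# Collect crio and kubelet memory usage across all worker nodes in one pass.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== ${node} =="
  oc debug "${node}" -- chroot /host ps -C crio,kubelet -o pid,rss,vsz,cmd
done
```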
(In reply to Peter Hunt from comment #6)
> I believe the CRI-O portion of this bug will be addressed with the attached PR. I will need help verifying that, though.

I have done some scale tests on a cluster using the PR you mention. I have seen some improvement (the top CRI-O-consuming pod was at 1.132 GiB), however it is still much higher than what was reported in 4.8.

On the other hand, while investigating this issue I realized that we made some changes to our workload (node-density-heavy) on 14 April. Those changes added an extra liveness probe to some of the pods deployed by this workload, which explains the increase in kubelet resource usage: https://github.com/cloud-bulldozer/benchmark-operator/commit/550ac7ec4107181f951cb31aa6a0a0138a884f78 (an illustrative sketch of this kind of change follows at the end of this comment thread). I have done some tests without this code change and the kubelet resident set size is similar to the 4.8 results. I am still investigating at which point (release) this RSS increase in CRI-O happened.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

*** Bug 2014432 has been marked as a duplicate of this bug. ***
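
For context on the workload change mentioned above, the linked benchmark-operator commit added an extra liveness probe to some node-density-heavy pods. The snippet below is a hypothetical illustration of that kind of change, not the actual commit contents; the pod name, image, and probe command are made up for the example:

```sh
# Hypothetical pod with an extra exec liveness probe; each probe adds periodic
# work for the kubelet on the node running the pod.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    livenessProbe:
      exec:
        command: ["true"]
      periodSeconds: 10
EOF
```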