Bug 1986452

Summary: Increase in RSS memory in CRI-O
Product: OpenShift Container Platform
Reporter: Raul Sevilla <rsevilla>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: aos-bugs, kgordeev, minmli, nagrawal, rphillips, rsandu, rsevilla, schoudha, ssonigra, wking
Version: 4.9
Keywords: Performance, TestBlocker
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-49
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:42:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2002805

Description Raul Sevilla 2021-07-27 15:03:45 UTC
Hi,

During our performance benchmarks, we have observed a significant increase in resident set size (RSS) memory usage by the kubelet and CRI-O components.



The 4.8 report shows a maximum RSS of 667 MiB for CRI-O and 572 MiB for the kubelet:

Link: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1617632003313&to=1617634683027&var-Datasource=ripsaw-kube-burner-internal-production&var-sdn=openshift-sdn&var-job=node-density-heavy&var-uuid=809a6654-96dd-42ea-b899-50c71836ae7f&var-master=All&var-worker=ip-10-0-129-34.us-west-2.compute.internal&var-infra=ip-10-0-132-49.us-west-2.compute.internal&var-namespace=All&theme=light




The 4.9.0-0.nightly-2021-07-21-021707 report shows 1255 MiB for CRI-O and 961 MiB for the kubelet (much higher than the default SYSTEM_RESERVED_MEMORY of 1 GiB configured in the kubelet):

Link: http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1627381755305&to=1627383650780&var-Datasource=ripsaw-kube-burner-public-production&var-sdn=All&var-job=node-density-heavy&var-uuid=a38939a7-cb8f-4466-9d16-9a9dc3d94d87&var-master=All&var-worker=ip-10-0-129-34.us-west-2.compute.internal&var-infra=ip-10-0-138-187.us-west-2.compute.internal&var-namespace=All
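
As a side note (not part of the original report): the reserved-memory setting mentioned above can be checked directly on a node. The command below is only a sketch; it assumes the kubelet configuration file referenced by the --config flag shown later in this bug and a systemReserved field inside it, whose exact layout may differ between releases.

sh-4.4# grep -i -A 3 systemreserved /etc/kubernetes/kubelet.conf    # from an oc debug pod, read /host/etc/kubernetes/kubelet.conf instead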



The scale test executed was node-density-heavy on a 25-worker-node cluster; more information about this test can be found at https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#node-density-and-node-density-heavy-variables



After some investigation I found the following:

The CRI-O RSS reported on one of the worker nodes was 982484 KiB (roughly 959 MiB):

sh-4.4# ps -p 1392 -o pid,rss,vsz,cmd 
    PID   RSS    VSZ CMD
   1392 982484 8128132 /usr/bin/crio


# Analyze heap pprof

sh-4.4# curl --unix-socket /var/run/crio/crio.sock http://localhost/debug/pprof/heap > heap

sh-4.4# go tool pprof heap 
File: crio
Build ID: a31a3de0d4273355834c5a76b6d82a1f63f20d2e
Type: inuse_space
Time: Jul 27, 2021 at 2:55pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 279.99MB, 91.79% of 305.02MB total
Dropped 126 nodes (cum <= 1.53MB)
Showing top 10 nodes out of 91
      flat  flat%   sum%        cum   cum%
  158.51MB 51.97% 51.97%   158.51MB 51.97%  fmt.Sprintf
   60.71MB 19.90% 71.87%    60.71MB 19.90%  github.com/cri-o/cri-o/vendor/github.com/opencontainers/runtime-tools/generate.(*Generator).addEnv
      12MB  3.93% 75.80%       12MB  3.93%  github.com/cri-o/cri-o/vendor/github.com/rjeczalik/notify.watchpoint.Add
   11.50MB  3.77% 79.57%    11.50MB  3.77%  github.com/cri-o/cri-o/vendor/github.com/rjeczalik/notify.newnode (inline)
    8.14MB  2.67% 82.24%    19.64MB  6.44%  github.com/cri-o/cri-o/vendor/github.com/rjeczalik/notify.node.addchild
    7.50MB  2.46% 84.70%     7.50MB  2.46%  strings.(*Builder).grow
    6.50MB  2.13% 86.83%       13MB  4.26%  github.com/cri-o/cri-o/vendor/github.com/tchap/go-patricia/patricia.NewTrie
    6.50MB  2.13% 88.97%     6.50MB  2.13%  github.com/cri-o/cri-o/vendor/github.com/tchap/go-patricia/patricia.newSparseChildList (inline)
    4.50MB  1.48% 90.44%     4.50MB  1.48%  path.Join
    4.13MB  1.35% 91.79%     4.13MB  1.35%  github.com/cri-o/cri-o/vendor/github.com/rjeczalik/notify.(*nonrecursiveTree).walkWatchpoint
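
Side note (not captured in the original session): because fmt.Sprintf appears here as a leaf (flat is roughly equal to cum), the top listing does not show which callers account for those allocations. They could be inspected from the same interactive pprof session with the peek command, for example:

(pprof) peek fmt.Sprintf    # prints the callers/callees of fmt.Sprintf together with their share of the allocations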


According to the heap profile above, the in-use heap memory of CRI-O was 305.02 MiB, yet the OS reports an RSS of more than 959 MiB. It seems this difference is not being released back to the OS. (The same behaviour also occurs in the kubelet.)
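
A possible follow-up check (a sketch, not something that was run as part of this report): when the heap profile endpoint is queried with ?debug=1, the Go runtime appends a runtime.MemStats summary to the output, which makes it possible to see how much of the gap is idle heap the runtime has not released versus memory already returned to the OS:

sh-4.4# curl -s --unix-socket /var/run/crio/crio.sock 'http://localhost/debug/pprof/heap?debug=1' | grep -E '^# (HeapInuse|HeapIdle|HeapReleased|HeapSys|Sys) '

A large HeapIdle together with a small HeapReleased would point at heap pages the runtime is still holding on to; HeapIdle close to HeapReleased would instead mean the memory has already been returned and the remaining RSS comes from non-heap sources (goroutine stacks, runtime structures, etc.).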

The environment variables this CRI-O process is running with are the following:

sh-4.4# cat /proc/1392/environ | tr '\0' '\n'
GODEBUG=x509ignoreCN=0,madvdontneed=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
NOTIFY_SOCKET=/run/systemd/notify
INVOCATION_ID=86d42f7cd2d14bcd890927d2ce43f977
JOURNAL_STREAM=9:32032
GOTRACEBACK=crash
ENABLE_PROFILE_UNIX_SOCKET=true
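
For context (explanation added here, not in the original comment): GODEBUG=madvdontneed=1 makes the Go runtime return freed heap pages with madvise(MADV_DONTNEED) instead of MADV_FREE, so released memory should leave RSS as soon as the scavenger runs rather than only when the kernel is under memory pressure. Assuming a kernel that provides /proc/<pid>/smaps_rollup (RHEL 8 does), one way to confirm that the reported RSS is genuinely resident rather than lazily-freed pages would be:

sh-4.4# grep -E '^(Rss|Anonymous|LazyFree):' /proc/1392/smaps_rollup

With MADV_DONTNEED in effect, LazyFree should stay close to zero.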

Comment 1 Peter Hunt 2021-07-28 13:29:56 UTC
I have some suspicions about where this regression came from, but it would be helpful to confirm. Can you try this test with the latest 4.8 nightly (4.8.2 should work as well)?

Comment 6 Peter Hunt 2021-08-25 14:15:14 UTC
I believe the CRI-O portion of this bug will be addressed with the attached PR. I will need help verifying that, though.

Comment 9 MinLi 2021-09-02 10:10:32 UTC
Hi, Raul Sevilla
Can you help check the data from a 4.9 nightly build from the last two days, such as 4.9.0-0.nightly-2021-09-01-193941 or 4.9.0-0.nightly-2021-08-31-171539?
Thanks

Comment 10 Sunil Choudhary 2021-09-02 10:42:05 UTC
I did a comparison between the 4.8 and 4.9 nightlies. Memory usage is in a similar range.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-09-01-001821   True        False         63m     Cluster version is 4.8.0-0.nightly-2021-09-01-001821

$ oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-10-0-130-99.ap-south-1.compute.internal    Ready    worker   86m   v1.21.1+9807387
ip-10-0-135-192.ap-south-1.compute.internal   Ready    master   96m   v1.21.1+9807387
ip-10-0-170-110.ap-south-1.compute.internal   Ready    master   96m   v1.21.1+9807387
ip-10-0-180-143.ap-south-1.compute.internal   Ready    worker   87m   v1.21.1+9807387
ip-10-0-218-23.ap-south-1.compute.internal    Ready    master   96m   v1.21.1+9807387
ip-10-0-219-43.ap-south-1.compute.internal    Ready    worker   87m   v1.21.1+9807387

$ oc debug node/ip-10-0-130-99.ap-south-1.compute.internal
Starting pod/ip-10-0-130-99ap-south-1computeinternal-debug ...
...

sh-4.4# ps -p 1317 -o pid,rss,vsz,cmd
    PID   RSS    VSZ CMD
   1317 110240 2023968 /usr/bin/crio

sh-4.4# ps -p 1354 -o pid,rss,vsz,cmd
    PID   RSS    VSZ CMD
   1354 160648 1940516 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --co


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-01-193941   True        False         125m    Cluster version is 4.9.0-0.nightly-2021-09-01-193941

$ oc get nodes
NAME                                          STATUS   ROLES    AGE    VERSION
ip-10-0-134-113.ap-south-1.compute.internal   Ready    worker   157m   v1.22.0-rc.0+bbcc9ae
ip-10-0-136-105.ap-south-1.compute.internal   Ready    master   168m   v1.22.0-rc.0+bbcc9ae
ip-10-0-170-97.ap-south-1.compute.internal    Ready    master   168m   v1.22.0-rc.0+bbcc9ae
ip-10-0-183-52.ap-south-1.compute.internal    Ready    worker   157m   v1.22.0-rc.0+bbcc9ae
ip-10-0-193-186.ap-south-1.compute.internal   Ready    worker   156m   v1.22.0-rc.0+bbcc9ae
ip-10-0-201-41.ap-south-1.compute.internal    Ready    master   168m   v1.22.0-rc.0+bbcc9ae

$ oc debug node/ip-10-0-134-113.ap-south-1.compute.internal
Starting pod/ip-10-0-134-113ap-south-1computeinternal-debug ...
...

sh-4.4# ps -p 1295 -o pid,rss,vsz,cmd 
    PID   RSS    VSZ CMD
   1295 111392 2028600 /usr/bin/crio

sh-4.4# ps -p 1324 -o pid,rss,vsz,cmd 
    PID   RSS    VSZ CMD
   1324 210332 2057152 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfi

Comment 11 Raul Sevilla 2021-09-03 12:00:37 UTC
(In reply to Peter Hunt from comment #6)
> I believe the CRI-O portion of this bug will be addressed with the attached
> PR. I will need help verifying that, though.

I have done some scale tests with a cluster using the PR you mention and have seen some improvement (the top CRI-O consumer was at 1.132 GiB), but it is still much higher than what was reported in 4.8.

On the other hand, while investigating this issue I realized that we made some changes to our workload (node-density-heavy) on 14 April. Those changes consisted of adding an extra liveness probe to some of the pods deployed by this workload, which explains the increase in kubelet resource usage: https://github.com/cloud-bulldozer/benchmark-operator/commit/550ac7ec4107181f951cb31aa6a0a0138a884f78. I have run some tests without this code change and the kubelet resident set size is similar to the 4.8 results.

I'm still investigating at which point (release) this RSS increase in CRI-O happened.

Comment 14 errata-xmlrpc 2021-10-18 17:42:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 15 Peter Hunt 2021-11-02 17:30:06 UTC
*** Bug 2014432 has been marked as a duplicate of this bug. ***