Bug 1855067 - nproc * 1/2 MiB of each container memory may be taken by RHEL8 kernel slabs
Summary: nproc * 1/2 MiB of each container memory may be taken by RHEL8 kernel slabs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Michael Burke
QA Contact: Weinan Liu
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1862877 1892720
 
Reported: 2020-07-08 20:11 UTC by jteagno
Modified: 2023-12-15 18:25 UTC (History)
CC List: 31 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
The memory consumption of a container on nodes with many CPUs can appear higher than expected. This is because the current RHEL kernel creates per-cgroup kmemcaches for each memory cgroup, and all of these kmemcaches consume kernel memory. For performance, each of these kmemcaches also contains per-CPU and per-NUMA-node caches, so the amount of memory held in those caches is proportional to the number of CPUs in the system: the more CPUs a system has, the more kernel memory is held in those caches. As a result, $(nproc) * 1/2 MiB of memory may be taken by kernel slabs, which can push containers over their memory limits and cause them to be killed. It is recommended that containers be given at least that much additional memory so they are not spuriously killed. This is not a new issue, but it was highlighted by the addition of CNV support, because one of the CNV containers had a relatively low memory request compared with the equation above. Increasing the memory fixed the issue.
Clone Of:
: 1862877 1892720 (view as bug list)
Environment:
Last Closed: 2020-12-03 03:28:40 UTC
Target Upstream Version:
Embargoed:
mokumar: needinfo-


Attachments
analysis of vmcore in comment #106 (83.91 KB, text/plain)
2020-09-10 16:20 UTC, Andrea Arcangeli

Description jteagno 2020-07-08 20:11:19 UTC
On OpenShift 4.4.5 with openshift-cnv 2.3.0, the `volumecontainerbootdisk` container of a `virt-launcher-` pod is killed by the out-of-memory killer.

The node has sufficient allocatable memory; it appears that the `volumecontainerbootdisk` container is exceeding its memory limit of 40 MB.

Our understanding is that openshift-cnv is responsible for setting this memory limit.

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO-BAR -o json | jq '.status.containerStatuses[] | select(.name=="volumecontainerbootdisk") | .state'
{
  "terminated": {
    "containerID": "cri-o://ed5bb05fbc30992202322cab11ee936974d81b680388f6e5db16e5b944c60bad",
    "exitCode": 137,
    "finishedAt": "2020-07-06T16:23:09Z",
    "reason": "OOMKilled",
    "startedAt": "2020-06-29T18:28:50Z"
  }
}

$ oc get node/worker-11 -o json | jq -r .status.allocatable.memory
526669684Ki

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO-BAR -o json | jq '.spec.containers[] | select(.name=="volumecontainerbootdisk") | .resources'
{
  "limits": {
    "cpu": "100m",
    "memory": "40M"
  },
  "requests": {
    "cpu": "10m",
    "memory": "1M"
  }
}

Comment 1 jteagno 2020-07-08 20:16:52 UTC
confirmation of CNV version:

$ oc -n openshift-cnv get csv/kubevirt-hyperconverged-operator.v2.3.0
NAME                                      DISPLAY                           VERSION   REPLACES   PHASE
kubevirt-hyperconverged-operator.v2.3.0   Container-native virtualization   2.3.0                Succeeded

Comment 19 Fabian Deutsch 2020-07-14 12:19:27 UTC
Audrey, can we add a release note for this issue?

Roman, can you please provide the "Doc Text" field value?

Comment 22 jteagno 2020-07-16 13:48:47 UTC
Hello!  Below is a summary of the current state:

(1) All of the openshift-cnv container images in use have a "version" label of "v2.3.0".

(2) The affected RHCOS nodes have the following version and kernel:

$ cat /etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.4

$ uname -a
Linux worker-11 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(3) The memory map of the process inside the `volumecontainerdisk` container matches the expected value for openshift-cnv 2.3.0 -- total size 1116 KiB.

This "1116 KiB" size remains constant, even as the memory usage reported by `oc adm top poc XXX --containers` grows.

$ oc -n jteagno-bug-1855067 get pod virt-launcher-example-4b57p -o json | jq -r '.spec.nodeName, (.status.containerStatuses[] | select(.name=="volumecontainerdisk").containerID)'
worker-XX
cri-o://0c481cc95c61179637185c48214a90119845b3c15c0876737c2abad42fa6eee8

[core@worker-XX ~]$ sudo crictl inspect 0c481cc95c61179637185c48214a90119845b3c15c0876737c2abad42fa6eee8 | jq -r .pid
628151

[core@worker-XX ~]$ sudo pmap -x 628151
628151:   /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/cbb4104a-3cf9-4f5a-8a99-db59b1238b32/disk_0
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000     792     704       0 r-x-- container-disk
00000000006c6000      24      24      16 rw--- container-disk
00000000006cc000       4       4       4 rw---   [ anon ]
00000000015f8000     140       8       8 rw---   [ anon ]
00007fff669f7000     132      12      12 rw---   [ stack ]
00007fff66aea000      12       0       0 r----   [ anon ]
00007fff66aed000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 r-x--   [ anon ]
---------------- ------- ------- -------
total kB            1116     756      40

(4) After starting a new VirtualMachineInstance, the memory usage of the virt-launcher pod's `volumecontainerdisk` container grows from 7 MiB to 10 MiB over four minutes (as seen by `oc adm top pod XXX --containers`).
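
(A minimal polling loop along these lines, not part of the original report, captures the same kind of snapshots shown below; the 30-second interval is arbitrary.)

```
#!/bin/bash
# Illustrative only: poll per-container memory usage in the reproduction namespace.
while true; do
    date -u
    oc -n jteagno-bug-1855067 adm top pod --containers
    sleep 30
done
```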

### Metrics become available after one minute, and the `volumecontainerdisk` container consumes 7 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          64s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               1m           215Mi
virt-launcher-example-4b57p   volumecontainerdisk   1m           7Mi

### After 90 seconds, the memory usage is 9 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          96s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               2m           212Mi
virt-launcher-example-4b57p   volumecontainerdisk   2m           9Mi

### After almost 4 minutes, the memory usage is 10 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          3m49s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               6m           218Mi
virt-launcher-example-4b57p   volumecontainerdisk   5m           10Mi

Comment 24 Peter Hunt 2020-07-16 18:12:28 UTC
Is there a limit on both of the containers in the pod? What's the full pod spec (including the container definitions)?

Comment 25 jteagno 2020-07-16 18:42:08 UTC
Shown below are (a) the limits and (b) the memory usage from the systemd crio-FOO.scope unit.  These were captured on a virt-launcher pod that has not yet been OOMKilled:

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq '.spec.containers[] | select(.name=="volumecontainerbootdisk") | .resources'
{
  "limits": {
    "cpu": "100m",
    "memory": "40M"
  },
  "requests": {
    "cpu": "10m",
    "memory": "1M"
  }
}

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq '.spec.containers[] | select(.name=="compute") | .resources'
{
  "limits": {
    "cpu": "3",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "memory": "10413943040"
  },
  "requests": {
    "cpu": "2",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "memory": "2048M"
  }
}

$ oc -n MY-NAMESPACE adm top pod virt-launcher-FOO.BAR --containers
POD                     NAME                      CPU(cores)   MEMORY(bytes)   
virt-launcher-FOO.BAR   volumecontainerbootdisk   7m           27Mi            
virt-launcher-FOO.BAR   compute                   31m          1120Mi 

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq -r '.spec.nodeName, (.status.containerStatuses[] | select(.name=="volumecontainerbootdisk") | .containerID)'
worker-16
cri-o://8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae

[core@worker-16 ~]$ sudo systemctl status crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope
● crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope - libcontainer container 8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae
   Loaded: loaded (/run/systemd/transient/crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope; transient)
Transient: yes
   Active: active (running) since Mon 2020-06-29 18:28:41 UTC; 2 weeks 3 days ago
    Tasks: 1 (limit: 1024)
   Memory: 27.8M (limit: 38.1M)
      CPU: 3h 7min 52.786s
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod141b9f46_0c6c_43d0_96af_12159795acd8.slice/crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope
           └─119683 /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/41964a28-fefa-4d0d-9ea3-504ee40c7d1d/disk_0

Comment 28 Lavanya Mandavilli 2020-07-17 21:13:31 UTC
@jteagno, @fabiand - Requesting doc text for this defect so that I can add it to the 2.4 CNV release notes.

Comment 29 Roman Mohr 2020-07-21 08:24:03 UTC
Started a new run on different test pods to further isolate the issue:

```
$ oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
testpod-containerdisk-and-exec                           1/1     Running   0          10m # containerdisk including exec probe as used in CNV
testpod-containerdisk-and-exec-no-limit                  1/1     Running   0          10m # containerdisk with exec and no limit set
testpod-containerdisk-and-exec-with-sidecar              2/2     Running   0          10m # containerdisk with exec and a sidecar without limits
testpod-containerdisk-and-exec-with-sidecar-sleep-echo   2/2     Running   0          10m # like before but with sleep and echo commands to rule out the cnv binary
testpod-containerdisk-no-exec                            1/1     Running   0          10m # just the containerdisk binary without any exec probe
```

It will probably take at least until tomorrow before I can see trends, since on my cluster the memory consumption has been growing much more slowly than on the user's cluster.

Comment 30 Roman Mohr 2020-07-21 08:26:15 UTC
(In reply to Fabian Deutsch from comment #19)
> Audrey, can we add a release not for this issue?
> 
> ROman, can you please provide the "Doc Text" field value?

A reasonable doc text would look like this:

For an unknown reason, the memory consumption of containerDisks may slowly grow until it reaches its limit. It can take a few days or a few weeks before this issue is experienced. Restarting the VM resolves the issue.

Comment 31 Audrey Spaulding 2020-07-21 13:48:53 UTC
Lavanya is working on the release note.

Comment 32 Lavanya Mandavilli 2020-07-21 14:53:58 UTC
@rmohr, @fabiand - This PR is ready for review: https://github.com/openshift/openshift-docs/pull/23992
Please add your comments/edits.

Comment 33 Lavanya Mandavilli 2020-07-22 14:22:31 UTC
@rmohr - review input was incorporated. Please review and advise if it is LGTM and can move to peer review stage.

Comment 34 Lavanya Mandavilli 2020-07-22 14:54:37 UTC
SME review was completed. Moving the PR to peer review stage.

Comment 35 Fabian Deutsch 2020-07-23 13:53:04 UTC
Roman, what was the outcome of your investigation?

Comment 38 Lavanya Mandavilli 2020-07-23 14:54:46 UTC
@rmohr's review for the PR was completed with LGTM:
https://github.com/openshift/openshift-docs/pull/23992

With SME review completed, this PR is now in peer review stage.

@fabiand @rmohr - please move the defect to Verified stage. 
It is currently in ON-QA stage.

Comment 40 Lavanya Mandavilli 2020-07-23 16:22:53 UTC
Peer review was completed. @credi's input was incorporated. The PR was completed and merged:
https://github.com/openshift/openshift-docs/pull/23992

Comment 41 Roman Mohr 2020-07-27 09:09:15 UTC
(In reply to Roman Mohr from comment #29)
> Started a new run on different test pods to further isolate the issue:
> 
> ```
> $ oc get pods
> NAME                                                     READY   STATUS    RESTARTS   AGE
> testpod-containerdisk-and-exec                           1/1     Running   0          10m # containerdisk including exec probe as used in CNV
> testpod-containerdisk-and-exec-no-limit                  1/1     Running   0          10m # containerdisk with exec and no limit set
> testpod-containerdisk-and-exec-with-sidecar              2/2     Running   0          10m # containerdisk with exec and a sidecar without limits
> testpod-containerdisk-and-exec-with-sidecar-sleep-echo   2/2     Running   0          10m # like before but with sleep and echo commands to rule out the cnv binary
> testpod-containerdisk-no-exec                            1/1     Running   0          10m # just the containerdisk binary without any exec probe
> ```
> 
> It will probably take at least until tomorrow before I can see trends, since on my cluster the memory consumption has been growing much more slowly than on the user's cluster.

Sadly, the outcome of this was inconclusive. Tests are running right now on two nodes again. On both we have this set of pods running, as well as three VMIs with a containerdisk, to see if it makes any difference whether they run as a sidecar or as the main container.

Comment 42 Roman Mohr 2020-07-27 09:10:29 UTC
One interesting side note: Israel has an OpenShift 4.5 cluster that has been running for almost a month now. The VMIs with containerDisks there also don't show any symptoms.

Comment 43 Roman Mohr 2020-07-28 07:25:37 UTC
No difference after running the new extensive test suite on two nodes for 24 hours. The values stay consistently within the expected range on 4.4.12.

Comment 48 Peter Hunt 2020-07-29 15:42:51 UTC
for reference, here are the differences in the machine-os-content between 4.4.5 and 4.4.12

https://releases-rhcos-art.cloud.privileged.psi.redhat.com/diff.html?first_release=44.81.202005180831-0&first_stream=releases%2Frhcos-4.4&second_release=44.81.202007070223-0&second_stream=releases%2Frhcos-4.4

I am currently looking through cri-o changes to see if anything pops out

Comment 51 Eric Paris 2020-07-29 20:40:24 UTC
I was thinking of https://bugzilla.redhat.com/show_bug.cgi?id=1571183, https://bugzilla.redhat.com/show_bug.cgi?id=1779325, and some other openssl + liveness probe problems, where the real problem was kernel caching of negative dentries. So what are we doing in the liveness probe in question?

Comment 52 Eric Paris 2020-07-29 20:41:09 UTC
And https://bugzilla.redhat.com/show_bug.cgi?id=1699202

Comment 53 Peter Hunt 2020-07-29 21:18:06 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1699202#c75 shows a leak is fixed when files that are put in tmpfs's are removed (otherwise, they contribute to the cgroup's memory count). Does the probe possibly create files in a tmpfs that aren't cleaned up after execution?

FWIW: I've already checked, and cri-o does not tell conmon to put a file in the exec-dir, so I do not think the same solution for podman will work for cri-o

It's also worth noting that the conmon used for exec is not in the container's cgroup. If anything, it's in the pod's, but I believe it's actually underneath cri-o's (cri-o doesn't move it). If we're seeing a leak similar to the one in podman, I believe it would be coming from the probe, or `runc exec`

Comment 54 Roman Mohr 2020-07-29 21:29:30 UTC
(In reply to Peter Hunt from comment #53)
> https://bugzilla.redhat.com/show_bug.cgi?id=1699202#c75 shows a leak is
> fixed when files that are put in tmpfs's are removed (otherwise, they
> contribute to the cgroups memory count). Does the probe possibly create
> files in a tmpfs that aren't cleaned up after execution?

It only checks if a socket exists and exits afterwards: https://github.com/kubevirt/kubevirt/blob/6f3f5b489a72c12bef026d6c2c3c90fcc63edbe8/cmd/container-disk-v2alpha/main.c#L147

> 
> FWIW: I've already checked, and cri-o does not tell conmon to put a file in
> the exec-dir, so I do not think the same solution for podman will work for
> cri-o
> 
> It's also worth noting that the conmon used for exec is not in the
> container's cgroup. If anything, it's in the pod's, but I believe it's
> actually underneath cri-o's (cri-o doesn't move it). If we're seeing a leak
> similar to the one in podman, I believe it would be coming from the probe,
> or `runc exec`

Comment 55 Peter Hunt 2020-07-30 14:33:31 UTC
given that, it does seem quite improbable that it is the container-disk binary...

We can try two different scenarios (if given sufficient access to a reproducing node):

1: we can try with crun instead of runc
2: we can try with a cri-o that runs the exec sync entirely internally (rather than exec'ing to conmon)

Comment 56 Peter Hunt 2020-07-30 14:35:02 UTC
It is unlikely we will find the culprit this sprint, but the search continues!

Comment 57 Peter Hunt 2020-08-04 16:41:47 UTC
I have been working on a script[1] that measures the rss usage in crio when running a lot of execs.

The difficulty is that memory.usage_in_bytes is the sum of page cache + tmpfs usage + rss. I had previously seen the memory usage grow over time (by checking the Memory: field in `systemctl status crio-$ctr_id.scope`), but the rss seems to be fairly stable and cleans up after the execs run.

The point of this test is to see which component of usage_in_bytes is growing over time, which will help us figure out if/how we can mitigate it.

[1] https://github.com/haircommander/cri-o/tree/debug-exec-leak
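
(For reference, a hedged sketch of the kind of breakdown the script is after, using the same cgroup v1 files; $ctr_id is assumed to be set, and memory.kmem.usage_in_bytes is only meaningful when kernel-memory accounting is enabled:)

```
# Sketch only: split usage_in_bytes into its main components for one container.
cg=/sys/fs/cgroup/memory$(systemctl status crio-$ctr_id.scope | grep CGroup | awk '{ print $2 }')
cat $cg/memory.usage_in_bytes                                    # page cache + tmpfs + rss (+ kmem when accounted)
grep -E '^(total_cache|total_rss|total_shmem) ' $cg/memory.stat  # page cache, anonymous memory, tmpfs
cat $cg/memory.kmem.usage_in_bytes 2>/dev/null                   # kernel memory charged to the cgroup
```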

Comment 58 Roman Mohr 2020-08-06 15:50:28 UTC
In addition to providing a real fix, I created https://github.com/kubevirt/kubevirt/pull/3946 for KubeVirt, which will remove the usage of exec probes in the whole project. This should resolve the situation for KubeVirt, but I hope that the root cause can be identified and solved in OpenShift. I decided to go for that since it does not only work around this issue but also helps KubeVirt in general in a few edge cases.

Comment 60 Peter Hunt 2020-08-07 20:50:55 UTC
so Joe and I hopped onto a customer's node and did some playing around.
Specifically we were running a dummy pod that just sleeps, and running a bunch of execs against it, then getting some metrics:
We were using this script to do this:
```
#!/bin/bash
# $ctr_id must be set to the target container ID before running.
slice=/sys/fs/cgroup/memory$(systemctl status crio-$ctr_id.scope | grep CGroup | awk '{ print $2 }')
echo $slice >&3   # fd 3 is assumed to be opened by the caller for logging
attempt=0
sleep 2s
while [ $attempt -le 100 ]; do
	attempt=$((attempt+1))
	crictl exec --sync $ctr_id ls > /dev/null
	echo "rss " $(cat $slice/memory.stat | grep 'total_rss ' | awk '{ print $2/1024/1024 " MB" }')
	echo "usage " $(cat $slice/memory.usage_in_bytes | awk '{ print $1/1024/1024 " MB" }')
done
```
What we found is the rss stayed pretty consistent (around 1MB), and the "usage" increased until it asymptotically approached 36 MB.

Note: AFAIU usage_in_bytes = rss + page cache

It seems the cgroup's page cache grows to 36 MB, and then pages are evicted once the cgroup gets close to going over its limit.

Why 36 MB?

The kernel has a tuning knob, `/proc/sys/vm/dirty_background_ratio`: the amount of memory pressure (as a percentage of the limit) up to which the kernel will keep dirty pages in the cache before evicting them.

On the nodes we were looking at it was 10, and (1 - .10) * 40 = 36
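
(Spelled out as a hedged calculation, under the working theory above that the ratio is applied against the container's memory limit:)

```
# Sketch: where background writeback would kick in for a 40 MB container limit.
ratio=$(cat /proc/sys/vm/dirty_background_ratio)   # 10 on the nodes in question
limit=40000000                                     # container memory limit in bytes (40 MB)
echo $(( limit * (100 - ratio) / 100 ))            # => 36000000, i.e. ~36 MB
```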

In other words, some process is keeping these dirty pages around until the kernel tries to evict them.

I am not totally sure what, but I'm suspicious of it being runc (or really the Go runtime running runc).

Some solutions would be:
- dropping the execs (already proposed)
- increasing the memory amount (so that a spike in the container memory doesn't mean the pod is above memory limit before the kernel can clear the dirty pages)
- tuning the go runtime or runc to relinquish its memory quicker

I will do some investigation on the third option, but I believe the first is a perfectly valid solution (as proposed in https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c54)

(Note: I am not a kernel engineer, nor that experienced with the minutiae of kernel memory usage. Please correct me if I'm wrong and take the above information with a grain of salt.)

Comment 63 Dan Kenigsberg 2020-08-11 06:42:28 UTC
> I will do some investigation on the third option, but I believe the first is a perfectly valid solution (as proposed in https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c54)

Thank you Peter. I believe that Roman is working on that, too. But as long as exec probes are supported in k8s/ocp, their causing a (temporary?) leak should probably be solved in OCP.

Comment 64 Roman Mohr 2020-08-11 07:10:00 UTC
(In reply to Dan Kenigsberg from comment #63)
> > I will do some investigation on the third option, but I believe the first is a perfectly valid solution (as proposed in https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c54)
> 
> Thank you Peter. I believe that Roman is working on that, too. But as long
> as exec probes are supported in k8s/ocp, their causing a (temporary?) leak
> should probably be solved in OCP.

Yes I am backporting it for CNV 2.3.1 as we speak. It would be great to get the exec probes fixed independent of that.

Comment 65 Roman Mohr 2020-08-11 07:28:49 UTC
(In reply to Peter Hunt from comment #60)
> so Joe and I hopped onto a customer's node and did some playing around.
> Specifically we were running a dummy pod that just sleeps, and running a
> bunch of execs against it, then getting some metrics:
> We were using this script to do this:
> ```
> #!/bin/bash
> # $ctr_id must be set to the target container ID before running.
> slice=/sys/fs/cgroup/memory$(systemctl status crio-$ctr_id.scope | grep CGroup | awk '{ print $2 }')
> echo $slice >&3   # fd 3 is assumed to be opened by the caller for logging
> attempt=0
> sleep 2s
> while [ $attempt -le 100 ]; do
> 	attempt=$((attempt+1))
> 	crictl exec --sync $ctr_id ls > /dev/null
> 	echo "rss " $(cat $slice/memory.stat | grep 'total_rss ' | awk '{ print $2/1024/1024 " MB" }')
> 	echo "usage " $(cat $slice/memory.usage_in_bytes | awk '{ print $1/1024/1024 " MB" }')
> done
> ```
> What we found is the rss stayed pretty consistent (around 1MB), and the "usage" increased until it asymptotically approached 36 MB.
> 
> Note: AFAIU usage_in_bytes = rss + page cache
> 
> It seems the cgroup's page cache grows to 36 MB, and then pages are evicted once the cgroup gets close to going over its limit.
> 
> Why 36 MB?
> 
> The kernel has a tuning knob, `/proc/sys/vm/dirty_background_ratio`: the amount of memory pressure (as a percentage of the limit) up to which the kernel will keep dirty pages in the cache before evicting them.
> 
> On the nodes we were looking at it was 10, and (1 - .10) * 40 = 36

Would it help if this value is changed to e.g. 8? Would it hurt performance badly?


Comment 66 Peter Hunt 2020-08-11 13:39:05 UTC
> Thank you Peter. I believe that Roman is working on that, too. But as long as exec probes are supported in k8s/ocp, their causing a (temporary?) leak should probably be solved in OCP.

I would not really call it a leak. It seems to be acting as expected (the process keeps the page cache full so it can run faster, and the kernel evicts pages if that cache gets too full). The confusing part is why it eventually ends up with an OOM kill without the kernel first paging out the cache.

> Would it help if this value is changed to e.g. 8? Would it hurt performance badly?

Decreasing the number will make the kernel less aggressive during memory pressure; I believe that will make the problem worse. I would recommend *trying* to increase it to see if that mitigates the problem. I don't know if we can tweak it at a finer grain than the whole system, so it would impact performance for the whole system, but it would also give an indication as to whether the 4 MB buffer left (40 - 36 by default) is not enough for the container workload itself.
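
(A hedged sketch of such a trial on a single node; the setting is node-wide, only lasts until reboot, and is meant as a diagnostic rather than a recommendation:)

```
# Check and temporarily raise the ratio on one node to test the theory.
sysctl vm.dirty_background_ratio                # current value (10 here)
sudo sysctl -w vm.dirty_background_ratio=20     # trial value; revert after testing
```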

Unfortunately, I'll be going on PTO this afternoon for the rest of the week. If the issue is getting escalated, further investigation here may need to be delegated to someone else (cc Seth about this)

Comment 111 Andrea Arcangeli 2020-09-10 16:20:29 UTC
Created attachment 1714454 [details]
analysis of vmcore in comment #106

Comment 113 Dan Kenigsberg 2020-09-11 20:08:05 UTC
The CNV team already dropped exec probes with bug 1867904.

Andrea, can you summarize your analysis in a formula? Given a system with X CPUs, what is the minimum memory limit a container needs in order to survive for a long period of time?

Peter, at the very least this formula has to be documented; could we even block creation of containers which would ultimately be killed by the system running them?

Comment 117 Dan Kenigsberg 2020-09-12 08:09:16 UTC
I cannot pretend to understand the entire set of constraints, and neither would our users. I'd like to find a fuzzy statement that is still useful. One option is for OCP to ask each node to report

$[`tail -n +3 /proc/slabinfo |wc -l` * `nproc` * 32*1024] bytes

and report the maximum of this across all nodes. If a container memory limit is lower than this number, its entire memory allocation may eventually be taken by the kernel and it would be OOM-killed; so OCP should either block smaller containers or warn about them.
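
(The same estimate, spelled out as a hedged snippet that could be run on a node; reading /proc/slabinfo requires root:)

```
# Rough per-node estimate of the kernel-slab overhead a single container may accrue.
nr_caches=$(sudo tail -n +3 /proc/slabinfo | wc -l)   # number of slab caches
echo "$(( nr_caches * $(nproc) * 32 * 1024 )) bytes"
```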

Andrea, would it be reasonably correct to document the situation as:
"""
The kernel scheduler moves containers between CPU threads. For each CPU thread used, approximately 1/2 MiB may be allocated for kernel slabs. If you do not account for this N * 1/2 MiB overhead, where N is the number of CPU threads on your biggest node, your container may eventually be killed by the system due to an Out-Of-Memory error.
"""

Comment 119 Dan Kenigsberg 2020-09-15 04:39:56 UTC
Thanks, Andrea. I've changed the bz summary accordingly.

> I was not shy in suggesting that what you want is the behavior of cgroup.memory=nokmem, because this behavior is too much VM-like and simply not suitable for high density; regardless of the actual limit being set low or high, it still costs a lot of RAM that would be saved with cgroup.memory=nokmem.

I suppose that this is a recommendation for RHCOS. Would you spell out the upsides and downsides of this kernel command-line parameter, so that an educated decision can be made by OCP?
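
(For reference, a hedged way to check whether a node was booted with that parameter:)

```
# Is a cgroup.memory= option (e.g. nokmem) present on the kernel command line?
grep -o 'cgroup\.memory=[^ ]*' /proc/cmdline || echo "no cgroup.memory= option set"
```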

Comment 131 Peter Hunt 2020-10-29 14:03:39 UTC
I have updated the documentation field. I talked with @erich, and we decided it would be best to clone this to RHEL to track the fix in 8.4, as well as move this to a documentation bug to tell users how to mitigate the problem.

Comment 137 Michael Burke 2020-11-11 20:42:06 UTC
Is the issue at the core of this BZ a bug or expected behavior? I haven't read all 100+ comments, but it sounds like there is a RHEL bug? If so, we should add the information to the OCP release notes, rather than the proper docs.
Does this issue affect OCP 4.4+?

Comment 138 Waiman Long 2020-11-12 15:36:50 UTC
(In reply to Michael Burke from comment #137)
> Is the issue at the core of this BZ a bug or expected behavior? I haven't read all 100+ comments, but it sounds like there is a RHEL bug? If so, we should add the information to the OCP release notes, rather than the proper docs.
> Does this issue affect OCP 4.4+?

The issue is expected behavior, caused by the per-CPU caches in the default SLUB memory allocator (which speed up memory allocation requests) and the use of per-memory-cgroup kmemcaches for tracking purposes in upstream kernels up to v5.8. The v5.9 kernel changes this behavior and eliminates the per-memory-cgroup kmemcaches, which greatly reduces the additional memory overhead of each additional container. That change will be merged into the upcoming RHEL 8.4 kernel, which will then be picked up by RHCOS and a future release of OCP. There are workarounds, though, that can be used to reduce the memory overhead with the existing kernels at the expense of a small amount of performance.

-Longman
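
(A hedged sketch of how to observe this overhead on an existing RHEL 8 / cgroup v1 node; the cgroup path below is an example and must be filled in for a real container:)

```
# Kernel memory charged to one container's memory cgroup (path is illustrative).
cg=/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/<pod-slice>/crio-<ctr-id>.scope
cat $cg/memory.usage_in_bytes        # total memory charged to the container
cat $cg/memory.kmem.usage_in_bytes   # kernel-memory portion (per-cgroup slabs, stacks, ...)
sudo head $cg/memory.kmem.slabinfo   # per-cgroup slab caches, where the kernel exposes this file
```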

Comment 144 Michael Burke 2020-11-17 17:00:43 UTC
Xiaoli -- Can you please assign this BZ to a QE resource? thank you very much in advance.

Michael

Comment 145 Weinan Liu 2020-12-02 07:36:45 UTC
LGTM
-Weinan Liu

