On OpenShift 4.4.5 with openshift-cnv 2.3.0, the `volumecontainerbootdisk` container of a `virt-launcher-` pod is killed by the out-of-memory killer. The node has sufficient allocatable memory; it appears that the `volumecontainerbootdisk` container is exceeding its memory limit of 40 MB. Our understanding is that openshift-cnv is responsible for setting this memory limit.

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO-BAR -o json | jq '.status.containerStatuses[] | select(.name=="volumecontainerbootdisk") | .state'
{
  "terminated": {
    "containerID": "cri-o://ed5bb05fbc30992202322cab11ee936974d81b680388f6e5db16e5b944c60bad",
    "exitCode": 137,
    "finishedAt": "2020-07-06T16:23:09Z",
    "reason": "OOMKilled",
    "startedAt": "2020-06-29T18:28:50Z"
  }
}

$ oc get node/worker-11 -o json | jq -r .status.allocatable.memory
526669684Ki

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO-BAR -o json | jq '.spec.containers[] | select(.name=="volumecontainerbootdisk") | .resources'
{
  "limits": {
    "cpu": "100m",
    "memory": "40M"
  },
  "requests": {
    "cpu": "10m",
    "memory": "1M"
  }
}
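In case it is useful for spotting other affected VMs, here is a small sketch of the check we run across all virt-launcher pods; the namespace is a placeholder and the container-name pattern assumes the `volumecontainer...disk` naming shown above:

```
#!/bin/bash
# Sketch: list virt-launcher pods whose containerdisk container was OOMKilled.
# NAMESPACE is a placeholder; the name pattern assumes the naming shown above.
NAMESPACE=MY-NAMESPACE
for pod in $(oc -n "$NAMESPACE" get pods -o name | grep virt-launcher); do
  oc -n "$NAMESPACE" get "$pod" -o json \
    | jq -r --arg pod "$pod" '
        .status.containerStatuses[]
        | select(.name | test("volumecontainer.*disk"))
        | select(.state.terminated.reason == "OOMKilled")
        | "\($pod)  \(.name)  OOMKilled at \(.state.terminated.finishedAt)"'
done
```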
Confirmation of the CNV version:

$ oc -n openshift-cnv get csv/kubevirt-hyperconverged-operator.v2.3.0
NAME                                      DISPLAY                           VERSION   REPLACES   PHASE
kubevirt-hyperconverged-operator.v2.3.0   Container-native virtualization   2.3.0                Succeeded
The following look like some of the places where the 40 MB memory limit is hard-coded:

https://github.com/kubevirt/kubevirt/blob/release-0.26/pkg/container-disk/container-disk.go#L136
https://github.com/kubevirt/kubevirt/blob/release-0.26/pkg/virt-controller/services/template.go#L943
Audrey, can we add a release note for this issue?

Roman, can you please provide the "Doc Text" field value?
Hello! Below is a summary of the current state:

(1) All of the openshift-cnv container images in use have a "version" label of "v2.3.0".

(2) The affected RHCOS nodes have the following version and kernel:

$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.4

$ uname -a
Linux worker-11 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(3) The memory map of the process inside the `volumecontainerdisk` container matches the expected value for openshift-cnv 2.3.0 -- total size 1116 KiB. This "1116 KiB" size remains constant, even as the memory usage reported by `oc adm top pod XXX --containers` grows.

$ oc -n jteagno-bug-1855067 get pod virt-launcher-example-4b57p -o json | jq -r '.spec.nodeName, (.status.containerStatuses[] | select(.name=="volumecontainerdisk").containerID)'
worker-XX
cri-o://0c481cc95c61179637185c48214a90119845b3c15c0876737c2abad42fa6eee8

[core@worker-XX ~]$ sudo crictl inspect 0c481cc95c61179637185c48214a90119845b3c15c0876737c2abad42fa6eee8 | jq -r .pid
628151

[core@worker-XX ~]$ sudo pmap -x 628151
628151:   /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/cbb4104a-3cf9-4f5a-8a99-db59b1238b32/disk_0
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000     792     704       0 r-x-- container-disk
00000000006c6000      24      24      16 rw--- container-disk
00000000006cc000       4       4       4 rw--- [ anon ]
00000000015f8000     140       8       8 rw--- [ anon ]
00007fff669f7000     132      12      12 rw--- [ stack ]
00007fff66aea000      12       0       0 r---- [ anon ]
00007fff66aed000       8       4       0 r-x-- [ anon ]
ffffffffff600000       4       0       0 r-x-- [ anon ]
---------------- ------- ------- -------
total kB            1116     756      40

(4) After starting a new VirtualMachineInstance, the memory usage of the virt-launcher pod's `volumecontainerdisk` container grows from 7 MiB to 10 MiB over four minutes (as seen by `oc adm top pod XXX --containers`).

### Metrics become available after one minute, and the `volumecontainerdisk` container consumes 7 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          64s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               1m           215Mi
virt-launcher-example-4b57p   volumecontainerdisk   1m           7Mi

### After 90 seconds, the memory usage is 9 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          96s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               2m           212Mi
virt-launcher-example-4b57p   volumecontainerdisk   2m           9Mi

### After almost 4 minutes, the memory usage is 10 MiB

$ oc -n jteagno-bug-1855067 get pod
NAME                          READY   STATUS    RESTARTS   AGE
virt-launcher-example-4b57p   2/2     Running   0          3m49s

$ oc -n jteagno-bug-1855067 adm top pod --containers
POD                           NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-example-4b57p   compute               6m           218Mi
virt-launcher-example-4b57p   volumecontainerdisk   5m           10Mi
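To make the growth easier to see over longer windows, this is the kind of sampling loop I have been using (the namespace and pod name are the ones from this test; adjust as needed):

```
#!/bin/bash
# Sketch: sample the containerdisk memory usage every 30 s so that slow growth
# over hours or days is easy to spot. Namespace and pod name match the test above.
NAMESPACE=jteagno-bug-1855067
POD=virt-launcher-example-4b57p
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  oc -n "$NAMESPACE" adm top pod "$POD" --containers --no-headers \
    | awk '$2 == "volumecontainerdisk" { print $2, $4 }'
  sleep 30
done
```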
Is there a limit on both of the containers of the pod? What's the full pod spec (including the container definitions)?
Shown below are (a) the limits and (b) the memory usage from the systemd crio-FOO.scope unit. These were captured on a virt-launcher pod that has not yet been OOMKilled:

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq '.spec.containers[] | select(.name=="volumecontainerbootdisk") | .resources'
{
  "limits": {
    "cpu": "100m",
    "memory": "40M"
  },
  "requests": {
    "cpu": "10m",
    "memory": "1M"
  }
}

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq '.spec.containers[] | select(.name=="compute") | .resources'
{
  "limits": {
    "cpu": "3",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "memory": "10413943040"
  },
  "requests": {
    "cpu": "2",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "memory": "2048M"
  }
}

$ oc -n MY-NAMESPACE adm top pod virt-launcher-FOO.BAR --containers
POD                     NAME                      CPU(cores)   MEMORY(bytes)
virt-launcher-FOO.BAR   volumecontainerbootdisk   7m           27Mi
virt-launcher-FOO.BAR   compute                   31m          1120Mi

$ oc -n MY-NAMESPACE get pod virt-launcher-FOO.BAR -o json | jq -r '.spec.nodeName, (.status.containerStatuses[] | select(.name=="volumecontainerbootdisk") | .containerID)'
worker-16
cri-o://8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae

[core@worker-16 ~]$ sudo systemctl status crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope
● crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope - libcontainer container 8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae
   Loaded: loaded (/run/systemd/transient/crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope; transient)
Transient: yes
   Active: active (running) since Mon 2020-06-29 18:28:41 UTC; 2 weeks 3 days ago
    Tasks: 1 (limit: 1024)
   Memory: 27.8M (limit: 38.1M)
      CPU: 3h 7min 52.786s
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod141b9f46_0c6c_43d0_96af_12159795acd8.slice/crio-8907616dec77e6e50b571e87dbd6419f3a5ad21fff9ac95e87a65488e94aafae.scope
           └─119683 /usr/bin/container-disk --copy-path /var/run/kubevirt-ephemeral-disks/container-disk-data/41964a28-fefa-4d0d-9ea3-504ee40c7d1d/disk_0
@jteagno, @fabiand - Requesting doc text for this defect so that I can add it to the 2.4 CNV release notes.
Started a new run on different test pods to further isolate the issue:

```
$ oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
testpod-containerdisk-and-exec                           1/1     Running   0          10m   # containerdisk including an exec probe, like used in CNV
testpod-containerdisk-and-exec-no-limit                  1/1     Running   0          10m   # containerdisk with exec and no limit set
testpod-containerdisk-and-exec-with-sidecar              2/2     Running   0          10m   # containerdisk with exec and a sidecar without limits
testpod-containerdisk-and-exec-with-sidecar-sleep-echo   2/2     Running   0          10m   # like before but with sleep and echo commands, to rule out the CNV binary
testpod-containerdisk-no-exec                            1/1     Running   0          10m   # just the containerdisk binary without any exec probe
```

It will probably take at least until tomorrow before I can see trends, since on my cluster the memory consumption grows much more slowly than on the user's cluster.
(In reply to Fabian Deutsch from comment #19)
> Audrey, can we add a release note for this issue?
>
> Roman, can you please provide the "Doc Text" field value?

A reasonable doc text would look like this:

For an unknown reason, the memory consumption of containerDisks may slowly grow until it reaches its limit. It can take a few days or a few weeks until this issue is experienced. Restarting the VM resolves the issue.
Lavanya is working on the release note.
@rmohr, @fabiand - This PR is ready for review: https://github.com/openshift/openshift-docs/pull/23992 Please add your comments/edits.
@rmohr - Review input was incorporated. Please review and advise whether it is LGTM and can move to the peer review stage.
SME review was completed. Moving the PR to peer review stage.
Roman, what was the outcome of your investigation?
@rmohr's review for the PR was completed with LGTM: https://github.com/openshift/openshift-docs/pull/23992 With SME review completed, this PR is now in peer review stage. @fabiand @rmohr - please move the defect to Verified stage. It is currently in ON-QA stage.
Peer review was completed. @credi's input was incorporated. PR was completed and merged: https://github.com/openshift/openshift-docs/pull/23992
(In reply to Roman Mohr from comment #29)
> Started a new run on different test pods to further isolate the issue:
> [...]
> It will probably take at least until tomorrow before I can see trends, since
> on my cluster the memory consumption grows much more slowly than on the
> user's cluster.

The outcome of this was unfortunately not conclusive. Tests are running right now on two nodes again. On both we have this set of pods running, as well as three VMIs with a containerDisk, to see whether it makes any difference if they run as a sidecar or as the main container.
One interesting side note: Israel has had an OpenShift 4.5 cluster running for almost a month now. The VMIs with containerDisks there also don't show any symptoms.
No difference after running the new extensive test suite on two nodes for 24 hours. The values are consistently within the expected range on 4.4.12.
For reference, here are the differences in machine-os-content between 4.4.5 and 4.4.12:

https://releases-rhcos-art.cloud.privileged.psi.redhat.com/diff.html?first_release=44.81.202005180831-0&first_stream=releases%2Frhcos-4.4&second_release=44.81.202007070223-0&second_stream=releases%2Frhcos-4.4

I am currently looking through cri-o changes to see if anything pops out.
I was thinking of

https://bugzilla.redhat.com/show_bug.cgi?id=1571183
https://bugzilla.redhat.com/show_bug.cgi?id=1779325

and some other openssl + liveness probe problems, where the real problem was kernel caching of negative dentries. So what are we doing in the liveness probe in question?
And https://bugzilla.redhat.com/show_bug.cgi?id=1699202
https://bugzilla.redhat.com/show_bug.cgi?id=1699202#c75 shows a leak that is fixed when files that are put in tmpfs filesystems are removed (otherwise, they contribute to the cgroup's memory count). Does the probe possibly create files in a tmpfs that aren't cleaned up after execution?

FWIW: I've already checked, and cri-o does not tell conmon to put a file in the exec-dir, so I do not think the same solution for podman will work for cri-o.

It's also worth noting that the conmon used for exec is not in the container's cgroup. If anything, it's in the pod's, but I believe it's actually underneath cri-o's (cri-o doesn't move it). If we're seeing a leak similar to the one in podman, I believe it would be coming from the probe, or from `runc exec`.
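One way to double-check that on a node is to look at the cgroup of the exec conmon while a probe is in flight; the PID below is hypothetical:

```
# Sketch: confirm where the exec conmon is accounted. The PID is hypothetical;
# find it with e.g. `pgrep -fn conmon` while a probe exec is running.
CONMON_PID=123456
cat /proc/${CONMON_PID}/cgroup
# Compare with the container's own scope:
systemctl status crio-<container-id>.scope | grep CGroup
```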
(In reply to Peter Hunt from comment #53)
> https://bugzilla.redhat.com/show_bug.cgi?id=1699202#c75 shows a leak that is
> fixed when files that are put in tmpfs filesystems are removed (otherwise,
> they contribute to the cgroup's memory count). Does the probe possibly create
> files in a tmpfs that aren't cleaned up after execution?

It only checks if a socket exists and exits afterwards:

https://github.com/kubevirt/kubevirt/blob/6f3f5b489a72c12bef026d6c2c3c90fcc63edbe8/cmd/container-disk-v2alpha/main.c#L147
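In shell terms, the probe is roughly equivalent to the following; the socket path is a placeholder for whatever path is passed to the probe:

```
# Roughly what the container-disk probe in main.c does: succeed if the socket
# exists, fail otherwise. <socket-path> is a placeholder.
if [ -S "<socket-path>" ]; then
  exit 0
fi
exit 1
```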
Given that, it does seem quite improbable that it is the container-disk binary...

We can try two different scenarios (if given sufficient access to a reproducing node):
1. we can try with crun instead of runc
2. we can try with a cri-o that runs the exec sync entirely internally (rather than exec'ing to conmon)
It is unlikely we will find the culprit this sprint, but the search continues!
I have been working on a script[1] that measures the RSS usage in cri-o when running a lot of execs. The difficulty is that memory.usage_in_bytes is the sum of page cache + tmpfs usage + RSS. I had previously seen the memory usage grow over time (by checking the Memory: field in `systemctl status crio-$ctr_id.scope`), but the RSS seems to be fairly stable and cleans up after the execs run. The point of this test is to see which component of usage_in_bytes is growing over time, which will help us figure out if/how we can mitigate it.

[1] https://github.com/haircommander/cri-o/tree/debug-exec-leak
In addition to a real fix, I created https://github.com/kubevirt/kubevirt/pull/3946 for KubeVirt, which will remove the usage of exec probes in the whole project. This should resolve the situation for KubeVirt, but I hope that the root cause can be identified and solved in OpenShift. I decided to go for that since it not only works around this issue but also helps KubeVirt in general in a few edge cases.
So Joe and I hopped onto a customer's node and did some playing around. Specifically, we were running a dummy pod that just sleeps, running a bunch of execs against it, and then getting some metrics.

We were using this script to do this:

```
#!/bin/bash
slice=/sys/fs/cgroup/memory$(systemctl status crio-$ctr_id.scope | grep CGroup | awk '{ print $2 }')
echo $slice >&3
attempt=0
sleep 2s
while [ $attempt -le 100 ]; do
    attempt=$((attempt+1))
    crictl exec --sync $ctr_id ls > /dev/null
    echo "rss " $(cat $slice/memory.stat | grep 'total_rss ' | awk '{ print $2/1024/1024 " " MB}')
    echo "usage " $(cat $slice/memory.usage_in_bytes | awk '{ print $1/1024/1024 " " MB}')
done
```

What we found is that the RSS stayed pretty consistent (around 1 MB), and the "usage" increased until it asymptotically approached 36 MB.

Note: AFAIU, usage_in_bytes = rss + page cache.

It seems the cgroup's page cache grows until 36 MB, and then the pages are evicted once it gets close to going over.

Why 36 MB?

The kernel has a tuning knob, `/proc/sys/vm/dirty_background_ratio`: that's the amount of memory pressure (as a percentage of the limit) up to which the kernel will keep a dirty page in the cache before evicting it.

On the nodes we were looking at it was 10, and (1 - .10) * 40 = 36.

In other words, some process is keeping these dirty pages around until the kernel tries to evict them. I am not totally sure what, but I'm suspicious of it being runc (or really the Go runtime running runc).

Some solutions would be:
- dropping the execs (already proposed)
- increasing the memory limit (so that a spike in the container memory doesn't mean the pod is above the memory limit before the kernel can clear the dirty pages)
- tuning the Go runtime or runc to relinquish its memory quicker

I will do some investigation on the third option, but I believe the first is a perfectly valid solution (as proposed in https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c54).

(Note: I am not a kernel engineer, nor that experienced with the minutiae of kernel memory usage. Please correct me if I'm wrong, and take the above information with a grain of salt.)
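For reference, the arithmetic above spelled out as a tiny script; the 40 MB limit is an assumption, the ratio is read from the node:

```
#!/bin/bash
# Sketch of the arithmetic above: where background writeback kicks in for a
# cgroup with a 40 MB limit. The limit is assumed; the ratio comes from the node.
LIMIT_MB=40
RATIO=$(cat /proc/sys/vm/dirty_background_ratio)      # 10 on the nodes we checked
THRESHOLD_MB=$(( LIMIT_MB * (100 - RATIO) / 100 ))
echo "writeback threshold: ${THRESHOLD_MB} MB"        # 36 MB with a ratio of 10
```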
> I will do some investigation on the third option, but I believe the first is a perfectly valid solution (as proposed in https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c54)

Thank you Peter. I believe that Roman is working on that, too. But as long as exec probes are supported in k8s/ocp, their causing a (temporary?) leak should probably be solved in OCP.
(In reply to Dan Kenigsberg from comment #63)
> Thank you Peter. I believe that Roman is working on that, too. But as long
> as exec probes are supported in k8s/ocp, their causing a (temporary?) leak
> should probably be solved in OCP.

Yes, I am backporting it for CNV 2.3.1 as we speak. It would be great to get the exec probes fixed independent of that.
(In reply to Peter Hunt from comment #60)
> [...]
> Why 36 MB?
>
> The kernel has a tuning knob, `/proc/sys/vm/dirty_background_ratio`: that's
> the amount of memory pressure (as a percentage of the limit) up to which the
> kernel will keep a dirty page in the cache before evicting it.
>
> On the nodes we were looking at it was 10, and (1 - .10) * 40 = 36.

Would it help if this value is changed to e.g. 8? Would it hurt performance badly?
> Thank you Peter. I believe that Roman is working on that, too. But as long
> as exec probes are supported in k8s/ocp, their causing a (temporary?) leak
> should probably be solved in OCP.

I would not really call it a leak. It seems to be acting as expected (the process keeps the page cache full so it can run faster, and the kernel evicts if that cache gets too full). The confusing part is why it eventually ends up with an OOM without the kernel first paging out the cache.

> Would it help if this value is changed to e.g. 8? Would it hurt performance badly?

Decreasing the number will make the kernel less aggressive during memory pressure; I believe that will make the problem worse. I would recommend *trying* to increase it to see if that mitigates the problem. I don't know if we can tweak it at a finer grain than the whole system, so it would impact performance for the whole system, but it would also give an indication as to whether the 4 MB buffer left (40 - 36 by default) is not enough for the container workload itself.

Unfortunately, I'll be going on PTO this afternoon for the rest of the week. If the issue is getting escalated, further investigation here may need to be delegated to someone else (cc Seth about this).
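To make that experiment concrete, something like the following could be run on a reproducing node; the value 20 is purely illustrative, the change is node-wide, and it should be reverted after the test:

```
# Experiment sketch: raise vm.dirty_background_ratio node-wide and watch whether
# usage_in_bytes still creeps toward the limit. 20 is illustrative only.
sudo sysctl -w vm.dirty_background_ratio=20
# ... observe memory.usage_in_bytes for a while ...
# Revert to the previous value (10 on the nodes we checked):
sudo sysctl -w vm.dirty_background_ratio=10
```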
Created attachment 1714454 [details] analysis of vmcore in comment #106
The CNV team already dropped exec probes with bug 1867904.

Andrea, can you summarize your analysis in a formula? Given a system with X CPUs, what is the minimum memory limit with which a container can run for a long period of time?

Peter, at the very least this formula has to be documented; could we even block creation of containers which would ultimately be killed by the system running them?
I cannot pretend to understand all of the constraints, and neither would our users. I'd like to find a fuzzy statement that is still useful.

One option is for OCP to ask each node to report $[`tail -n +3 /proc/slabinfo | wc -l` * `nproc` * 32*1024] bytes and take the maximum of this value across all nodes. If a container memory limit is lower than this number, its entire memory allocation may eventually be taken by the kernel and it would be oom-killed; so OCP should either block smaller containers or warn about them.

Andrea, would it be reasonably correct to document the situation as:

"""
The kernel scheduler moves containers between CPU threads. For each CPU thread used, approximately 1/2 MiB may be allocated for kernel slabs. If you do not account for this N * 1/2 MiB overhead, where N is the number of CPU threads on your biggest node, your container may eventually be killed by the system due to an out-of-memory error.
"""
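Spelled out as a one-off node check, the formula above would look like this (reading /proc/slabinfo needs root):

```
#!/bin/bash
# Sketch of the formula above: (number of slab caches) * (CPU threads) * 32 KiB.
# /proc/slabinfo is only readable by root.
SLAB_CACHES=$(tail -n +3 /proc/slabinfo | wc -l)
CPUS=$(nproc)
echo "estimated minimum safe container memory limit: $(( SLAB_CACHES * CPUS * 32 * 1024 )) bytes"
```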
Thanks, Andrea. I've changed the BZ summary accordingly.

> I was not shy in suggesting what you want is the behavior of
> cgroup.memory=nokmem, because this behavior is too VM-like and it's simply
> not suitable for high density; regardless of the actual limit being set low
> or high, it still costs a lot of RAM that would be saved with
> cgroup.memory=nokmem.

I suppose that this is a recommendation for RHCOS. Would you lay out the up/down sides of this kernel command-line parameter, so that an educated decision can be made by OCP?
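If RHCOS/OCP did decide to recommend it, the kernel argument would presumably be rolled out with a MachineConfig along these lines. This is only a sketch: the name and role label are illustrative, applying it reboots the nodes, and the density/performance trade-off Andrea describes still needs that educated decision first:

```
# Sketch: roll out cgroup.memory=nokmem to worker nodes via a MachineConfig.
# Name and role label are illustrative; applying this triggers node reboots.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-nokmem
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - cgroup.memory=nokmem
EOF
```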
I have updated the documentation field. I talked with @erich, and we decided it would be best to clone this to RHEL to track the fix in 8.4, as well as move this to a documentation bug to tell users how to mitigate the problem.
Is the issue at the core of this BZ a bug or expected behavior? I haven't read all 100+ comments, but it sounds like there is a RHEL bug? If so, we should add the information to the OCP release notes, rather than the proper docs.

Does this issue affect OCP 4.4+?
(In reply to Michael Burke from comment #137)
> Is the issue at the core of this BZ a bug or expected behavior? I haven't
> read all 100+ comments, but it sounds like there is a RHEL bug? If so, we
> should add the information to the OCP release notes, rather than the proper
> docs.
>
> Does this issue affect OCP 4.4+?

The issue is expected behavior, due to the per-CPU caches in the default SLUB memory allocator (which speed up memory allocation requests) and the use of per-memory-cgroup kmemcaches for tracking purposes in upstream kernels up to v5.8. There is a change in behavior in the v5.9 kernel that eliminates the per-memory-cgroup kmemcaches and will greatly reduce the additional memory overhead of each additional container. That new feature will be merged into the upcoming RHEL 8.4 kernel, which will then be picked up by RHCOS and a future release of OCP. There are workarounds, though, that can be used to reduce the memory overhead with the existing kernels at the expense of a little bit of performance.

-Longman
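For anyone who wants to see that overhead directly on a node, the kernel-memory portion charged to a container can be read from its cgroup (cgroup v1 paths; the scope path is a placeholder taken from `systemctl status crio-<container-id>.scope`):

```
# Sketch: inspect how much of a container's memory charge is kernel memory
# (slab/kmemcache) under cgroup v1. The scope path below is a placeholder.
CG=/sys/fs/cgroup/memory/kubepods.slice/<pod-slice>/crio-<container-id>.scope
cat $CG/memory.kmem.usage_in_bytes
grep -E '^total_(rss|cache) ' $CG/memory.stat
```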
Xiaoli -- Can you please assign this BZ to a QE resource? Thank you very much in advance.

Michael
LGTM -Weinan Liu
Changes are live: https://docs.openshift.com/container-platform/4.6/nodes/containers/nodes-containers-using.html