Description of problem:
On some systems the containerDisk container gets OOMKilled quite regularly. The cause appears to be connected to golang's dynamic memory management and possibly factors such as the kernel version in use. The golang memory spikes are generally harmless and not large, but they are large enough to sometimes hit the 40M memory limit on the containerDisk container. We had a similar case six months ago on Azure with kubevirt 0.20, where people reported this issue; it was solved there by bumping the limit to 40MB, which is the current value.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
On some nodes, VMs get restarted frequently.

Expected results:
VMs should not be restarted.

Additional info:
It is possible to work around this by using a DataVolume and specifying the containerDisk image as its import source. The possible disadvantage is more storage and bandwidth consumption on the distributed storage for ephemeral data that the VMs normally do not have to keep across restarts. The storage usage can be reduced by attaching the resulting PVC as an ephemeral volume on the VM, so that one PVC can back multiple VMs, but this still puts more pressure on the distributed storage than the containerDisk does (a sketch of the workaround follows below). We are proposing https://github.com/kubevirt/kubevirt/pull/2844 to fix this in kubevirt; it essentially rewrites the containerDisk binary in C to provide stronger guarantees about memory consumption. We could also increase the memory limit, but that would have a significant impact on RAM usage.
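For reference, a minimal sketch of the DataVolume workaround, assuming a CDI installation; all names, the image URL, the apiVersion, and the storage size below are illustrative assumptions, not values taken from this report. The DataVolume imports the containerDisk image from a registry into a PVC:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: fedora-containerdisk-dv        # hypothetical name
spec:
  source:
    registry:
      # containerDisk image to import; illustrative URL
      url: "docker://quay.io/kubevirt/fedora-cloud-container-disk-demo"
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi                   # illustrative size

The resulting PVC can then be attached to the VM as an ephemeral volume, so one imported PVC can back multiple VMs and writes are discarded on restart (VM spec excerpt, same illustrative names):

spec:
  domain:
    devices:
      disks:
        - name: rootdisk
          disk:
            bus: virtio
  volumes:
    - name: rootdisk
      ephemeral:
        persistentVolumeClaim:
          claimName: fedora-containerdisk-dv   # PVC created by the DataVolume above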
As this is becoming important, I raised the Customer Escalation Flag.
Tested this; monitored for almost 72 hours and saw no memory increase with the containerDisk container.

[root@cnvqe-01 ~]# oc get nodes -o wide
NAME                          STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
master-0.testing.redhat.com   Ready    master   8d    v1.17.1   10.46.8.33    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
master-1.testing.redhat.com   Ready    master   8d    v1.17.1   10.46.8.36    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
master-2.testing.redhat.com   Ready    master   8d    v1.17.1   10.46.8.37    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
worker-0.testing.redhat.com   Ready    worker   8d    v1.17.1   10.46.8.38    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
worker-1.testing.redhat.com   Ready    worker   8d    v1.17.1   10.46.8.39    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8

[root@cnvqe-01 ~]# kubectl top pod --containers --namespace=default
POD                                 NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-rhel78-hpp-vm-zh2c9   compute               20m          863Mi
virt-launcher-rhel78-vm-rb9r8       compute               21m          884Mi
virt-launcher-vm-fedora-rmm4k       compute               23m          1107Mi
virt-launcher-vm-fedora-rmm4k       volumecontainerdisk   7m           22Mi
virt-launcher-vm-fedora1-8mxz6      volumecontainerdisk   7m           22Mi
virt-launcher-vm-fedora1-8mxz6      compute               22m          1106Mi

[root@cnvqe-01 ~]# oc get vmi
NAME            AGE     PHASE     IP                NODENAME
rhel78-hpp-vm   5d      Running   10.0.2.2/24       worker-0.testing.redhat.com
rhel78-vm       5d      Running   10.0.2.2/24       worker-0.testing.redhat.com
vm-fedora       5d18h   Running   10.128.2.102/23   worker-1.testing.redhat.com
vm-fedora1      5d18h   Running   10.128.2.101/23   worker-1.testing.redhat.com

[root@cnvqe-01 ~]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-rhel78-hpp-vm-zh2c9   1/1     Running   0          5d
virt-launcher-rhel78-vm-rb9r8       1/1     Running   0          5d
virt-launcher-vm-fedora-rmm4k       2/2     Running   0          5d18h
virt-launcher-vm-fedora1-8mxz6      2/2     Running   0          5d18h

Will be moving this to VERIFIED state now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.