Bug 1817057

Summary: ContainerDisk is sometimes OOMKilled on some systems
Product: Container Native Virtualization (CNV)
Reporter: Roman Mohr <rmohr>
Component: Virtualization
Assignee: Petr Kotas <pkotas>
Status: CLOSED ERRATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.2.0
CC: augol, cnv-qe-bugs, danken, dsafford, ibezukh, ipinto, kbidarka, ncredi, pelauter, pkotas, sgott
Target Milestone: ---
Flags: ncredi: needinfo+
Target Release: 2.3.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: hco-bundle-registry-container-v2.3.0-70 virt-operator-container-v2.3.0-36
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 19:10:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Roman Mohr 2020-03-25 13:44:17 UTC
Description of problem:

On some systems, the containerDisk container is OOMKilled quite regularly. The cause appears to be related to Go's dynamic memory management, and possibly to factors such as the kernel version in use. The Go memory spikes are generally harmless and small, but they are large enough to occasionally hit the 40 M memory limit on the containerDisk container.

About six months ago we had a case on Azure with KubeVirt 0.20 where users reported this issue. There it was solved by bumping the limit to 40MB.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

On some nodes, VMs are restarted frequently.

Expected results:

VMs should not be restarted.

Additional info:

A possible workaround is to use a DataVolume and specify the containerDisk as its import source. The drawback is higher storage and bandwidth consumption on the distributed storage for ephemeral data that the VMs normally do not need to keep across restarts. Storage usage can be reduced by attaching the resulting PVC to the VM as an ephemeral volume, so that one PVC can serve multiple VMs, but this still puts more pressure on the distributed storage than a containerDisk does.
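
For illustration, the workaround could look roughly like the sketch below. It builds the two objects as JSON, which `kubectl apply -f -` also accepts. This is only a sketch under assumptions: the image URL, object names, size, access mode and the `cdi.kubevirt.io/v1alpha1` API version are placeholders that are not taken from this report, and the helper functions are hypothetical. Check the field names against the CDI and KubeVirt versions actually installed.

```python
import json

# NOTE: every concrete value here (names, URL, size, API version, access mode)
# is a hypothetical placeholder, not something stated in this bug report.


def data_volume(name, image_url, size="10Gi",
                api_version="cdi.kubevirt.io/v1alpha1"):
    """Build a DataVolume that imports a containerDisk image from a registry."""
    return {
        "apiVersion": api_version,
        "kind": "DataVolume",
        "metadata": {"name": name},
        "spec": {
            # The containerDisk image is used as the import source.
            "source": {"registry": {"url": image_url}},
            "pvc": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": size}},
            },
        },
    }


def ephemeral_volume(volume_name, claim_name):
    """Build a VM volume entry that uses an existing PVC as an ephemeral volume."""
    return {
        "name": volume_name,
        "ephemeral": {"persistentVolumeClaim": {"claimName": claim_name}},
    }


if __name__ == "__main__":
    dv = data_volume(
        "fedora-base",
        "docker://registry.example.com/fedora-containerdisk:latest",
    )
    print(json.dumps(dv, indent=2))
    # The resulting PVC is then referenced from the VM spec.
    print(json.dumps(ephemeral_volume("rootdisk", "fedora-base"), indent=2))
```

The second object goes into the `volumes` list of the VM spec, so several VMs can point at the same claim.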

We are proposing https://github.com/kubevirt/kubevirt/pull/2844 to fix this in KubeVirt. It essentially rewrites the containerDisk binary in C to get stronger guarantees about its memory consumption. We could also increase the memory limit, but that would have a significant impact on RAM usage.

Comment 6 Dana Safford 2020-04-02 18:57:16 UTC
As this is becoming important, I raised the Customer Escalation Flag.

Comment 7 Kedar Bidarkar 2020-04-06 14:46:24 UTC

Tested this: monitored for almost 72 hours and saw no memory increase in the containerDisk container.

[root@cnvqe-01 ~]# oc get nodes -o wide 
NAME                                      STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
master-0.testing.redhat.com   Ready    master   8d    v1.17.1    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
master-1.testing.redhat.com   Ready    master   8d    v1.17.1    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
master-2.testing.redhat.com   Ready    master   8d    v1.17.1    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
worker-0.testing.redhat.com   Ready    worker   8d    v1.17.1    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
worker-1.testing.redhat.com   Ready    worker   8d    v1.17.1    <none>        Red Hat Enterprise Linux CoreOS 44.81.202003230949-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
[root@cnvqe-01 ~]# kubectl top pod --containers --namespace=default
POD                                 NAME                  CPU(cores)   MEMORY(bytes)   
virt-launcher-rhel78-hpp-vm-zh2c9   compute               20m          863Mi           
virt-launcher-rhel78-vm-rb9r8       compute               21m          884Mi           
virt-launcher-vm-fedora-rmm4k       compute               23m          1107Mi          
virt-launcher-vm-fedora-rmm4k       volumecontainerdisk   7m           22Mi            
virt-launcher-vm-fedora1-8mxz6      volumecontainerdisk   7m           22Mi            
virt-launcher-vm-fedora1-8mxz6      compute               22m          1106Mi          
[root@cnvqe-01 ~]# oc get vmi
NAME            AGE     PHASE     IP                NODENAME
rhel78-hpp-vm   5d      Running       worker-0.testing.redhat.com
rhel78-vm       5d      Running       worker-0.testing.redhat.com
vm-fedora       5d18h   Running   worker-1.testing.redhat.com
vm-fedora1      5d18h   Running   worker-1.testing.redhat.com
[root@cnvqe-01 ~]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-rhel78-hpp-vm-zh2c9   1/1     Running   0          5d
virt-launcher-rhel78-vm-rb9r8       1/1     Running   0          5d
virt-launcher-vm-fedora-rmm4k       2/2     Running   0          5d18h
virt-launcher-vm-fedora1-8mxz6      2/2     Running   0          5d18h

Will be moving this to VERIFIED state now.
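
The memory check in the output above could also be scripted for longer monitoring runs. The following is a minimal sketch, assuming the four-column layout of `kubectl top pod --containers` shown above and treating the 40 M limit from the description as 40 MiB. The function names are hypothetical, and the sample text is copied from the output in this comment.

```python
LIMIT_MIB = 40  # containerDisk limit from the description, assumed to mean MiB


def containerdisk_usage(top_output, container="volumecontainerdisk"):
    """Return {pod: MiB} for one container name from `kubectl top` text output."""
    usage = {}
    for line in top_output.splitlines():
        fields = line.split()
        # Expected columns: POD NAME CPU(cores) MEMORY(bytes).
        # The header row and other containers are skipped by the name check.
        if len(fields) != 4 or fields[1] != container:
            continue
        memory = fields[3]
        if memory.endswith("Mi"):
            usage[fields[0]] = int(memory[:-2])
    return usage


def headroom(usage, limit_mib=LIMIT_MIB):
    """Return {pod: MiB left before the limit is reached}."""
    return {pod: limit_mib - mib for pod, mib in usage.items()}


SAMPLE = """\
POD                                 NAME                  CPU(cores)   MEMORY(bytes)
virt-launcher-vm-fedora-rmm4k       compute               23m          1107Mi
virt-launcher-vm-fedora-rmm4k       volumecontainerdisk   7m           22Mi
virt-launcher-vm-fedora1-8mxz6      volumecontainerdisk   7m           22Mi
virt-launcher-vm-fedora1-8mxz6      compute               22m          1106Mi
"""

if __name__ == "__main__":
    for pod, free in headroom(containerdisk_usage(SAMPLE)).items():
        print(f"{pod}: {free} MiB below the limit")
```

Running this periodically and logging the smallest headroom would show whether the container ever gets close to the limit.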

Comment 10 errata-xmlrpc 2020-05-04 19:10:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 14 Red Hat Bugzilla 2024-01-06 04:28:41 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days