Bug 1958108

Summary: KubeMacPool fails to start due to OOM likely caused by a high number of Pods running in the cluster
Product: Container Native Virtualization (CNV)
Component: Networking
Version: 2.5.5
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Petr Horáček <phoracek>
Assignee: Ram Lavi <ralavi>
QA Contact: Ofir Nash <onash>
CC: alitke, cnv-qe-bugs, dvossel, fsilva, hhaberma, kshukla, maugarci, mtessun, myakove, nashok, onash, ralavi, vhernand
Fixed In Version: cluster-network-addons-operator-container-v4.8.0-19
Last Closed: 2021-07-27 14:32:06 UTC
Type: Bug
Bug Blocks: 1958816, 1958817

Description Petr Horáček 2021-05-07 08:21:06 UTC
Description of problem:
When KubeMacPool boots, it attempts to reconcile all already-allocated MAC addresses in the cluster. On a big cluster, this can lead to OOM.

This issue was originally raised in https://bugzilla.redhat.com/show_bug.cgi?id=1851829#c6. More information and captured artifacts can be found there.


Version-Release number of selected component (if applicable):
CNV 2.5.5


How reproducible:
Always in the customer's environment; so far we have failed to reproduce it locally.


Steps to Reproduce:
1. Have a cluster with thousands of Pods
2. ... (the step above alone is not enough, as we were not able to reproduce the issue locally)
3. Install OpenShift Virtualization
Actual results:
The KubeMacPool pod gets killed by kubelet due to OOM. This can be observed through `oc describe pod ...`.
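The OOM kill shows up in the pod's container status: kubelet records `OOMKilled` (exit code 137) as the reason in the container's last terminated state. As a minimal sketch of spotting this programmatically, assuming pod-status JSON as returned by `oc get pod <name> -o json` (the `oomKilled` helper is hypothetical, not part of KubeMacPool):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatus models the minimal subset of the Pod status schema we need;
// field names match the Kubernetes API (`status.containerStatuses`).
type podStatus struct {
	Status struct {
		ContainerStatuses []struct {
			Name      string `json:"name"`
			LastState struct {
				Terminated *struct {
					Reason   string `json:"reason"`
					ExitCode int    `json:"exitCode"`
				} `json:"terminated"`
			} `json:"lastState"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// oomKilled reports which containers in the pod JSON were last
// terminated with reason "OOMKilled".
func oomKilled(raw []byte) ([]string, error) {
	var p podStatus
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	var killed []string
	for _, cs := range p.Status.ContainerStatuses {
		if t := cs.LastState.Terminated; t != nil && t.Reason == "OOMKilled" {
			killed = append(killed, cs.Name)
		}
	}
	return killed, nil
}

func main() {
	// Example status as kubelet would report it after an OOM kill.
	sample := []byte(`{"status":{"containerStatuses":[
		{"name":"manager","lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}}`)
	names, err := oomKilled(sample)
	fmt.Println(names, err)
}
```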


Expected results:
KubeMacPool must not fail due to a high number of Pods. OpenShift Virtualization should install successfully and start running.


Additional info:
When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.

Comment 4 David Vossel 2021-05-10 18:28:22 UTC
> When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.

It's important that we remove (and not further introduce) memory limits on our control plane components. Let's only use memory requests.

Comment 5 Ram Lavi 2021-05-11 16:34:27 UTC
Hi David,
I understand your concern, but I think the solution should be both removing the limit and paginating the pod requests, to keep things working smoothly.
I will also run a memory investigation on KubeMacPool, to see whether we have more issues such as this.

Comment 18 Ofir Nash 2021-06-27 09:10:39 UTC
Verified on cluster-network-addons-operator version v4.8.0-23.

Scenario Checked:

1. Created 1000 basic VMs (https://github.com/kubevirt/kubevirt/blob/master/examples/vm-cirros.yaml).
2. Checked that the KubeMacPool pods were still running and did not crash or get killed over time.

(Attached the script used to create the VMs.)

Comment 28 errata-xmlrpc 2021-07-27 14:32:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920