Description of problem:
When KubeMacPool boots, it attempts to reconcile all already-allocated MAC addresses in the cluster. On a large cluster, this can lead to OOM. This issue was originally raised in https://bugzilla.redhat.com/show_bug.cgi?id=1851829#c6. Find more info and captured artifacts there.

Version-Release number of selected component (if applicable):
CNV 2.5.5

How reproducible:
Always on the customer's environment; so far we have failed to reproduce it locally.

Steps to Reproduce:
1. Have a cluster with thousands of Pods
2. ... the step above alone is not enough, as we were not able to reproduce it locally
3. Install OpenShift Virtualization

Actual results:
The KubeMacPool pod gets killed by kubelet due to OOM. This can be observed through `oc describe pod ...`.

Expected results:
KubeMacPool must not fail due to a high number of pods. OpenShift Virtualization should be successfully installed and start running.

Additional info:
When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.
> When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.

It's important that we remove (and not introduce any further) memory limits on our control plane components. Let's only use memory requests.
Hi David, I understand your concern, but I think the solution should be both removing the limit and paginating the pod list requests, to keep things working smoothly. I will also investigate KubeMacPool's memory usage to see whether there are other issues like this one.
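Paginating the pod list means the boot-time reconcile holds only one page of pods in memory at a time instead of the whole cluster's pod list. The Kubernetes List API supports this through the `limit` and `continue` parameters, and client-go also ships a `tools/pager` helper. The sketch below is a minimal, self-contained illustration of that loop, not KubeMacPool's actual fix: `Pod`, `PodPage`, `PodLister`, `reconcileMACs` and `fakeLister` are all hypothetical names, and the in-memory lister only mimics the limit/continue semantics so the example runs without a cluster.

```go
package main

import (
	"fmt"
	"strconv"
)

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod,
// keeping only what MAC reconciliation needs.
type Pod struct {
	Name string
	MACs []string
}

// PodPage is one page of results plus an opaque continue token.
type PodPage struct {
	Items    []Pod
	Continue string // empty when there are no more pages
}

// PodLister is a hypothetical interface mirroring the limit/continue
// semantics of a Kubernetes List call.
type PodLister interface {
	List(limit int, cont string) (PodPage, error)
}

// reconcileMACs walks all pods page by page, so at most pageSize pods are
// held at once, and records which pod owns every allocated MAC.
func reconcileMACs(lister PodLister, pageSize int, allocated map[string]string) error {
	cont := ""
	for {
		page, err := lister.List(pageSize, cont)
		if err != nil {
			return fmt.Errorf("listing pods: %w", err)
		}
		for _, pod := range page.Items {
			for _, mac := range pod.MACs {
				allocated[mac] = pod.Name
			}
		}
		if page.Continue == "" {
			return nil // last page reached
		}
		cont = page.Continue
	}
}

// fakeLister serves pods from a slice; the continue token is the next index.
type fakeLister struct {
	pods  []Pod
	calls int
}

func (f *fakeLister) List(limit int, cont string) (PodPage, error) {
	f.calls++
	start := 0
	if cont != "" {
		n, err := strconv.Atoi(cont)
		if err != nil {
			return PodPage{}, err
		}
		start = n
	}
	end := start + limit
	if end > len(f.pods) {
		end = len(f.pods)
	}
	page := PodPage{Items: f.pods[start:end]}
	if end < len(f.pods) {
		page.Continue = strconv.Itoa(end)
	}
	return page, nil
}

func main() {
	pods := make([]Pod, 0, 1000)
	for i := 0; i < 1000; i++ {
		mac := fmt.Sprintf("02:00:00:00:%02x:%02x", i/256, i%256)
		pods = append(pods, Pod{Name: fmt.Sprintf("pod-%d", i), MACs: []string{mac}})
	}
	lister := &fakeLister{pods: pods}
	allocated := map[string]string{}
	if err := reconcileMACs(lister, 250, allocated); err != nil {
		panic(err)
	}
	fmt.Println(len(allocated), lister.calls) // prints "1000 4"
}
```

The page size trades memory for API round trips: 1000 pods with a page size of 250 take four List calls. In a real controller the page size would be tuned to the cluster, and the set of allocated MACs itself still grows with the number of pods.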
Verified on version: cluster-network-addons-operator v4.8.0-23

Scenario checked:
1. Created 1000 basic VMs (https://github.com/kubevirt/kubevirt/blob/master/examples/vm-cirros.yaml).
2. Checked that the KubeMacPool pods are still running and did not crash or get killed after some time.

(The script used to create the VMs is attached.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920