Bug 1958108 - KubeMacPool fails to start due to OOM likely caused by a high number of Pods running in the cluster
Summary: KubeMacPool fails to start due to OOM likely caused by a high number of Pods running in the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.5.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ram Lavi
QA Contact: Ofir Nash
URL:
Whiteboard:
Depends On:
Blocks: 1958816 1958817
 
Reported: 2021-05-07 08:21 UTC by Petr Horáček
Modified: 2024-10-01 18:08 UTC
CC: 13 users

Fixed In Version: cluster-network-addons-operator-container-v4.8.0-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1958816 1958817
Environment:
Last Closed: 2021-07-27 14:32:06 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github k8snetworkplumbingwg kubemacpool pull 296 0 None closed initPodMap, Limit podList request to avoid memory max limit 2021-05-13 18:27:26 UTC
Github k8snetworkplumbingwg kubemacpool pull 297 0 None closed [release-v0.21] initPodMap, Limit podList request to avoid memory max limit 2021-05-13 18:28:11 UTC
Red Hat Issue Tracker CNV-11903 0 None None None 2022-12-14 02:32:44 UTC
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:33:20 UTC

Description Petr Horáček 2021-05-07 08:21:06 UTC
Description of problem:
When KubeMacPool boots, it attempts to reconcile all MAC addresses already allocated in the cluster. On a big cluster, this can lead to OOM.

This issue was originally raised on https://bugzilla.redhat.com/show_bug.cgi?id=1851829#c6. Find more info and captured artifacts there.


Version-Release number of selected component (if applicable):
CNV 2.5.5


How reproducible:
Always on the customer's environment; so far we have failed to reproduce it locally.


Steps to Reproduce:
1. Have a cluster with thousands of Pods
2. ... the step above alone is not enough, as we were not able to reproduce the issue locally
3. Install OpenShift Virtualization

Actual results:
The KubeMacPool pod gets killed by the kubelet due to OOM. This can be observed through `oc describe pod ...`.
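
If a programmatic check is preferred over `oc describe pod`, the same information is exposed in the pod's container statuses. A minimal sketch, assuming client-go; the namespace and label selector are illustrative placeholders, not the real KubeMacPool values:

package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportOOMKills prints every container whose last termination reason was an
// OOM kill, which is what gets recorded when a container is killed for
// exceeding its memory limit.
func reportOOMKills(ctx context.Context, c kubernetes.Interface, namespace, selector string) error {
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.LastTerminationState.Terminated
			if term != nil && term.Reason == "OOMKilled" {
				fmt.Printf("%s/%s container %s was OOM-killed at %s\n",
					pod.Namespace, pod.Name, cs.Name, term.FinishedAt)
			}
		}
	}
	return nil
}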


Expected results:
KubeMacPool must not fail due to a high number of pods. OpenShift Virtualization should be successfully installed and start running.


Additional info:
When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.

Comment 4 David Vossel 2021-05-10 18:28:22 UTC
> When the KubeMacPool pod's memory limit is removed (or raised), this issue does not occur.

It's important that we remove (and not further introduce) memory limits on our control plane components. Let's only use memory requests.
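
For illustration, a requests-only resource stanza built with k8s.io/api/core/v1 could look like the sketch below; the CPU and memory values are placeholders, not taken from the actual KubeMacPool deployment:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// controlPlaneResources returns a resources stanza with requests only: the
// scheduler still reserves capacity for the pod, but the kubelet has no
// memory ceiling at which to OOM-kill it.
func controlPlaneResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),  // placeholder value
			corev1.ResourceMemory: resource.MustParse("300Mi"), // placeholder value
		},
		// Limits deliberately left unset, per the recommendation above.
	}
}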

Comment 5 Ram Lavi 2021-05-11 16:34:27 UTC
Hi David,
I understand your concern, but I think the solution should be both removing the limit and paginating the pod requests, to keep things working smoothly.
I will also run a memory investigation on KubeMacPool to see whether we have more issues like this.
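
For reference, a minimal sketch of the limit-and-continue listing pattern the linked pull requests refer to, assuming client-go; this is not the code from the KubeMacPool PRs, only an illustration of how paginating the pod list keeps memory bounded:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forEachPod walks all pods in chunks of `limit`, so only one chunk is held
// in memory at a time, regardless of how many pods the cluster runs.
func forEachPod(ctx context.Context, c kubernetes.Interface, limit int64, fn func(corev1.Pod)) error {
	opts := metav1.ListOptions{Limit: limit}
	for {
		list, err := c.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return err
		}
		for _, pod := range list.Items {
			fn(pod) // e.g. record the pod's already-allocated MAC addresses in the pool
		}
		if list.Continue == "" {
			return nil
		}
		opts.Continue = list.Continue
	}
}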

Comment 18 Ofir Nash 2021-06-27 09:10:39 UTC
Verified on version: cluster-network-addons-operator v4.8.0-23

Scenario Checked:

1. Created 1000 basic VMs (https://github.com/kubevirt/kubevirt/blob/master/examples/vm-cirros.yaml).
2. Checked that the KubeMacPool pods were still running and did not crash or get killed after some time.

(The attached script was used to create the VMs.)

Comment 28 errata-xmlrpc 2021-07-27 14:32:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

