Bug 2002873

Summary: OOM kill loop freezing the openshift worker nodes
Product: OpenShift Container Platform
Reporter: Mohamed Tleilia <mtleilia>
Component: Node
Assignee: Peter Hunt <pehunt>
Sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED NOTABUG
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aivaras.laimikis, aos-bugs, joboyer, openshift-bugs-escalate, sdodson
Version: 4.6.z
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-24 20:29:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Peter Hunt 2021-09-14 15:00:42 UTC
tl;dr: I don't think 12 MB is enough memory for how OpenShift currently works:

We intentionally charge the pod cgroup for the infrastructure used to create the pod. It's already hard to estimate how much memory the system slice should have, and (at least by default) workloads should be charged for their own overhead. Because of this, we keep conmon in the pod cgroup, and I believe 12 MB is not enough for that cgroup.
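As a rough illustration, a minimal Go sketch like the one below (a hypothetical standalone helper, standard library only) prints which cgroup a given PID is charged to by reading /proc/<pid>/cgroup; run it against a conmon PID (e.g. from `pgrep conmon`) and, with the systemd cgroup manager, the conmon scope typically shows up under the pod's kubepods slice rather than system.slice. The exact scope naming may vary by cgroup manager and version.

// cgroup-of: print the cgroup membership of a process to confirm it is
// accounted to the pod cgroup rather than the system slice.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: cgroup-of <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// /proc/<pid>/cgroup lists one line per hierarchy, for example (naming
	// is an assumption and depends on the cgroup manager):
	// 0::/kubepods.slice/kubepods-burstable.slice/.../crio-conmon-<id>.scope
	data, err := os.ReadFile("/proc/" + pid + "/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read cgroup:", err)
		os.Exit(1)
	}

	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fmt.Println(line)
		if strings.Contains(line, "kubepods") {
			fmt.Println("-> process is accounted to the pod cgroup")
		}
	}
}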

I have reproduced the issue and found that runc needs about 16 MB to start a container (per process, more on that in a moment). At steady state that memory is no longer needed, but it is needed during container startup. When CRI-O starts a container, it starts a conmon process (which uses about 1.5 MB), which in turn starts a runc process. That runc process sets things up for the container and then forks to create a second runc process. That second runc process will eventually re-exec and become the container process, but before it does so it needs *another* 16 MB. There are three cases, depending on how much memory the pod is given (a rough sketch of this arithmetic follows the list below):
<16 MB: the pod doesn't start and the kernel also acts funky. That's because neither conmon nor runc process 1 is OOM-killable (it would be a lot of overhead to track and clean up spurious kills of those two processes). When a cgroup runs out of memory but the kernel has nothing it can kill, the kernel gets confused and doesn't behave well.
16 MB-35 MB: the pod doesn't start, but the kernel does have something to OOM kill, since the runc init process (process 2) carries the container's OOM score. This doesn't upset the kernel, but it still doesn't let the pod start.
>35 MB: the container can start :)
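
To make the arithmetic concrete, here is a minimal sketch of the three regimes above. The thresholds are the rough figures measured in this reproduction (~16 MB per runc process, ~1.5 MB for conmon, startup succeeding above roughly 35 MB), not exact values.

// Back-of-envelope sketch of the three startup regimes described above.
package main

import "fmt"

const (
	// Rough thresholds observed in the reproduction, in MB.
	oneRuncMB = 16.0 // below this even the first runc process does not fit
	startOKMB = 35.0 // above this conmon plus both runc processes fit
)

func startupVerdict(podLimitMB float64) string {
	switch {
	case podLimitMB < oneRuncMB:
		// Neither conmon nor the first runc process is OOM-killable, so the
		// kernel has nothing to kill when the cgroup runs out of memory.
		return "does not start; kernel has no OOM-killable process"
	case podLimitMB <= startOKMB:
		// runc init (the second runc process) carries the container's OOM
		// score, so the kernel kills it, but the pod still cannot start.
		return "does not start; runc init is OOM killed"
	default:
		return "container can start"
	}
}

func main() {
	for _, limit := range []float64{12, 20, 50} {
		fmt.Printf("%4.0f MB limit: %s\n", limit, startupVerdict(limit))
	}
}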

Yes, moving conmon out of the pod slice does alleviate this issue, but it's hard to estimate the toll doing so would take on the system slice. My recommendation is to allocate enough resources for pods to be able to start.
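
As a minimal sketch of that recommendation, assuming the k8s.io/api and k8s.io/apimachinery Go modules are available, a container spec can set its memory limit well above the ~35 MB startup overhead. The 64Mi value and the image name are illustrative choices, not values from this bug.

// Sketch: a container spec whose memory limit comfortably covers the
// conmon + runc startup overhead described above.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	container := corev1.Container{
		Name:  "app",
		Image: "registry.example.com/app:latest", // placeholder image
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("64Mi"),
			},
			Limits: corev1.ResourceList{
				// Keep the limit well above the ~35 MB needed for conmon
				// plus the two runc processes during startup.
				corev1.ResourceMemory: resource.MustParse("64Mi"),
			},
		},
	}
	fmt.Printf("%s memory limit: %s\n", container.Name, container.Resources.Limits.Memory())
}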