Description of problem:
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1457462, but here a memory cgroup is placed on a pod, that pod spawns many child processes, and the oom-killer ends up being invoked 100-1000+ times a day.

Version-Release number of selected component (if applicable):
OCP 3.5
Docker 1.12
RHEL 7

How reproducible:
Have not created a reproducer yet

Steps to Reproduce:
1. Create a container image that spawns many child processes when started
2. Create a pod with memory limits just large enough for the parent to start and spawn X child processes
3. Deploy it to a node
4. The oom-killer is invoked and kills a child process rather than the parent due to oom_score

Actual results:
The oom-killer is invoked 100s to 1000s of times.

kernel: Memory cgroup out of memory: Kill process 107560 (gunicorn) score 1041 or sacrifice child
kernel: Killed process 107560 (gunicorn) total-vm:172528kB, anon-rss:24272kB, file-rss:0kB, shmem-rss:0kB

$ grep "Jul 30.*invoked oom-killer" journalctl_--no-pager_--boot | wc -l
10163

Expected results:
OpenShift sees that the oom-killer is being invoked too many times and kills or restarts the pod.
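For illustration only (this is not the actual customer image), a minimal reproducer along the lines of the steps above could look like the following: the parent (pid 1 in the container) stays small while each child allocates enough anonymous memory that the cgroup limit is hit by a child, which the parent then immediately re-forks. The child count and allocation size are assumptions to be tuned against the pod's memory limit.

#!/usr/bin/env python3
# Hypothetical reproducer sketch: parent stays small, children allocate until
# the pod's memory cgroup limit is hit, so the oom-killer picks a child and
# the parent immediately re-forks it.
import os
import time

CHILDREN = 20          # assumed value; tune against the pod memory limit
CHILD_ALLOC_MB = 64    # each child allocates this much anonymous memory

def child():
    buf = bytearray(CHILD_ALLOC_MB * 1024 * 1024)  # anon RSS charged to the cgroup
    while True:
        buf[0] = 1     # keep the allocation alive
        time.sleep(1)

def main():
    kids = set()
    while True:
        while len(kids) < CHILDREN:
            pid = os.fork()
            if pid == 0:
                child()
                os._exit(0)
            kids.add(pid)
        pid, _status = os.wait()   # a child was OOM-killed; re-fork it right away
        kids.discard(pid)

if __name__ == "__main__":
    main()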
This is a limitation of how the oom-killer works (it has no knowledge of what a container is) and of how much visibility docker has into the processes running in a container.

If the docker cmd/entrypoint (pid 1 in the container) is killed by OOM, then docker is aware, as it is the parent process, and will report that the container was OOM killed. That propagates up to OpenShift, and pod restart backoff can occur, limiting the number of times a pod with problematic resource requests/limits can try to start. However, if the killed process is not pid 1 inside the container (i.e. it is a child of pid 1), docker does not know. Pid 1 in the container can re-fork the child as fast as it wants, and there will be a ton of OOM-killing activity. OpenShift uses oom_score_adj to influence killing between QoS levels, but oom_score_adj is inherited by child processes, so it has no effect when comparing parents and children.

I can only think of two ways this could be handled at the moment:

1: docker/OpenShift sets oom_kill_disable to true. This causes programs that would normally get OOM killed to hang until the memory pressure is resolved at the cgroup level; it also causes under_oom to become true. docker could watch for under_oom=true and kill the entire container in that case.

2: cadvisor in OpenShift returns an OOM event from watching the kernel log, and we magically work back from the pid of the child in the container to the pod that child was in and do backoff.

Both of these would be features, though, not bug fixes, and they would be invasive. The correct action is to set pod resource requests/limits such that OOM killing does not occur. Rate limiting or detection of child kills/respawns by pid 1 inside the container is not something OpenShift has visibility into.
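Purely to make option 1 concrete, a minimal sketch (assuming the cgroup v1 memory controller and a made-up container cgroup path; this is not an existing docker/OpenShift feature): set oom_kill_disable on the container's memory cgroup, then poll under_oom and kill the whole container when it flips to 1.

#!/usr/bin/env python3
# Sketch of option 1 (hypothetical): with oom_kill_disable set, tasks hang
# instead of being killed and under_oom flips to 1; a watcher could then
# kill every task in the container's cgroup. Assumes cgroup v1.
import os
import signal
import time

CGROUP = "/sys/fs/cgroup/memory/docker/<container-id>"  # hypothetical path

def set_oom_kill_disable(cg):
    # Writing "1" to memory.oom_control disables in-cgroup OOM killing.
    with open(os.path.join(cg, "memory.oom_control"), "w") as f:
        f.write("1\n")

def read_oom_control(cg):
    # memory.oom_control contains lines like "oom_kill_disable 1" / "under_oom 0".
    fields = {}
    with open(os.path.join(cg, "memory.oom_control")) as f:
        for line in f:
            key, _, val = line.partition(" ")
            fields[key] = int(val)
    return fields

def kill_cgroup(cg):
    # Kill every task in the cgroup, i.e. the whole container.
    with open(os.path.join(cg, "tasks")) as f:
        for line in f:
            try:
                os.kill(int(line), signal.SIGKILL)
            except ProcessLookupError:
                pass

set_oom_kill_disable(CGROUP)
while True:
    if read_oom_control(CGROUP).get("under_oom") == 1:
        kill_cgroup(CGROUP)
        break
    time.sleep(1)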
I agree this should be a feature request (RFE). We would want some rate limit on the number of times the oom-killer can be invoked on a single container; when that limit is hit, the pod/container is killed. As a cluster admin I want to be able to set default limits. With those limits set, I do not want to worry about a user's app being deployed to a node and invoking the oom-killer so many times that the node possibly becomes unstable.
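To make the ask concrete, a rough sketch of the kind of detection being requested (the threshold, window, and journalctl invocation are all assumptions, not an existing OpenShift feature): count "invoked oom-killer" kernel messages in a sliding window and flag the pod for restart/kill when the count exceeds a limit.

#!/usr/bin/env python3
# Hypothetical rate-limit sketch: count kernel "invoked oom-killer" messages
# in a sliding window and flag when a threshold is exceeded.
import subprocess
import time

WINDOW_SECONDS = 600      # assumed sliding window
MAX_OOM_KILLS = 10        # assumed threshold before the pod should be killed

def oom_kill_count(since_seconds):
    out = subprocess.run(
        ["journalctl", "-k", "--since", "%d seconds ago" % since_seconds, "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    return sum(1 for line in out.splitlines() if "invoked oom-killer" in line)

while True:
    count = oom_kill_count(WINDOW_SECONDS)
    if count > MAX_OOM_KILLS:
        print("oom-killer invoked %d times in %ds: pod should be killed/restarted"
              % (count, WINDOW_SECONDS))
    time.sleep(60)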
Converting this to an RFE
This kind of use case, that is, a parent process causing many hundreds of child processes to be spawned, is not a typical use case for containers and Kubernetes. There are safeguards built in, such as Pod Pid Limiting (https://kubernetes.io/blog/2019/04/15/process-id-limiting-for-stability-improvements-in-kubernetes-1.14/), that prevent malicious intent. Having said that, there is no way to support this kind of use. In those cases there are other approaches possible, and future ones being worked on in the community (such as co-scheduling or gang scheduling), etc. Closing the RFE as nothing can be done.
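For reference, the safeguard mentioned above shows up on the node as a pids cgroup limit; a small sketch (assuming the cgroup v1 pids controller and a hypothetical pod cgroup path) of checking a pod's current pid count against its maximum:

#!/usr/bin/env python3
# Sketch of inspecting the Pod Pid Limiting safeguard from the node:
# compare the pod cgroup's pids.current to its configured pids.max.
import os

POD_CGROUP = "/sys/fs/cgroup/pids/kubepods/pod<pod-uid>"  # hypothetical path

def read_value(name):
    with open(os.path.join(POD_CGROUP, name)) as f:
        return f.read().strip()

current = int(read_value("pids.current"))
limit = read_value("pids.max")            # "max" means unlimited
print("pod pids: %d / %s" % (current, limit))
if limit != "max" and current >= int(limit):
    print("pod is at its pid limit; further forks will fail with EAGAIN")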