Bug 1478605 - RFE: Kill and enforce backoff for pods that contain processes which cause high OOM event rates
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Severity: medium
Assigned To: Eric Paris
Xiaoli Tian
Depends On:
Reported: 2017-08-04 21:02 EDT by Ryan Howe
Modified: 2018-01-25 16:41 EST (History)
5 users

Type: Bug

Attachments: None
Description Ryan Howe 2017-08-04 21:02:19 EDT
Description of problem:
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1457462: when a memory cgroup limit is placed on a pod and that pod spawns many child processes, the oom-killer can be invoked 100-1,000 times a day.

Version-Release number of selected component (if applicable):
OCP 3.5
Docker 1.12

How reproducible:
A reproducer has not been created yet.

Steps to Reproduce:
1. Create a container image that when started will spawn many child processes
2. Create a pod with a memory limit just large enough for the parent to start and spawn X child processes.
3. Deploy to node. 
4. oom-killer is invoked and, because of oom_score, kills a child process rather than the parent.
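The steps above could be sketched as a container entrypoint like the following (a hypothetical script, not taken from the bug report): a parent that keeps re-forking children, each of which allocates memory, so every OOM kill of a child is immediately followed by another fork.

```python
import os
import time

def child_workload(mb, hold=60.0):
    # Allocate roughly `mb` MiB and hold it; under a tight memory
    # cgroup limit this allocation is what draws the oom-killer.
    buf = bytearray(mb * 1024 * 1024)
    time.sleep(hold)
    return len(buf)

def run(children=10, mb=50, rounds=None, hold=60.0):
    # The parent (pid 1 in the container) waits for its children and
    # immediately re-forks any that die, so each OOM kill of a child
    # is followed by a fresh fork -- the pattern behind the hundreds
    # to thousands of oom-killer invocations described in this bug.
    done = 0
    while rounds is None or done < rounds:
        pids = []
        for _ in range(children):
            pid = os.fork()
            if pid == 0:            # child process
                child_workload(mb, hold)
                os._exit(0)
            pids.append(pid)        # parent keeps the pid
        for pid in pids:
            os.waitpid(pid, 0)
        done += 1

if __name__ == "__main__":
    run()
```

Deployed in a pod whose memory limit sits just above the parent's own footprint, the children's allocations push the cgroup over its limit on every round.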

Actual results:
oom-killer is invoked hundreds to thousands of times a day:

kernel: Memory cgroup out of memory: Kill process 107560 (gunicorn) score 1041 or sacrifice child
kernel: Killed process 107560 (gunicorn) total-vm:172528kB, anon-rss:24272kB, file-rss:0kB, shmem-rss:0kB

$ grep "Jul 30.*invoked oom-killer" journalctl_--no-pager_--boot | wc -l

Expected results:
OpenShift detects that oom-killer is being invoked too many times and kills or restarts the pod.
Comment 1 Seth Jennings 2017-08-08 15:41:26 EDT
This is a limitation of how the oom-killer works (it has no knowledge of what a container is) and of how little visibility docker has into the processes running inside a container.

If the docker cmd/entrypoint (pid 1) for the container is killed by OOM, then docker is aware, as it is the parent process, and will report that the container was OOM killed. That propagates up to OpenShift and pod restart backoff can occur, limiting the number of times the pod with problematic resource requests/limits can try to start.
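For reference, the restart backoff that this propagation enables is an exponential delay per restart. A minimal sketch, assuming the stock kubelet defaults of a 10-second base doubling up to a 5-minute cap (treat the exact constants as an assumption):

```python
def restart_backoff(restart_count, base=10.0, cap=300.0):
    # CrashLoopBackOff-style delay: doubles per restart from `base`
    # seconds (10s, 20s, 40s, ...) and is capped at `cap` seconds.
    return min(base * (2 ** restart_count), cap)
```

This is what limits how often a pod with problematic limits can retry, but only when the OOM kill is visible to docker in the first place.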

However, if the killed process is not pid 1 inside the container (a child of pid 1), docker does not know.  pid 1 in the container can try to refork the child as fast as it wants and there will be a ton of OOM killing activity.

OpenShift uses oom_score_adj to influence killing between QoS levels, but oom_score_adj is inherited by child processes, so it has no effect when comparing a parent against its own children.

I can only think of two ways this could be handled at the moment:


docker/OpenShift sets oom_kill_disable to true.  This causes processes that would normally be OOM killed to hang until the memory pressure at the cgroup level is resolved; it also causes under_oom to become true.  docker could watch for under_oom=1 and kill the entire container in that case.
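A sketch of what such a watcher could check, using the real cgroup-v1 memory.oom_control interface (the polling logic itself is hypothetical, not an existing docker feature):

```python
import os

def parse_oom_control(text):
    # memory.oom_control (cgroup v1) reads like:
    #   oom_kill_disable 1
    #   under_oom 1
    # (newer kernels append an `oom_kill` counter line as well)
    return {k: int(v) for k, v in
            (line.split() for line in text.splitlines() if line.strip())}

def container_under_oom(cgroup_dir):
    # With oom_kill_disable set (echo 1 > memory.oom_control), tasks
    # that would be OOM killed hang instead and under_oom flips to 1;
    # a runtime could poll this flag and kill the whole container.
    with open(os.path.join(cgroup_dir, "memory.oom_control")) as f:
        return parse_oom_control(f.read()).get("under_oom") == 1
```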


cadvisor in OpenShift reports an OOM event by watching the kernel log, and we work back from the pid of the killed child process to the pod it belonged to and apply restart backoff.
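Working back from a killed pid to its pod could use the cgroup path recorded in /proc/&lt;pid&gt;/cgroup. A hedged sketch; the kubepods path layout varies across Kubernetes releases and cgroup drivers, so the regex here is an assumption:

```python
import re

# e.g. "11:memory:/kubepods/burstable/pod<uid>/<container-id>"
POD_UID_RE = re.compile(r"kubepods[^:]*/pod([0-9a-f][0-9a-f-]{35})")

def pod_uid_from_cgroup(cgroup_text):
    # Scan the contents of /proc/<pid>/cgroup for a kubepods pod UID;
    # return None if the process is not in a pod cgroup.
    for line in cgroup_text.splitlines():
        m = POD_UID_RE.search(line)
        if m:
            return m.group(1)
    return None
```

The race here is that the child pid may already be gone by the time the OOM event is read from the log, which is part of why this approach is invasive.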

Both of these would be features though, not bug fixes.  They would be invasive.

The correct action is to set pod resource requests/limits such that OOM killing does not occur.  Rate limiting or detection of child kills/respawns by pid 1 inside the container is not something OpenShift has visibility into.
Comment 2 Ryan Howe 2017-08-08 16:01:09 EDT
I agree this should be a feature request (RFE): we would want a rate limit on the number of times oom-killer can be invoked on a single container, and when that limit is hit, the pod/container is killed.

As a cluster admin I want to be able to set default limits. With these limits set, I do not want to worry about a user's app being deployed to a node and invoking oom-killer many times, possibly causing the node to become unstable.
Comment 3 Seth Jennings 2017-08-08 16:13:23 EDT
Converting this to an RFE
