Bug 1478605
| Summary: | RFE: Kill and enforce backoff for pods that contain processes which cause high OOM event rates | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | RFE | Assignee: | Eric Paris <eparis> |
| Status: | CLOSED NOTABUG | QA Contact: | Xiaoli Tian <xtian> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.0 | CC: | aos-bugs, ekin.meroglu, jokerman, mmccomas, rbost, sjenning, tkatarki |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-10 19:32:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ryan Howe
2017-08-05 01:02:19 UTC
This is a limitation of how the oom-killer works: it has no knowledge of what a container is, and docker has only partial visibility into the processes running in a container. If the docker cmd/entrypoint (pid 1) for the container is killed by OOM, then docker is aware, as it is the parent process, and will report that the container was OOM killed. That propagates up to OpenShift, and pod restart backoff can occur, limiting the number of times the pod with problematic resource requests/limits can try to start. However, if the killed process is not pid 1 inside the container (a child of pid 1), docker does not know. Pid 1 in the container can try to re-fork the child as fast as it wants, and there will be a ton of OOM-killing activity. OpenShift uses oom_score_adj to influence killing between QoS levels, but oom_score_adj is inherited by child processes, so it has no effect when comparing parents and children (a small demo of this inheritance appears at the end of this comment).

I can only think of two ways this could be handled at the moment:

1. docker/OpenShift sets oom_kill_disable to true. This causes programs that would normally get OOM killed to hang until the memory pressure is resolved at the cgroup level. It also causes under_oom to become true. docker could watch for under_oom=true and kill the entire container in that case (see the under_oom sketch below).

2. cadvisor in OpenShift returns an OOM event from watching the kernel log, and we magically work back from the pid of the child in the container to the pod that child was in and do backoff (see the pid-to-pod sketch below).

Both of these would be features, though, not bug fixes, and they would be invasive. The correct action is to set pod resource requests/limits such that OOM killing does not occur. Rate limiting or detection of child kills/respawns by pid 1 inside the container is not something OpenShift has visibility into.

I agree this should be a feature request (RFE), where we would want to see some rate limit on the number of times the oom-killer is invoked on a single container; when that limit is hit, the pod/container is killed. As a cluster admin, I want to be able to set default limits (see the LimitRange sketch below). With these limits set, I do not want to worry about a user's app being deployed to a node and invoking the oom-killer many times, possibly causing the node to become unstable.

Converting this to an RFE.

This kind of use case, that is, a parent process causing many hundreds of child processes to be spawned, is not a typical use case for containers and Kubernetes. There are safeguards built in, such as Pod Pid Limiting (https://kubernetes.io/blog/2019/04/15/process-id-limiting-for-stability-improvements-in-kubernetes-1.14/), that prevent this kind of abuse. Having said that, there is no way to support this kind of use today. In those cases, other approaches are possible, and future ones are being worked on in the community (such as co-scheduling and gang scheduling). Closing the RFE as nothing can be done.
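For illustration, a minimal Go sketch of the oom_score_adj inheritance described above. The value 500 and the use of `cat` as a stand-in child process are arbitrary choices for the demo (raising one's own oom_score_adj needs no privilege):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Raise this process's oom_score_adj.
	if err := os.WriteFile("/proc/self/oom_score_adj", []byte("500"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The child reads its *own* oom_score_adj and prints 500: the value
	// is inherited across fork/exec, so parent and children look the
	// same to the OOM killer.
	out, err := exec.Command("cat", "/proc/self/oom_score_adj").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("child sees oom_score_adj = %s", out)
}
```

Because the child reports the same value the parent set, oom_score_adj cannot distinguish pid 1 from the children it keeps re-forking.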
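A minimal sketch of the under_oom check from option 1, assuming the cgroup v1 memory controller; the `CONTAINER_ID` path component is a placeholder, since the real cgroup path would come from the container runtime:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// underOOM reports whether the memory cgroup at cgroupPath is blocked on
// an OOM condition. Only meaningful when oom_kill_disable is set to 1.
func underOOM(cgroupPath string) (bool, error) {
	data, err := os.ReadFile(cgroupPath + "/memory.oom_control")
	if err != nil {
		return false, err
	}
	// memory.oom_control looks like:
	//   oom_kill_disable 1
	//   under_oom 0
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "under_oom" {
			return fields[1] == "1", nil
		}
	}
	return false, fmt.Errorf("under_oom not found in %s/memory.oom_control", cgroupPath)
}

func main() {
	oom, err := underOOM("/sys/fs/cgroup/memory/docker/CONTAINER_ID")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("under_oom:", oom)
}
```

A watcher polling (or listening on cgroup.event_control for) this flag could then kill the whole container instead of letting individual children hang.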
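A sketch of the pid-to-pod mapping from option 2, assuming the cgroupfs driver's `/kubepods/<qos>/pod<uid>/<container-id>` naming (the systemd driver uses different slice names, so this is illustrative only); `podUIDForPID` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// podUIDForPID walks /proc/<pid>/cgroup looking for a kubepods entry,
// e.g. "11:memory:/kubepods/burstable/pod<uid>/<container-id>", and
// returns the pod UID embedded in the path.
func podUIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if !strings.Contains(line, "kubepods") {
			continue
		}
		for _, seg := range strings.Split(line, "/") {
			if strings.HasPrefix(seg, "pod") {
				return strings.TrimPrefix(seg, "pod"), nil
			}
		}
	}
	return "", fmt.Errorf("pid %d is not in a kubepods cgroup", pid)
}

func main() {
	uid, err := podUIDForPID(os.Getpid()) // a pid from an OOM kernel log line would go here
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("pod UID:", uid)
}
```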
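And a sketch of the cluster-admin default limits mentioned above, expressed with the upstream client-go types; the name "default-mem", the namespace, and the sizes are placeholders:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A LimitRange that gives every container in the namespace a default
	// memory request/limit, so even unconfigured pods land in a bounded
	// memory cgroup and OOM kills stay contained.
	lr := &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{Name: "default-mem", Namespace: "myproject"},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				Default: corev1.ResourceList{ // default limit
					corev1.ResourceMemory: resource.MustParse("512Mi"),
				},
				DefaultRequest: corev1.ResourceList{ // default request
					corev1.ResourceMemory: resource.MustParse("256Mi"),
				},
			}},
		},
	}
	fmt.Printf("%+v\n", lr)
}
```

This does not rate-limit the oom-killer itself; it only ensures every container has a memory bound, which is the workaround the comment above recommends.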