Bug 1802687

Summary: A pod that gradually leaks memory causes node to become unreachable for 10 minutes
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Node    Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4    CC: aos-bugs, cfillekes, fdeutsch, hannsj_uhl, Holger.Wolf, jokerman, jshepherd, lxia, rphillips, schoudha, surbania, tdale, vlaad, wabouham, walters, wking, wsun, zyu
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: The kubepods.slice memory cgroup limit was not being set correctly; it was being set to the maximum memory on the node.
Consequence: The kubelet did not reserve memory and CPU resources for system components (including the kubelet and CRI-O), causing kernel pauses and other OOM conditions in the cloud.
Fix: The kubepods.slice memory limit is now set correctly.
Result: Pods should be evicted when pod memory usage exceeds [max-memory] - [system reservation].
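For example, on a node with 16Gi of memory and a 1Gi system reservation, the kubepods.slice limit would be set to roughly 15Gi rather than the full 16Gi, and eviction should begin as pod memory usage approaches that threshold.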
Story Points: ---
Clone Of: 1800319
: 1825989    Environment:
Last Closed: 2020-05-21 18:09:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1800319, 1806786, 1808429, 1810136    
Bug Blocks: 1765215, 1766237, 1801826, 1801829, 1808444, 1814187, 1814804    

Description Clayton Coleman 2020-02-13 17:35:07 UTC
Exhausting memory on the node should not cause a permanent failure.  The previous bug was fixed by increasing reservations - we must still understand the root failure and add e2e tests that prevent it.  High severity because customer environments will easily exceed 1Gi reservations.

This is a 4.4 GA blocker even with the higher reservation (which more realistically describes what we actually use on the system).
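
For reference, a minimal sketch of how such a system reservation can be expressed on OpenShift, assuming the stock KubeletConfig custom resource; the object name, pool selector label, and values below are illustrative, not the settings actually shipped:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved        # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: system-reserved # assumes the worker pool carries this label
  kubeletConfig:
    systemReserved:
      memory: 1Gi                     # illustrative; mirrors the 1Gi reservation discussed above
      cpu: 500m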

+++ This bug was initially created as a clone of Bug #1800319 +++

Creating a memory-hogger pod that should be OOM-killed or evicted, and otherwise safely handled by the node, instead causes the node to become unreachable for >10 minutes. On the node, the kubelet appears to be running but cannot heartbeat the apiserver. The node also appears to think that the apiserver deleted all of its pods (DELETE("api") in the logs), which is not correct - no pods except the OOM-killed one should be evicted or deleted.

Recreate

1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml); a sketch of a similar manifest is shown after these steps
2. Wait 2-3 minutes while memory fills up on the worker
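
The kill-node.yaml attachment is not included in this report; a minimal sketch of a pod that gradually leaks memory, which should produce similar pressure, might look like the following (the image, names, and growth rate are assumptions, and the memory limit is deliberately omitted so node memory gets exhausted):

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog                      # hypothetical; the real attachment may differ
spec:
  restartPolicy: Never
  containers:
  - name: hog
    image: registry.access.redhat.com/ubi8/ubi   # any image with bash and coreutils works
    command:
    - /bin/bash
    - -c
    - |
      # Start with 1 MiB and double it every 10 seconds so node memory
      # fills up over roughly 2-3 minutes.
      x=$(head -c 1048576 /dev/zero | tr '\0' a)
      while true; do x=$x$x; sleep 10; done
    # No resources.limits on purpose: the point is to exhaust node memory,
    # not to hit a per-container cgroup limit.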

Expected:

1. memory-hog pod is oomkilled and/or evicted (either would be acceptable)
2. the node remains ready

Actual:

1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for it to recover
2. After recovery, events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this scenario (and add eviction tests, because this doesn't seem to evict anything).

--- Additional comment from Clayton Coleman on 2020-02-06 16:14:33 EST ---

Once this is fixed we need to test against 4.3 and 4.2 and backport if it happens - this can DoS a node.

Comment 1 Colin Walters 2020-02-14 13:34:32 UTC
Today we don't take control of OOM handling. To the best of my knowledge, if pods are configured without hard limits (which is common), the default kernel OOM killer is invoked and it can kill any process.

For most of our core processes, systemd will restart them if they're killed, but we don't regularly test that.  Adding reservations makes it less likely we'll overcommit in a situation with hard limits.
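
For contrast, a rough sketch of the hard-limit case, where exceeding the limit OOM-kills the offending container inside its own cgroup instead of invoking the node-wide OOM killer (names and values below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: bounded-hog                    # hypothetical name
spec:
  containers:
  - name: hog
    image: registry.access.redhat.com/ubi8/ubi
    command: ["/bin/bash", "-c", "x=.; while true; do x=$x$x; done"]
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi                  # hard limit: the kill is confined to this container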

The recent trend has been userspace policy driven OOM handling, e.g.
https://source.android.com/devices/tech/perf/lmkd
and most recently for us:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom

That one's about swap but it's certainly possible to have issues even without swap.

https://github.com/facebookincubator/oomd

is also relevant.

All that said, let's get a bit more data here about what's happening - in particular, which process is being killed.

Comment 6 Ryan Phillips 2020-03-06 18:42:39 UTC
*** Bug 1809606 has been marked as a duplicate of this bug. ***

Comment 8 Ryan Phillips 2020-03-17 13:41:43 UTC
*** Bug 1814187 has been marked as a duplicate of this bug. ***

Comment 9 Ryan Phillips 2020-03-17 13:41:46 UTC
*** Bug 1811924 has been marked as a duplicate of this bug. ***