Exhausting memory on the node should not cause a permanent failure. The previous bug was fixed by increasing reservations, but we must still understand the root failure and add e2e tests that prevent it. This is high severity because customer environments will easily exceed 1Gi reservations. This is a 4.4 GA blocker even with the higher reservation (which more realistically describes what we actually use on the system).

+++ This bug was initially created as a clone of Bug #1800319 +++

Creating a memory hogger pod (that should be evicted / OOM killed) causes the node to become unreachable for >10m instead of being safely handled by the node. On the node, the kubelet appears to be running but can't heartbeat the apiserver. Also, the node appears to think that the apiserver deleted all the pods (DELETE("api") in logs), which is not correct - no pods except the oomkilled one should be evicted / deleted.

Recreate
1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml)
2. Wait 2-3 minutes while memory fills up on the worker

Expected:
1. memory-hog pod is oomkilled and/or evicted (either would be acceptable)
2. the node remains ready

Actual:
1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for it to recover
2. After recovery, events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this (and add eviction tests, because this doesn't seem to evict anything).

--- Additional comment from Clayton Coleman on 2020-02-06 16:14:33 EST ---

Once this is fixed we need to test against 4.3 and 4.2 and backport the fix if it reproduces there - this can DoS a node.
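For anyone scripting the Expected/Actual check from the reproduction steps, here is a rough sketch (not the origin e2e test itself) of how the node state could be inspected from the JSON that `oc get node <name> -o json` returns. The helper names and the sample node object are made up for illustration; the `Ready` condition type and the `node.kubernetes.io/unreachable` taint key are the standard Kubernetes ones.

```python
import json

# Standard taint key the node controller applies to unreachable nodes.
UNREACHABLE_TAINT = "node.kubernetes.io/unreachable"


def node_is_ready(node):
    """Return True only if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def has_unreachable_taint(node):
    """Return True if the node carries the unreachable taint."""
    taints = node.get("spec", {}).get("taints") or []
    return any(t.get("key") == UNREACHABLE_TAINT for t in taints)


# Hypothetical, trimmed-down node object resembling the "Actual" state above:
# heartbeats stopped, so Ready is "Unknown" and the unreachable taint is set.
sample_node = json.loads("""
{
  "spec": {"taints": [{"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule"}]},
  "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}
}
""")

if __name__ == "__main__":
    print("ready:", node_is_ready(sample_node))
    print("unreachable taint:", has_unreachable_taint(sample_node))
```

A test following the steps above would presumably poll this while the memory-hog pod runs and fail if the node ever stops being ready or picks up the taint.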
Today we don't take control over the OOM handling. To the best of my knowledge, if pods are configured without hard limits (which is common), the default kernel OOM killer will be invoked, and it can kill any process. For most of our core processes, systemd will restart them if they're killed, but we don't regularly test that. Adding reservations makes it less likely we'll overcommit in a situation with hard limits.

The recent trend has been userspace policy driven OOM handling, e.g. https://source.android.com/devices/tech/perf/lmkd and most recently for us: https://fedoraproject.org/wiki/Changes/EnableEarlyoom That one's about swap, but it's certainly possible to have issues even without swap. https://github.com/facebookincubator/oomd is also relevant.

This all said - let's get a bit more data here about what's happening; in particular, which process is being killed.
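To gather that data, one option is to scan the kernel log on the affected node (for example the output of `dmesg` or `journalctl -k`) for OOM kill records. The sketch below assumes the usual kernel message shape, `Killed process <pid> (<comm>)`; the function name and the sample lines are illustrative only and are not taken from an affected node.

```python
import re

# Matches the kernel's OOM kill record, e.g.
# "Out of memory: Killed process 4321 (memory-hog) total-vm:..."
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def parse_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM kill found in the log lines."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Illustrative log lines, not real output from the bug.
    sample_log = [
        "[ 1201.10] kubelet invoked oom-killer: gfp_mask=0x0, order=0",
        "[ 1201.25] Out of memory: Killed process 4321 (memory-hog) total-vm:900000kB",
        "[ 1202.40] Memory cgroup out of memory: Killed process 987 (crio) total-vm:5000kB",
    ]
    for pid, name in parse_oom_kills(sample_log):
        print(pid, name)
```

If a core process such as the kubelet or the container runtime shows up in that list instead of the memory-hog container, that would point at the default OOM killer picking the wrong victim.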
*** Bug 1809606 has been marked as a duplicate of this bug. ***
*** Bug 1814187 has been marked as a duplicate of this bug. ***
*** Bug 1811924 has been marked as a duplicate of this bug. ***