Bug 1802687 - A pod that gradually leaks memory causes node to become unreachable for 10 minutes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1809606 1811924
Depends On: 1800319 1806786 1808429 1810136
Blocks: OCP/Z_4.2 1766237 1801826 1801829 1808444 1814187 1814804
Reported: 2020-02-13 17:35 UTC by Clayton Coleman
Modified: 2020-05-21 18:09 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The kubepods.slice memory cgroup limit was not being set correctly; it was set to the node's maximum memory.
Consequence: The Kubelet did not reserve memory and CPU resources for system components (including kubelet and crio), causing kernel pauses and other OOM conditions in the cloud.
Fix: The kubepods.slice memory limit is now set correctly.
Result: Pods should be evicted when their memory usage reaches [max-memory] - [system reservation].
Clone Of: 1800319
Clones: 1825989
Environment:
Last Closed: 2020-05-21 18:09:33 UTC
Target Upstream Version:
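
The arithmetic in the Doc Text above can be illustrated with a short sketch. This is not the kubelet's code: the function name, the 16 GiB node size, and the use of a single combined reservation figure are assumptions made for illustration. The 1Gi reservation is the figure mentioned in the description.

```python
# Sketch of the limit described in the Doc Text:
#   kubepods.slice memory limit = [max-memory] - [system reservation]
# The function name and the example node size are illustrative assumptions.

GI = 1024 ** 3  # bytes in one GiB


def kubepods_memory_limit(node_capacity_bytes: int, system_reservation_bytes: int) -> int:
    """Return the memory limit expected on kubepods.slice, in bytes."""
    if node_capacity_bytes < 0 or system_reservation_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if system_reservation_bytes > node_capacity_bytes:
        raise ValueError("reservation exceeds node capacity")
    return node_capacity_bytes - system_reservation_bytes


if __name__ == "__main__":
    capacity = 16 * GI    # hypothetical worker node
    reservation = 1 * GI  # the 1Gi reservation discussed in this bug
    limit = kubepods_memory_limit(capacity, reservation)
    # Before the fix the limit equalled the full capacity, leaving no headroom.
    print(f"limit before fix: {capacity} bytes")
    print(f"limit after fix:  {limit} bytes")
```

With the limit equal to the full node capacity, pods can consume the memory that the kubelet and crio need, which matches the Consequence described above.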


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 24568 | 0 | None | closed | Bug 1802687: UPSTREAM: 88251: Partially fix incorrect configuration of kubepods.slice unit by kubelet | 2021-02-11 09:25:20 UTC

Description Clayton Coleman 2020-02-13 17:35:07 UTC
Exhausting memory on the node should not cause a permanent failure.  The previous bug was fixed by increasing reservations - we must still understand the root failure and add e2e tests that prevent it.  High severity because customer environments will easily exceed 1Gi reservations.

This is a 4.4 GA blocker even with the higher reservation (which more realistically describes what we actually use on the system)

+++ This bug was initially created as a clone of Bug #1800319 +++

Creating a memory hogger pod (which should be evicted or OOM killed and otherwise handled safely by the node) instead causes the node to become unreachable for >10m.  On the node, the kubelet appears to be running but can't heartbeat to the apiserver.  The node also appears to think that the apiserver deleted all the pods (DELETE("api") in logs), which is not correct: no pods except the oomkilled one should be evicted / deleted.

Recreate

1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml)
2. Wait 2-3 minutes while memory fills up on the worker
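
The attached kill-node.yaml is not reproduced on this page. As a rough stand-in for the kind of workload it implies, the sketch below shows a process that leaks memory gradually by allocating fixed-size chunks and never releasing them. The function name, chunk size, cap, and delay are all assumptions, and the cap is kept small so the sketch is safe to run locally; it is not the contents of the attachment.

```python
import time


def leak_gradually(step_bytes: int, max_bytes: int, delay_s: float = 0.0) -> list:
    """Allocate step_bytes at a time until max_bytes is held; return the chunks.

    Hypothetical helper for illustration only, not the attached reproducer.
    """
    held = []
    total = 0
    while total + step_bytes <= max_bytes:
        # bytearray(n) creates a zero-filled buffer of n bytes.
        held.append(bytearray(step_bytes))
        total += step_bytes
        time.sleep(delay_s)  # pause so usage grows slowly rather than at once
    return held


if __name__ == "__main__":
    # Small, bounded values; a real reproducer would run without a low cap.
    chunks = leak_gradually(step_bytes=1024 * 1024, max_bytes=8 * 1024 * 1024, delay_s=0.1)
    print(f"holding {sum(len(c) for c in chunks)} bytes")
```

A pod running something like this without a memory limit would keep growing until the node itself runs short, which is the condition step 2 waits for.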

Expected:

1. memory-hog pod is oomkilled and/or evicted (either would be acceptable)
2. the node remains ready

Actual:

1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for it to recover
2. After recovery, events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this (and add eviction tests, because this doesn't seem to evict anything).

--- Additional comment from Clayton Coleman on 2020-02-06 16:14:33 EST ---

Once this is fixed we need to test against 4.3 and 4.2 and backport if it happens - this can DoS a node.

Comment 1 Colin Walters 2020-02-14 13:34:32 UTC
Today we don't take control over the OOM handling.  To the best of my knowledge, if one has pods configured without hard limits (common) then what's going to happen is the default OOM killer will be invoked and it can kill any process.

For most of our core processes, systemd will restart them if they're killed, but we don't regularly test that.  Adding reservations makes it less likely we'll overcommit in a situation with hard limits.

The recent trend has been userspace policy driven OOM handling, e.g.
https://source.android.com/devices/tech/perf/lmkd
and most recently for us:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom

That one's about swap but it's certainly possible to have issues even without swap.

https://github.com/facebookincubator/oomd

is also relevant.

This all said - let's get a bit more data here about what's happening; in particular which process is being killed.

Comment 6 Ryan Phillips 2020-03-06 18:42:39 UTC
*** Bug 1809606 has been marked as a duplicate of this bug. ***

Comment 8 Ryan Phillips 2020-03-17 13:41:43 UTC
*** Bug 1814187 has been marked as a duplicate of this bug. ***

Comment 9 Ryan Phillips 2020-03-17 13:41:46 UTC
*** Bug 1811924 has been marked as a duplicate of this bug. ***

