1801824 – [4.3] A pod that gradually leaks memory causes node to become unreachable for 10 minutes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Ryan Phillips
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1811159
Depends On:	1806786
Blocks:	OCP/Z_4.2 1801826 1801829

Reported:	2020-02-11 17:03 UTC by Ryan Phillips
Modified:	2023-09-07 21:48 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1800319
Clones:	1801826
Environment:
Last Closed:	2020-05-11 21:20:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1458	0	None	closed	[release-4.3] Bug 1801824: kubelet: add more system reservation to protect node	2020-11-13 05:59:56 UTC
Red Hat Product Errata	RHBA-2020:2006	0	None	None	None	2020-05-11 21:20:54 UTC

Description Ryan Phillips 2020-02-11 17:03:41 UTC

+++ This bug was initially created as a clone of Bug #1800319 +++

Creating a memory-hog pod (which should be evicted or OOM-killed) causes the node to become unreachable for >10m instead of being handled safely by the node. On the node, the kubelet appears to be running but can't heartbeat to the apiserver. The node also appears to think that the apiserver deleted all the pods (DELETE("api") in logs), which is not correct: no pods except the OOM-killed one should be evicted or deleted.

Recreate

1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml)
2. Wait 2-3 minutes while memory fills up on the worker
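
The attached kill-node.yaml is not reproduced in this report. Purely as an illustration of the kind of workload the steps above describe, a process that leaks memory gradually could look roughly like the sketch below. The function name leak_memory, its parameters, and the max_bytes safety cap are assumptions made for this sketch and are not taken from the attachment.

```python
import time


def leak_memory(step_bytes=1024 * 1024, max_bytes=8 * 1024 * 1024, pause_s=0.0):
    """Allocate memory in fixed-size steps and never release it.

    max_bytes is a safety cap added for this sketch (an assumption); an
    actual reproducer would presumably keep growing until the kernel OOM
    killer or kubelet eviction intervenes.
    """
    held = []  # holding references prevents the chunks from being freed
    total = 0
    while total + step_bytes <= max_bytes:
        held.append(bytearray(step_bytes))  # zero-filled buffer of step_bytes
        total += step_bytes
        time.sleep(pause_s)  # a pause between steps makes the growth gradual
    return held, total


if __name__ == "__main__":
    chunks, total = leak_memory(pause_s=0.1)
    print(f"holding {len(chunks)} chunks, {total} bytes")
```

With the cap in place this is safe to run locally; it only mimics the growth pattern and does not by itself reproduce the node failure.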

Expected:

1. memory-hog pod is oomkilled and/or evicted (either would be acceptable)
2. the node remains ready

Actual:

1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for it to recover
2. After recovery, events are delivered

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this (and add eviction tests, because this doesn't seem to evict anything).

--- Additional comment from Clayton Coleman on 2020-02-06 21:14:33 UTC ---

Once this is fixed we need to test against 4.3 and 4.2 and backport if it happens - this can DoS a node.

Comment 1 Ryan Phillips 2020-03-06 18:24:04 UTC

*** Bug 1811159 has been marked as a duplicate of this bug. ***

Comment 2 Ryan Phillips 2020-03-10 14:50:19 UTC

*** Bug 1811159 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2020-03-14 06:25:21 UTC

How does this relate to bug 1808429, which has the same subject and also targets 4.3.z?  Is this one a dup of that one?

Comment 8 errata-xmlrpc 2020-05-11 21:20:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2006
