Bug 1801829

Summary: [4.1] A pod that gradually leaks memory causes node to become unreachable for 10 minutes
Product: OpenShift Container Platform
Reporter: Ryan Phillips <rphillips>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED WONTFIX
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Priority: unspecified
Version: 4.4
CC: aos-bugs, ccoleman, cfillekes, hannsj_uhl, jokerman, schoudha
Target Release: 4.1.z   
Clone Of: 1801826
Last Closed: 2020-02-25 21:46:30 UTC
Bug Depends On: 1800319, 1801824, 1801826, 1802687, 1806786, 1808429, 1808444, 1810136    
Bug Blocks: 1765215    

Description Ryan Phillips 2020-02-11 17:14:05 UTC
+++ This bug was initially created as a clone of Bug #1801826 +++

+++ This bug was initially created as a clone of Bug #1801824 +++

+++ This bug was initially created as a clone of Bug #1800319 +++

Creating a memory-hogging pod (which should be evicted or OOM-killed) causes the node to become unreachable for >10m instead of being handled safely by the node. On the node, the kubelet appears to be running but cannot heartbeat the apiserver. The node also behaves as if the apiserver deleted all of its pods (DELETE("api") in the logs), which is incorrect: no pods other than the OOM-killed one should be evicted or deleted.
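
While the node is in this state, its reported condition and taints can be checked from outside the node (a minimal check, assuming a worker named worker-0):

oc get node worker-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# prints "Unknown" while heartbeats are failing
oc describe node worker-0 | grep -A3 '^Taints:'
# shows node.kubernetes.io/unreachable (NoSchedule/NoExecute) while tainted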

Recreate:

1. Create the attached kill-node.yaml on the cluster (oc create -f kill-node.yaml); a hypothetical sketch of such a pod follows this list
2. Wait 2-3 minutes while memory fills up on the worker
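
The kill-node.yaml attachment itself is not reproduced here; the following is a hypothetical sketch of a pod of that shape (image, names, and allocation rate are illustrative assumptions, not the original attachment):

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  restartPolicy: Never
  containers:
  - name: memory-hog
    image: registry.access.redhat.com/ubi8/ubi
    command: ["/bin/sh", "-c"]
    # Grow a shell variable by ~1MiB every 100ms. No memory limit is set,
    # so the kubelet (eviction / OOM handling), not a container cgroup,
    # has to absorb the pressure.
    args:
    - |
      data=""
      while true; do
        data="$data$(head -c 1048576 /dev/zero | tr '\0' 'x')"
        sleep 0.1
      done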

Expected:

1. memory-hog pod is OOM-killed and/or evicted (either would be acceptable)
2. the node remains ready

Actual:

1. Node is tainted as unreachable, heartbeats stop, and it takes >10m for the node to recover (a rough way to measure this window is sketched below)
2. After recovery, events are delivered
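
A rough way to measure the recovery window from outside the node (node name and polling interval are illustrative assumptions):

# Poll the Ready condition every 10s with a timestamp; the stretch of
# "Unknown" readings is the unreachable window.
while true; do
  echo "$(date -u +%T) $(oc get node worker-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  sleep 10
done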

As part of fixing this, we need to add an e2e test to the origin disruptive suite that triggers this scenario (and add eviction tests, because this doesn't seem to evict anything).
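
For context on the eviction side: node self-protection is normally configured through kubelet eviction thresholds and reserved memory. A hedged OpenShift KubeletConfig sketch (the pool label and the values are illustrative assumptions, not shipped defaults):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-memory-protection
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled    # assumes the worker pool carries this label
  kubeletConfig:
    evictionHard:
      memory.available: "500Mi"  # evict pods before the node itself starves
    systemReserved:
      memory: "1Gi"              # hold memory back for kubelet and system daemons

An eviction test should then verify that the memory-hog pod is actually evicted once memory.available falls below the threshold.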

--- Additional comment from Clayton Coleman on 2020-02-06 21:14:33 UTC ---

Once this is fixed we need to test against 4.3 and 4.2 and backport if it reproduces there - this can DoS a node.