Bug 1509289

Summary: [3.5] eviction manager sometimes evicts all pods
Product: OpenShift Container Platform
Reporter: Carsten Lichy-Bittendorf <clichybi>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED ERRATA
QA Contact: DeShuai Ma <dma>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.5.1
CC: aos-bugs, fcami, jokerman, mmccomas, sjenning, wjiang
Target Milestone: ---   
Target Release: 3.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where slow pod deletion on a node under eviction pressure could result in the eviction of all pods.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-12 05:59:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: logs where eviction behaves badly
Flags: none

Description Carsten Lichy-Bittendorf 2017-11-03 13:32:20 UTC
Created attachment 1347364
logs where eviction behaves badly

Description of problem:
Since the upgrade to OCP 3.5, the eviction manager sometimes deletes all pods from a node under memory pressure.

Version-Release number of selected component (if applicable):


How reproducible:

Nodes have this eviction setting in /etc/origin/node/node-config.yaml:

kubeletArguments:
  eviction-hard:
  - memory.available<20%
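
To make the threshold concrete, here is a minimal sketch of how a percentage threshold such as memory.available<20% can be read: the node counts as under pressure once available memory drops below 20% of capacity. The function names and the plain byte counts are assumptions for illustration only; this is not the kubelet's code, and the kubelet derives memory.available from its own node statistics.

```python
from typing import Tuple


def parse_threshold(expr: str) -> Tuple[str, float]:
    """Split a percentage threshold like 'memory.available<20%' into (signal, percent)."""
    signal, _, value = expr.partition("<")
    if not value.endswith("%"):
        raise ValueError("this sketch only handles percentage thresholds")
    return signal, float(value[:-1])


def under_pressure(available_bytes: int, capacity_bytes: int, expr: str) -> bool:
    """Return True when available memory is below the given percentage of capacity."""
    _, percent = parse_threshold(expr)
    # Compare in scaled units to avoid dividing.
    return available_bytes * 100 < capacity_bytes * percent
```

With this reading, a node with 19% of its memory available is under pressure and a node with exactly 20% available is not.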

Before the eviction:
See the files describe-node-before.txt and pods-before.txt.

Launch a memory hog on the node to push memory usage slightly over the threshold:
See the file Screenshot-2017-11-3 Grafana - PaaS cluster view(1).png at 11:13.


Actual results:
- On some occasions all pods get killed; see the files describe-node-after.txt, pods-after.txt and cdyi0544-journal.log.gz.

Expected results:
- Eviction should select a pod to kill to free memory, kill that pod, and stop processing once memory usage falls back below 80% (that is, once memory.available is at least 20% again).
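
The expected behavior can be sketched as a loop that evicts one pod, re-checks the threshold, and stops as soon as the node is relieved. This is an illustrative model, not the kubelet implementation: the pod names and numbers are hypothetical, choosing the largest consumer first is a simplification, and the sketch assumes memory is reclaimed immediately after each eviction. The Doc Text above indicates the bug appeared precisely when that assumption did not hold, that is, when pod deletion was slow.

```python
from typing import Dict, List


def evict_until_relieved(pod_usage: Dict[str, int], capacity: int,
                         min_available_percent: float = 20.0) -> List[str]:
    """Evict one pod at a time until available memory is back at or above the threshold.

    pod_usage maps hypothetical pod names to memory use; all values share one unit.
    Memory is assumed to be reclaimed immediately after each eviction.
    """
    remaining = dict(pod_usage)
    evicted: List[str] = []

    def available() -> int:
        return capacity - sum(remaining.values())

    # The loop condition mirrors "memory.available<20%".
    while remaining and available() * 100 < capacity * min_available_percent:
        # Simplification: pick the largest memory consumer first.
        victim = max(remaining, key=remaining.get)
        evicted.append(victim)
        del remaining[victim]
    return evicted
```

For example, with a capacity of 100 and pods using 30, 40 and 15 units, only the pod using 40 units is evicted, because available memory then rises from 15% to 55%.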

Additional info:
- Looks like a regression: it worked as expected on OCP 3.4 when last tested.
- The log cdyi0544-journal.log.gz contains many lines like the following, which are not seen in a good run:
W1103 11:18:20.779379   75247 eviction_manager.go:117] Failed to admit pod dc-springboot-1-ngvlm_springboot-sdev(5120a265-c080-11e7-8374-005056bf9134) - node has conditions: %v%!(EXTRA []api.NodeConditionType=[MemoryPressure])
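
To compare a bad run with a good one, the admission failures can be counted per pod. The sketch below is an assumption-laden helper, not part of any OpenShift tooling: the regular expression is derived only from the single sample line above, and the function names are hypothetical. The `%v%!(EXTRA ...)` fragment is how Go's fmt package renders an argument that has no matching verb, so it most likely reflects a formatting slip in the log call rather than a separate failure.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Pattern inferred from the one sample line in this report (an assumption).
ADMIT_FAILURE = re.compile(
    r"eviction_manager\.go:\d+\] Failed to admit pod "
    r"(?P<pod>[^_\s]+)_(?P<namespace>[^(\s]+)\((?P<uid>[0-9a-f-]+)\)"
)

SAMPLE = (
    r"W1103 11:18:20.779379   75247 eviction_manager.go:117] Failed to admit pod "
    r"dc-springboot-1-ngvlm_springboot-sdev(5120a265-c080-11e7-8374-005056bf9134)"
    r" - node has conditions: %v%!(EXTRA []api.NodeConditionType=[MemoryPressure])"
)


def parse_admit_failure(line: str) -> Optional[Dict[str, str]]:
    """Return pod, namespace and uid for an admission-failure line, else None."""
    match = ADMIT_FAILURE.search(line)
    return match.groupdict() if match else None


def count_admit_failures(lines: Iterable[str]) -> Counter:
    """Count admission failures per 'namespace/pod' across log lines."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_admit_failure(line)
        if parsed:
            counts[parsed["namespace"] + "/" + parsed["pod"]] += 1
    return counts
```

The journal attachment is gzip-compressed, so it could be fed in with `gzip.open(path, "rt")` from the standard library and passed directly to `count_admit_failures`.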

Comment 10 Seth Jennings 2017-11-06 14:42:45 UTC
OCP 3.6 PR:
https://github.com/openshift/ose/pull/920

Backporting to 3.5 is problematic since the code has changed quite a bit.

Comment 16 weiwei jiang 2018-01-25 08:50:31 UTC
Checked with # openshift version
openshift v3.6.173.0.96
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Eviction no longer evicts all pods at the same time, so marking this issue as verified.

Comment 19 errata-xmlrpc 2018-04-12 05:59:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106