Bug 1509289 - [3.5] eviction manager sometimes evicts all pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-03 13:32 UTC by Carsten Lichy-Bittendorf
Modified: 2018-04-12 05:59 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where slow pod deletion on a node under eviction pressure could result in the eviction of all pods.
Clone Of:
Environment:
Last Closed: 2018-04-12 05:59:07 UTC
Target Upstream Version:
Embargoed:


Attachments
logs where eviction behaves badly (110.10 KB, application/x-gzip)
2017-11-03 13:32 UTC, Carsten Lichy-Bittendorf


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2018:1106 | 0 | None | None | None | 2018-04-12 05:59:32 UTC

Description Carsten Lichy-Bittendorf 2017-11-03 13:32:20 UTC
Created attachment 1347364
logs where eviction behaves badly

Description of problem:
Since the upgrade to OCP 3.5, the eviction manager sometimes evicts all pods from a node under memory pressure.

Version-Release number of selected component (if applicable):


How reproducible:

Nodes have the following eviction setting in /etc/origin/node/node-config.yaml:

kubeletArguments:
  eviction-hard:
  - memory.available<20%
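
For readers unfamiliar with the setting, here is a minimal sketch of how a threshold such as memory.available<20% can be read. It assumes the percentage is taken relative to the node's memory capacity. The function names parse_threshold and under_pressure are hypothetical and only illustrate the arithmetic; this is not the kubelet's code.

```python
def parse_threshold(expr: str) -> tuple[str, float]:
    """Split an expression such as 'memory.available<20%' into (signal, fraction).

    Hypothetical helper: only the percentage form used in this report is handled.
    """
    signal, _, value = expr.partition("<")
    value = value.strip()
    if not value.endswith("%"):
        raise ValueError("this sketch only handles percentage thresholds")
    return signal.strip(), float(value.rstrip("%")) / 100.0


def under_pressure(expr: str, available_bytes: int, capacity_bytes: int) -> bool:
    """Return True when available memory is below the configured share of capacity.

    Assumption: the percentage refers to total node memory capacity.
    """
    _, fraction = parse_threshold(expr)
    return available_bytes < fraction * capacity_bytes
```

Under that assumption, a node with 10 GiB of memory crosses the threshold as soon as less than 2 GiB is available.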

Before the eviction:
See files describe-node-before.txt and pods-before.txt.

Launch a memory hog on the node to push memory usage slightly over the threshold:
See the file Screenshot-2017-11-3 Grafana - PaaS cluster view(1).png at 11:13.


Actual results:
- On some occasions all pods get killed; see files describe-node-after.txt, pods-after.txt and cdyi0544-journal.log.gz.

Expected results:
- The eviction manager should select a pod to kill to free memory, kill that pod, and stop processing once memory usage falls back below 80%.
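
The difference between the expected and the actual results can be illustrated with a toy model. This is only a sketch, assuming that pods are evicted one at a time in ranked order and that, when deletion is slow, the memory freed by an evicted pod is not yet visible on the next check. The function run_eviction and its parameters are hypothetical and do not reflect the kubelet's actual implementation.

```python
def run_eviction(pods, capacity, threshold_fraction, wait_for_deletion):
    """Toy eviction loop over pods given as (name, memory_used), already ranked.

    wait_for_deletion=True: freed memory is observed before the next check.
    wait_for_deletion=False: freed memory is never observed during the loop,
    which stands in for slow pod deletion.
    """
    used = sum(mem for _, mem in pods)
    observed_used = used
    evicted = []
    for name, mem in pods:
        available = capacity - observed_used
        # Stop as soon as the observed available memory is back above the threshold.
        if available >= threshold_fraction * capacity:
            break
        evicted.append(name)
        used -= mem
        if wait_for_deletion:
            observed_used = used
    return evicted


pods = [("a", 30), ("b", 30), ("c", 25)]
expected = run_eviction(pods, capacity=100, threshold_fraction=0.2, wait_for_deletion=True)
actual = run_eviction(pods, capacity=100, threshold_fraction=0.2, wait_for_deletion=False)
```

With these sample values, the first call evicts only pod "a", while the second call evicts all three pods because the loop never sees the pressure clear.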

Additional info:
- This looks like a regression: it worked as expected on OCP 3.4 when last tested.
- The log cdyi0544-journal.log.gz contains many lines like the following, which do not appear in a good run:
W1103 11:18:20.779379   75247 eviction_manager.go:117] Failed to admit pod dc-springboot-1-ngvlm_springboot-sdev(5120a265-c080-11e7-8374-005056bf9134) - node has conditions: %v%!(EXTRA []api.NodeConditionType=[MemoryPressure])

Comment 10 Seth Jennings 2017-11-06 14:42:45 UTC
OCP 3.6 PR:
https://github.com/openshift/ose/pull/920

Backporting to 3.5 is problematic since the code has changed quite a bit.

Comment 16 weiwei jiang 2018-01-25 08:50:31 UTC
Checked with:
# openshift version
openshift v3.6.173.0.96
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Eviction no longer evicts all pods at the same time, so marking this issue as verified.

Comment 19 errata-xmlrpc 2018-04-12 05:59:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106

