Bug 1542166 - eviction-soft never triggered because grace-period counter is reset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: ravig
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1542769
 
Reported: 2018-02-05 18:15 UTC by Chris Kim
Modified: 2018-04-12 06:03 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1542769
Environment:
Last Closed: 2018-04-12 06:02:11 UTC
Target Upstream Version:
Embargoed:


Attachments
stress-ng dc that will deploy a stress-ng pod. needs a privileged serviceaccount user with anyuid scc (1.44 KB, text/plain)
2018-02-05 18:16 UTC, Chris Kim
node-config.yaml (1.27 KB, text/plain)
2018-02-05 18:17 UTC, Chris Kim
output of top, output of free -m, atomic-openshift-node logs (20.84 KB, text/plain)
2018-02-05 18:19 UTC, Chris Kim


Links
Red Hat Knowledge Base (Solution) 3344931 (last updated 2018-02-06 04:50:12 UTC)
Red Hat Product Errata RHBA-2018:1106 (last updated 2018-04-12 06:03:12 UTC)

Description Chris Kim 2018-02-05 18:15:11 UTC
Description of problem:
When running a test pod with unbounded resource usage on a node that has the eviction-soft and eviction-soft-grace-period parameters specified for memory.available, the node never evicts the offending container because the grace-period counter is continuously reset.
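
For reference, soft eviction on OCP 3.6 is configured through kubeletArguments in node-config.yaml. A minimal sketch along these lines (the threshold and grace-period values here are illustrative; the actual values I used are in the attached node-config.yaml):

kubeletArguments:
  eviction-soft:                 # evict once memory.available stays below this threshold...
  - "memory.available<500Mi"
  eviction-soft-grace-period:    # ...for longer than this grace period
  - "memory.available=60s"
  eviction-max-pod-grace-period: # cap on the pod termination grace period during soft eviction
  - "30"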

Version-Release number of selected component (if applicable):
3.6

How reproducible:
Very. Create a problematic pod and let it use all of the available memory on the node. Examine the list of pods and the atomic-openshift-node logs and observe that the grace-period counter constantly restarts instead of expiring.

Steps to Reproduce:
1. Configure eviction-soft and eviction-soft-grace-period in node-config.yaml
2. Restart the atomic-openshift-node (aos-node) service so the parameters take effect (see the command sketch after this list)
3. Create a problematic pod scheduled to the node and watch the aos-node logs to see that the grace-period counter is constantly reset
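
For steps 2 and 3, the restart and log check are roughly the following (service name as used on OCP 3.x nodes; the grep pattern is just a convenience filter, not an exact log message):

# restart the node service so the new kubeletArguments take effect
systemctl restart atomic-openshift-node
# follow the node logs and watch for eviction manager messages
journalctl -u atomic-openshift-node -f | grep -i eviction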

Actual results:
Pod is not evicted from node. Grace period counter is constantly reset.

Expected results:
Pod should be evicted (or at least, SOME pod should be evicted to free up memory)

Additional info:
I'm attaching a DC that deploys a stress-ng container I created and hosted on Docker Hub. I'm also uploading my node-config.yaml, which shows how I configured soft eviction. For this pod to work, you need to create a service account called stress-ng-user with the anyuid SCC in the project (so that stress-ng runs privileged):

`oc create serviceaccount stress-ng-user`
`oc adm policy add-scc-to-user anyuid -z stress-ng-user`
`oc create -f stress-ng-dc.yaml`
`oc edit dc/stress-ng` (edit the --vm-bytes environment variable so that stress-ng consumes nearly all available memory on the machine, i.e. so that `free -m` "available" drops below the eviction-soft threshold)
`oc scale dc/stress-ng --replicas=1`
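
For context, the pod template in the attached DC is roughly of this shape (image reference, env variable name, and byte value below are placeholders, not the exact attachment contents):

spec:
  template:
    spec:
      serviceAccountName: stress-ng-user    # needs the anyuid SCC (see the oc adm policy command above)
      containers:
      - name: stress-ng
        image: docker.io/<user>/stress-ng   # placeholder; the real image is on Docker Hub
        args: ["--vm", "1", "--vm-bytes", "$(VM_BYTES)", "--vm-keep"]
        env:
        - name: VM_BYTES                    # placeholder name; this is the value edited via oc edit above
          value: "7G"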

Comment 1 Chris Kim 2018-02-05 18:16:55 UTC
Created attachment 1391681 [details]
stress-ng dc that will deploy a stress-ng pod. needs a privileged serviceaccount user with anyuid scc

Comment 2 Chris Kim 2018-02-05 18:17:54 UTC
Created attachment 1391682 [details]
node-config.yaml

Comment 3 Chris Kim 2018-02-05 18:19:57 UTC
Created attachment 1391684 [details]
output of top, output of free -m, atomic-openshift-node logs

Comment 6 Avesh Agarwal 2018-02-06 23:39:33 UTC
Origin PR:
https://github.com/openshift/origin/pull/18488

Comment 11 Avesh Agarwal 2018-02-07 00:41:09 UTC
Correct Origin PR:
https://github.com/openshift/origin/pull/18490

Comment 12 Avesh Agarwal 2018-02-08 19:14:53 UTC
OSE PR:
https://github.com/openshift/ose/pull/1055

Comment 14 weiwei jiang 2018-02-22 08:39:03 UTC
Checked with 
# openshift version 
openshift v3.6.173.0.104
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

and the issue cannot be reproduced.

Comment 17 errata-xmlrpc 2018-04-12 06:02:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106

