Description of problem:
When running a test unbounded-resource pod on a node with eviction-soft and eviction-soft-grace-period parameters specified for memory.available, the node never evicts the offending container because the grace-period counter is continuously reset.

Version-Release number of selected component (if applicable):
3.6

How reproducible:
Very. Create a problematic pod and let it consume all of the available memory. Examine the list of pods and the atomic-openshift-node logs and observe how the grace-period counter constantly restarts.

Steps to Reproduce:
1. Configure eviction-soft and eviction-soft-grace-period in node-config.yaml (a sketch of the relevant stanza follows this comment)
2. Restart the atomic-openshift-node service so the parameters take effect
3. Create a problematic pod scheduled to the node and observe the atomic-openshift-node logs to see that the grace-period counter is constantly reset

Actual results:
Pod is not evicted from the node. The grace-period counter is constantly reset.

Expected results:
The pod should be evicted (or at least SOME pod should be evicted to free up memory).

Additional info:
I'm attaching a DC that deploys a stress-ng container I created and hosted on Docker Hub. I'm also going to upload my node-config.yaml, which shows how I configured soft eviction. To allow this pod to work, you need to create a service account called stress-ng-user with the anyuid SCC in the project (so that stress-ng runs privileged):

```
oc create serviceaccount stress-ng-user
oc adm policy add-scc-to-user anyuid -z stress-ng-user
oc create -f stress-ng-dc.yaml
oc edit dc/stress-ng
# Edit the --vm-bytes environment variable to use all available memory on the
# machine, so that `free -m` reports "available" below the eviction-soft threshold.
oc scale dc/stress-ng --replicas=1
```
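For reference, a minimal sketch of the kubeletArguments stanza involved (the threshold and grace-period values here are illustrative placeholders, not the values from my attached node-config.yaml):

```yaml
kubeletArguments:
  eviction-soft:
  - "memory.available<500Mi"        # illustrative threshold
  eviction-soft-grace-period:
  - "memory.available=30s"          # illustrative grace period
```

With a stanza like this in place, once memory.available stays below the threshold for the full grace period, the node should begin soft eviction; the bug is that the grace-period timer keeps resetting, so eviction never fires.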
Created attachment 1391681 [details] stress-ng DC that deploys a stress-ng pod; needs a privileged service account with the anyuid SCC
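Since the attachment itself isn't reproduced inline, here is a rough sketch of what such a DC looks like; the image name, memory size, and stress-ng arguments are placeholders, not the contents of the attached file (this also assumes the image's entrypoint is stress-ng):

```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: stress-ng
spec:
  replicas: 0                       # scaled up to 1 after editing --vm-bytes
  selector:
    app: stress-ng
  template:
    metadata:
      labels:
        app: stress-ng
    spec:
      serviceAccountName: stress-ng-user      # needs the anyuid SCC
      containers:
      - name: stress-ng
        image: docker.io/<namespace>/stress-ng  # placeholder; real image is in the attachment
        env:
        - name: VM_BYTES
          value: "4g"               # set to exceed the node's free memory
        args: ["--vm", "1", "--vm-bytes", "$(VM_BYTES)", "--vm-hang", "0"]
```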
Created attachment 1391682 [details] node-config.yaml
Created attachment 1391684 [details] output of top, output of free -m, atomic-openshift-node logs
Origin PR: https://github.com/openshift/origin/pull/18488
Correct Origin PR: https://github.com/openshift/origin/pull/18490
OSE PR: https://github.com/openshift/ose/pull/1055
Checked with:

```
# openshift version
openshift v3.6.173.0.104
kubernetes v1.6.1+5115d708d7
etcd 3.2.1
```

and the issue cannot be reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106