Description of problem:
When running a test unbounded-resource pod on a node with eviction-soft and eviction-soft-grace-period parameters specified for memory.available, the node never evicts the offending container because the grace-period counter is continuously reset.

Version-Release number of selected component (if applicable):
3.6

How reproducible:
Very. Create a problematic pod and let it consume all of the available memory. Examine the list of pods and the atomic-openshift-node logs and observe how the grace-period counter constantly restarts.

Steps to Reproduce:
1. Configure eviction-soft and eviction-soft-grace-period in node-config.yaml (a sketch of the relevant stanza follows this comment)
2. Restart the atomic-openshift-node service so the parameters take effect
3. Create a problematic pod scheduled to the node and observe the atomic-openshift-node logs to see that the grace-period counter is constantly reset

Actual results:
Pod is not evicted from the node. The grace-period counter is constantly reset.

Expected results:
The pod should be evicted (or at least SOME pod should be evicted to free up memory).

Additional info:
I'm attaching a DC that deploys a stress-ng container I created and hosted on Docker Hub. I'm also going to upload my node-config.yaml, which shows how I configured soft eviction. To allow this pod to work, you need to create a service account called stress-ng-user with the anyuid SCC in the project (so that stress-ng runs privileged):

```
oc create serviceaccount stress-ng-user
oc adm policy add-scc-to-user anyuid -z stress-ng-user
oc create -f stress-ng-dc.yaml
oc edit dc/stress-ng
# Edit the --vm-bytes environment variable to use all available memory on the
# machine, so that `free -m` reports "available" below the eviction-soft threshold.
oc scale dc/stress-ng --replicas=1
```
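For reference, a minimal sketch of the kubeletArguments stanza involved (the threshold and grace-period values here are illustrative placeholders, not the values from my attached node-config.yaml):

```yaml
kubeletArguments:
  eviction-soft:
  - "memory.available<500Mi"        # illustrative threshold
  eviction-soft-grace-period:
  - "memory.available=30s"          # illustrative grace period
```

With a stanza like this in place, once memory.available stays below the threshold for the full grace period, the node should begin soft eviction; the bug is that the grace-period timer keeps resetting, so eviction never fires.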
Created attachment 1391681 [details] stress-ng DC that deploys a stress-ng pod; needs a privileged service account with the anyuid SCC
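Since the attachment itself isn't reproduced inline, here is a rough sketch of what such a DC looks like; the image name, memory size, and stress-ng arguments are placeholders, not the contents of the attached file (this also assumes the image's entrypoint is stress-ng):

```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: stress-ng
spec:
  replicas: 0                       # scaled up to 1 after editing --vm-bytes
  selector:
    app: stress-ng
  template:
    metadata:
      labels:
        app: stress-ng
    spec:
      serviceAccountName: stress-ng-user      # needs the anyuid SCC
      containers:
      - name: stress-ng
        image: docker.io/<namespace>/stress-ng  # placeholder; real image is in the attachment
        env:
        - name: VM_BYTES
          value: "4g"               # set to exceed the node's free memory
        args: ["--vm", "1", "--vm-bytes", "$(VM_BYTES)", "--vm-hang", "0"]
```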
Created attachment 1391682 [details] node-config.yaml
Created attachment 1391684 [details] output of top, output of free -m, atomic-openshift-node logs
Origin PR: https://github.com/openshift/origin/pull/18488
Correct Origin PR: https://github.com/openshift/origin/pull/18490
OSE PR: https://github.com/openshift/ose/pull/1055
Checked with:

```
# openshift version
openshift v3.6.173.0.104
kubernetes v1.6.1+5115d708d7
etcd 3.2.1
```

and the issue cannot be reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106