Bug 1319700 - Support for pod rescheduler to ensure cluster stays balanced and shows proper status of the pods.
Summary: Support for pod rescheduler to ensure cluster stays balanced and shows proper...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Derek Carr
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1314624 (view as bug list)
Depends On:
Blocks: 1267746
TreeView+ depends on / blocked
 
Reported: 2016-03-21 11:06 UTC by Jaspreet Kaur
Modified: 2021-12-10 14:36 UTC (History)
20 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-11 21:17:06 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2141631 0 None None None 2017-12-08 23:53:38 UTC
Red Hat Knowledge Base (Solution) 2212441 0 None None None 2021-09-09 11:47:49 UTC

Description Jaspreet Kaur 2016-03-21 11:06:49 UTC
3. What is the nature and description of the request? 
When the OutOfMemoryKiller of RHEL kills processes on a node (e.g. a pod is killed and is not running anymore), this should be reflected in 'oc get pod --all-namespaces -o wide'; currently it still shows Running, although the pod is down.


As an admin of the cluster, I expect that if my nodes are overloaded, pods are gradually spread to other nodes.

Doing some test using nexus:
- oc new-project test-x
- oc new-app sonatype/nexus
- (strategy BestEffort, no resource requests, no limits)

As the memory footprint of nexus is fairly high, I quickly reached the memory limit of the node I was using, which has 32 GB of memory available.

In the end (at around 55-60 pods), I was in the following situation:
- most of the pods were not running anymore, as the OutOfMemoryKiller of Linux was killing processes
(/var/log/messages showed many of Mar 16 13:45:32 ip-10-191-1-114 kernel: Out of memory: Kill process 120303 (java) score 1016 or sacrifice child)
- docker was in an unusable state
(docker ps was not showing anything anymore, not returning a result, deleting a pod ended in the pod staying on Terminating, docker restart required...)
- OpenShift (or kubernetes) still showed that everything was running
(oc get pod --all-namespaces -o wide -> all pods Running)

So the environment got into an unstable state. This should not be possible.

Issue1
oc get pod --all-namespaces -o wide should reflect the truth. Although pods were not running anymore, the command shows that everything is running.

Issue2
It should not be possible to schedule a pod to a node when no more resources are available (e.g. no memory left, or 90% of the available memory is already in use). Verify that a cluster at high utilization (> 90%) sees pods moved to a newly added node after rebalancing occurs.
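Part of Issue 2 can already be addressed today: if pods declare resource requests, the scheduler subtracts them from the node's allocatable resources and refuses placement on a full node. A minimal sketch for the nexus pod from the test above (pod name and sizes are illustrative, not recommendations):

```shell
# Hypothetical pod definition with memory requests/limits so the scheduler
# can account for the pod; values are examples only.
cat > nexus-resources.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nexus
spec:
  containers:
  - name: nexus
    image: sonatype/nexus
    resources:
      requests:
        memory: "1Gi"   # scheduler counts this against the node's capacity
      limits:
        memory: "2Gi"   # container is OOM-killed in isolation above this
EOF
```

The file would then be created with e.g. `oc create -f nexus-resources.yaml`. Note this only prevents overcommit for pods that set requests; the best-effort pods used in the reproducer bypass it.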


4. Why does the customer need this? (List the business requirements here)

Nobody would recognize or expect that a pod is down if it still shows Running. The customer needs effective management of the cluster, with proper evacuation of pods based on the resources available in the cluster.

5. How would the customer like to achieve this? (List the functional requirements here) 
I don't know the details of the current implementation, but I guess some polling of the pods should be done to see whether the state in etcd corresponds with reality.

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented. 
On a node with 32GB memory, create about 55-60 projects with a running nexus (oc new-app sonatype/nexus). When the OOM kicks in, check what oc get pod shows.

7. Is there already an existing RFE upstream or in Red Hat Bugzilla? 

No

8. Does the customer have any specific timeline dependencies and which release would they like to target ? 
asap

9. Is the sales team involved in this request and do they have any additional input? Red Hat Consultant on site, account team fully aware of the request. 
?

10. List any affected packages or components. 
?

11. Would the customer be able to assist in testing this functionality if implemented? 
yes

Comment 2 Dan McPherson 2016-04-13 19:51:22 UTC
*** Bug 1314624 has been marked as a duplicate of this bug. ***

Comment 6 Derek Carr 2016-05-09 15:10:14 UTC
A pod with best effort quality of service is able to consume as much memory as is available on the node.  Running large numbers of best effort pods on a node increases the risk of inducing a system OOM as the scheduler is not placing the pods on nodes with any understanding of their potential resource requirements.

Each container in a pod is given an OOM_SCORE_ADJ value that is evaluated in response to an OOM event on the node to determine which containers to kill to reclaim memory.  The value range is -1000 to 1000; the higher the number, the more likely the process is to be targeted by the oom_killer.  Best-effort pods are given an OOM_SCORE_ADJ of 1000, so they are targeted first in response to OOM events.  Guaranteed processes are given a score of -998.  Burstable containers (which make a request and an optional limit) are scored in the range of 2-999 based on how much memory the container is consuming relative to its request.  So if a container is under its request, it will have a lower value, and if a container is over its request, it will have a higher value.  This means the oom_killer will target best-effort, burstable, and guaranteed containers in that order.  System daemons (docker, openshift-node) should have an OOM_SCORE_ADJ of -999 so they are not targeted.  An OOM event can make the node unstable for an extended period of time as it starves CPU while reclaiming memory.
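The burstable scoring described above can be sketched numerically. This assumes the kubelet formula is roughly adj = 1000 - (1000 * request) / capacity, clamped into the burstable range (paraphrased from the description above; the exact upstream clamping may differ, and the values below are examples):

```shell
# Illustrative burstable oom_score_adj calculation (not the kubelet source).
request_mib=4096      # pod requests 4 GiB of memory
capacity_mib=32768    # node has 32 GiB of memory
adj=$(( 1000 - (1000 * request_mib) / capacity_mib ))

# Clamp into the burstable range 2..999 so the score never collides with
# guaranteed (-998) or best-effort (1000) containers.
if [ "$adj" -lt 2 ]; then adj=2; fi
if [ "$adj" -gt 999 ]; then adj=999; fi

echo "$adj"   # → 875: the larger the request, the lower the score,
              #   and the later the oom_killer picks this container
```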

When a container is killed by the oom_killer, the container may be restarted based on the restart policy on the pod definition.  If the restart policy is always, the container will just restart, and the pod will continue to report running status.

There is work planned for OpenShift 3.3 and Kubernetes 1.3 to support evictions when the node is reaching resource pressure conditions as documented here: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/kubelet-eviction.md

With the above feature, the node will monitor available memory against an admin-defined threshold and attempt to evict pods (i.e. fail a pod) while memory is under pressure, before a system OOM is induced.
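In OCP 3.x those triggers would be set through kubeletArguments in the node configuration. A hedged sketch, using the eviction signal names from the upstream proposal (the thresholds here are illustrative, not recommendations):

```shell
# Hypothetical node-config.yaml fragment enabling memory-based eviction.
# Flag names follow the upstream kubelet-eviction proposal; values are examples.
cat > node-config-snippet.yaml <<'EOF'
kubeletArguments:
  eviction-hard:                   # evict immediately below this
  - "memory.available<100Mi"
  eviction-soft:                   # evict after the grace period below this
  - "memory.available<500Mi"
  eviction-soft-grace-period:
  - "memory.available=1m"
EOF
```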

Comment 11 Eric Rich 2017-02-08 17:37:54 UTC
Should this not be solved by https://docs.openshift.com/container-platform/3.4/admin_guide/out_of_resource_handling.html ?

Comment 27 Bryan Yount 2018-07-31 00:15:34 UTC
The descheduler is Tech Preview in OpenShift 3.10:

https://docs.openshift.org/latest/admin_guide/scheduling/descheduler.html
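The descheduler covers the "rebalance at high utilization" part of the original request. A hedged sketch of a policy using the LowNodeUtilization strategy (strategy and field names from the descheduler docs linked above; the threshold numbers are illustrative):

```shell
# Hypothetical descheduler policy: evict pods from overutilized nodes so the
# scheduler can respread them onto underutilized ones (values are examples).
cat > descheduler-policy.yaml <<'EOF'
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below these counts as underutilized
          memory: 20
          pods: 20
        targetThresholds:    # nodes above these counts as overutilized
          memory: 70
          pods: 70
EOF
```

Note the descheduler only evicts; it relies on the regular scheduler (and resource requests on the pods) to place the evicted pods somewhere better.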

Comment 32 Rory Thrasher 2019-06-11 21:17:06 UTC
Red Hat is moving OpenShift feature requests to a new JIRA RFE system. This bz (RFE) has been identified as a feature request which is still being evaluated and has been moved.

As the new Jira RFE system is not yet public, Red Hat Support can help answer your questions about your RFEs via the same support case system.

https://jira.coreos.com/browse/RFE-168

