Bug 1392980

Summary: Limit the number of pods with the starting state on a node
Product: OpenShift Container Platform
Component: RFE
Version: 3.2.1
Reporter: Frederic Giloux <fgiloux>
Assignee: Derek Carr <decarr>
QA Contact: Xiaoli Tian <xtian>
Docs Contact:
CC: aos-bugs, jeder, jmencak, jokerman, mmccomas, tkatarki
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-18 19:55:23 UTC
Type: Bug

Description Frederic Giloux 2016-11-08 15:27:19 UTC
Description of problem:

Tomcat and other Java applications are far more CPU intensive at startup than during normal operation. When a node dies or is evacuated, many such applications may be restarted at once on the other nodes. In that situation startup takes significantly longer (2-3 minutes instead of 30s), with the consequence that the readiness probes fail and the pods get restarted. The expiration period of the readiness check could be changed by hand when a node needs to be evacuated, but that is a manual process we would like to avoid, and it is not possible at all when a node dies. The issue may go away once additional nodes are provisioned (better distribution of the load), but it would be nice to have a way to limit the number of "starting" pods on a node.
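For illustration, a minimal sketch of the manual workaround mentioned above: relaxing the readiness-probe timing so a slow, contended startup is tolerated. This uses the k8s.io/api Go types; the /health path, port and timing values are assumptions for the example, not values from this report, and the ProbeHandler field name assumes a recent k8s.io/api release.

package startup

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// tomcatReadinessProbe returns a readiness probe that tolerates the slower
// startup seen when many Tomcat pods start on the same node at once.
func tomcatReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		// Field name assumes k8s.io/api v0.23+ (older releases embed Handler instead).
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/health", // hypothetical health endpoint of the Tomcat app
				Port: intstr.FromInt(8080),
			},
		},
		InitialDelaySeconds: 30, // a lone instance is ready in ~30s
		PeriodSeconds:       10,
		FailureThreshold:    18, // keep probing for ~3 more minutes before marking the pod unready
	}
}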

Version-Release number of selected component (if applicable):

3.2.1.13 and 3.2.1.15 

How reproducible:
Have a small cluster with two nodes and several pods with Tomcat running on them, then evacuate one node.

Steps to Reproduce:
1. Have a small cluster with 2 nodes
2. Start a Tomcat application (quickstart)
3. Scale to a single Tomcat instance and take note of the time it needs to start
4. Scale up so that several instances are running
5. Evacuate one of the nodes

Actual results:

The readiness probes fail and the pods get restarted.

Expected results:

The ability to limit the number of pods starting at the same time on the remaining node, so that additional pods are only started once the first ones are running successfully.
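A minimal sketch of the requested behaviour, assuming hypothetical startPod/waitUntilReady hooks and an arbitrary cap of 3: a counting semaphore only admits another pod start once an earlier pod has become ready (or has failed).

package startup

import "context"

const maxStarting = 3 // hypothetical per-node cap on pods in the "starting" state

// startPodsWithCap starts pods but never lets more than maxStarting of them be
// in the starting (not yet ready) state at the same time.
func startPodsWithCap(ctx context.Context, pods []string,
	startPod func(context.Context, string) error,
	waitUntilReady func(context.Context, string) error) error {

	slots := make(chan struct{}, maxStarting) // counting semaphore

	for _, pod := range pods {
		select {
		case slots <- struct{}{}: // acquire a "starting" slot
		case <-ctx.Done():
			return ctx.Err()
		}

		pod := pod
		go func() {
			defer func() { <-slots }() // free the slot once the pod is ready or has failed
			if err := startPod(ctx, pod); err != nil {
				return
			}
			_ = waitUntilReady(ctx, pod) // readiness is what gates the next start
		}()
	}
	return nil
}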

Comment 2 Jeremy Eder 2016-11-14 15:43:11 UTC
We've built into our cluster-loader utility something called a tuningset, which is a way of enforcing some "pacing" on clients.

A tuningset is a way to set intervals and rates so that we can load at maximum speed while keeping the system stable.
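A minimal sketch of this kind of pacing. It is not the actual cluster-loader tuningset format; the function name, step size and pause are illustrative only.

package pacing

import (
	"context"
	"time"
)

// createWithStepping creates items in steps of stepSize and sleeps for pause
// between steps, so the cluster is loaded at a steady, bounded rate.
func createWithStepping(ctx context.Context, items []string,
	create func(context.Context, string) error,
	stepSize int, pause time.Duration) error {

	for i, item := range items {
		if err := create(ctx, item); err != nil {
			return err
		}
		if (i+1)%stepSize == 0 { // end of a step: let the node and runtime catch up
			select {
			case <-time.After(pause):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}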

We had to do this in OpenShift v2 as well, but v3 is even worse in terms of parallelism. In the case of container creation, most of the failures or fragility can be pinned on docker.

We're prototyping a way to measure the current "busy-ness" of docker by reading its API, and to use that as auto-tuning backpressure in our client. That way we can load as fast as docker can safely go. I don't yet know whether docker exposes the features we need, and it may also not be the only source of information we need.

It might be beneficial to look not only at docker but at the system resource profile as well, potentially detecting storage I/O saturation and pacing (queuing) client requests.  Essentially we need a way to "protect" docker (and any other runtime) from Kubernetes.  Amazon does this by rate-limiting their API to protect their control plane.
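A rough sketch of such backpressure, with the busyness probe left abstract: it could be fed from the Docker API or from system I/O metrics, but nothing below is an existing Docker or Kubernetes call. The idea is simply to wait until the reported load drops below a threshold before issuing the next request.

package pacing

import (
	"context"
	"time"
)

// withBackpressure delays a request while the runtime reports a load above
// threshold. busyness is deliberately abstract: it could read the Docker API,
// storage I/O saturation, or any other signal of how busy the runtime is.
func withBackpressure(ctx context.Context,
	busyness func(context.Context) (float64, error),
	threshold float64, poll time.Duration,
	do func(context.Context) error) error {

	for {
		load, err := busyness(ctx)
		if err == nil && load <= threshold {
			break // the runtime has headroom: go ahead
		}
		select {
		case <-time.After(poll): // busy (or unreadable): wait and re-check
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return do(ctx)
}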

Comment 3 Derek Carr 2016-12-12 21:16:40 UTC
This is an RFE to rate-limit, via QPS, the number of container start operations made from the kubelet to the container runtime.
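A minimal sketch of what such a QPS limit could look like, using a token-bucket limiter from golang.org/x/time/rate. startContainer and the QPS/burst values are hypothetical; this is not an existing kubelet flag or API.

package startup

import (
	"context"

	"golang.org/x/time/rate"
)

const (
	startQPS   = 2 // hypothetical: at most 2 container starts per second
	startBurst = 5 // allow a small burst before the limiter kicks in
)

var startLimiter = rate.NewLimiter(rate.Limit(startQPS), startBurst)

// rateLimitedStart blocks until the token bucket allows another container
// start, then hands the request on to the runtime.
func rateLimitedStart(ctx context.Context, containerID string,
	startContainer func(context.Context, string) error) error {

	if err := startLimiter.Wait(ctx); err != nil {
		return err // context cancelled or deadline exceeded while waiting
	}
	return startContainer(ctx, containerID)
}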

Comment 5 Tushar Katarki 2019-04-18 19:55:23 UTC
I think this request has been discussed thoroughly upstream. See https://github.com/kubernetes/kubernetes/issues/3312

It points to some best practices, as well as other features and issues, that can address this problem.

I don't think there is any other ongoing work upstream in this problem space.