Bug 1415196

Summary: Rolling deployments fail when Quota Limit is reached
Product: OpenShift Container Platform
Component: openshift-controller-manager
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Will Gordon <wgordon>
Assignee: Michal Fojtik <mfojtik>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, mfojtik, wgordon
Status: CLOSED EOL
Severity: low
Priority: low
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2019-08-23 12:50:18 UTC

Description Will Gordon 2017-01-20 14:27:23 UTC
Description of problem:
   My current deployment is configured with a maximum of 25% of pods unavailable during a rolling deployment. I currently have 8 pods at 250MB, and when I modify the config to increase the memory limit to 251MB, no pods are scaled down to accommodate the rolling deployment, even though I would expect that given the description of Maximum Number of Unavailable Pods ("The maximum number of pods that can be unavailable during the rolling deployment").


Version-Release number of selected component (if applicable):
   v3.4.0.39 (online version 3.4.0.13)


How reproducible:
   Every time


Steps to Reproduce:
1. Scale up pods to a maximum number given quota limits
2. Perform a deployment config change
3. Observe that pods do not scale down to accommodate the rolling deployment


Actual results:
   Pods do not scale down by 25% to accommodate the rolling deployment

Expected results:
   Pods should scale down from 8 to 6 to allow for 2 pods to scale up with the new config


Additional info:
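For reference, the 25% figure corresponds to the rolling strategy parameters on the deployment config. A minimal sketch of that stanza follows; only the 25% / 8-replica numbers come from this report, the rest is illustrative:

    # DeploymentConfig excerpt (illustrative sketch)
    spec:
      replicas: 8
      strategy:
        type: Rolling
        rollingParams:
          maxUnavailable: 25%   # 25% of 8 replicas = 2 pods may be taken down
          maxSurge: 25%         # default; 25% of 8 replicas = 2 extra pods may be created

With both parameters at 25%, the deployer is allowed to run up to 10 pods during the rollout, which is what collides with the project quota.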

Comment 1 Michal Fojtik 2017-01-20 15:09:56 UTC
Can you please share more details about your deployment? Did your deployment complete or error out? How did the pods scale down (by one?)

Can you also provide events from the namespace/deployment?

Comment 2 Will Gordon 2017-01-20 18:01:08 UTC
My deployment consists of a custom docker image with httpd serving static files. The image is already stored in the project, and scaling up/down works without issue. My understanding of rolling deployments was that whenever a new deployment is triggered, pods are supposed to scale down (within the limits, in this case only 25% at a time) before the new pods are scaled up.

However, this does not seem to be the case when hitting quota limits. When the quota limit is reached, the rolling deployment attempts to deploy the new pods without first scaling down the available pods.

Actual results:
8 "a" pods (8) -> 8 "a" pods & 2 "b" pods (10) -> 6 "a" pods & 2 "b" pods (8)

Expected results:
8 "a" pods (8) -> 6 "a" pods (6) -> 6 "a" pods & 2 "b" pods (8)


The deployment (after the config change) will fail with an error.

The associated log messages:

--> Scaling up openshift-5 from 0 to 8, scaling down openshift-4 from 8 to 0 (keep 6 pods available, don't exceed 10 pods)
    Scaling openshift-5 up to 2
-->  FailedCreate: openshift-5 Error creating: pods "openshift-5-" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=490m,limits.memory=251Mi, used: limits.cpu=3904m,limits.memory=2000Mi, limited: limits.cpu=4,limits.memory=2Gi
-->  FailedCreate: openshift-5 Error creating: pods "openshift-5-" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=490m,limits.memory=251Mi, used: limits.cpu=3904m,limits.memory=2000Mi, limited: limits.cpu=4,limits.memory=2Gi
-->  FailedCreate: openshift-4 Error creating: pods "openshift-4-" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=488m,limits.memory=250Mi, used: limits.cpu=3904m,limits.memory=2000Mi, limited: limits.cpu=4,limits.memory=2Gi
-->  FailedCreate: openshift-4 Error creating: pods "openshift-4-" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=488m,limits.memory=250Mi, used: limits.cpu=3904m,limits.memory=2000Mi, limited: limits.cpu=4,limits.memory=2Gi
error: timed out waiting for "openshift-5" to be synced
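
For context, a quota of the shape shown in those errors would look roughly like the sketch below; it is reconstructed from the logged values (name "compute-resources", limits.cpu=4, limits.memory=2Gi), so the actual object in the project may differ:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-resources
    spec:
      hard:
        limits.cpu: "4"       # matches "limited: limits.cpu=4" in the errors above
        limits.memory: 2Gi    # matches "limits.memory=2Gi" in the errors above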

Comment 3 Michal Fojtik 2017-02-02 14:02:03 UTC
// @kargakis

Deployments have always worked like this. The problem seems to be that we always scale up first when the user uses maxSurge, so the new replication controller is scaled up, its pods cannot be created, and the rc observed generation is not updated.

Comment 4 Michal Fojtik 2017-02-02 14:04:39 UTC
QE: Can you please verify if this is still an issue on the latest Origin?

Comment 5 Michail Kargakis 2017-02-02 14:47:27 UTC
The deployment will always scale up first if you use maxSurge. When you are constrained by quota, it is recommended to set maxSurge to zero and solely use maxUnavailable.
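
A minimal sketch of that configuration, assuming the standard OpenShift 3.x DeploymentConfig rolling strategy fields:

    spec:
      strategy:
        type: Rolling
        rollingParams:
          maxSurge: 0            # never run more pods than the current replica count
          maxUnavailable: 25%    # scale the old rc down first, up to 25% at a time

With maxSurge set to 0 the deployer scales the old replication controller down before scaling the new one up, so the rollout never asks for more resources than the quota already accommodates.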

Comment 6 Michail Kargakis 2017-02-02 14:56:09 UTC
Correction:

> the rc observed generation is not updated.

The rc observedGeneration is synced; the problem seems to be that the underlying scaler is waiting for the created replicas to match the desired replicas.