Bug 1527784

Summary: [Free-INT] Deployment failed because the pre-hook pod failed: exceeded quota
Product: OpenShift Online Reporter: yufchang <yufchang>
Component: Website Assignee: Abhishek Gupta <abhgupta>
Status: NEW --- QA Contact: yufchang <yufchang>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.x CC: abhgupta, haowang, jupierce, yufchang
Target Milestone: --- Keywords: OnlineStarter
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
events none

Description yufchang 2017-12-20 06:03:05 UTC
Created attachment 1370290 [details]
events

Description:
Deployment failed because the pre-hook pod could not be created: exceeded quota

Version-Release number of selected component (if applicable):
free-int v3.8.18 (online version 3.6.0.83)

How reproducible:
always

Steps to Reproduce:
1. oc new-app --template=cakephp-mysql-persistent
2. Check the deployment

Actual results:
Deployment failed:
2017-12-20 12:54:32 +0800 CST   2017-12-20 12:54:32 +0800 CST   1         cakephp-mysql-persistent.1501e7ab308bb005            DeploymentConfig                                                 Warning   Failed                           cakephp-mysql-persistent-2-deploy       The pre-hook failed: couldn't create lifecycle pod for cakephp-mysql-persistent-2: pods "cakephp-mysql-persistent-2-hook-pre" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi, aborting rollout of cindytest/cakephp-mysql-persistent-2

Expected results:
Deployment succeeds

Additional info:
oc get events:
cakephp-mysql-persistent   Deployment Config   Warning   Failed Create   Error creating deployer pod: pods "cakephp-mysql-persistent-1-deploy" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
6 times in the last 11 minutes

Comment 2 Dan Mace 2018-01-05 19:46:37 UTC
I'm not seeing any evidence of a product bug here (e.g. in the quota system). Here is what I am seeing:

* The cakephp-mysql-persistent template creates a build and two deployments via new-app
* The NotTerminating resource quota configured in free-int restricts limits.cpu=2 and limits.memory=1Gi
* The default limits established by the LimitRange deployed to free-int are cpu=1 and memory=512Mi

That is a lot of transient containers in flight: 3 for the app build, 1 for the app deployment, 1 for the mysql deployment. There is also 1 persistent container to compete with (mysql). The namespace seems to constantly skirt the quota limits. Sometimes the timing of pod Terminating transitions and quota recalculation works out and the whole process succeeds, sometimes it doesn't.
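
To make the arithmetic concrete, here is a rough sketch of the check implied by the "exceeded quota" message. It is not taken from the quota controller; the names Resources, exceeds_quota and max_concurrent_pods are made up for illustration, and it assumes every pod picks up the LimitRange default of cpu=1 and memory=512Mi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for limits.cpu and limits.memory."""
    cpu_millicores: int
    memory_mib: int


# Values quoted in this report: quota of limits.cpu=2 / limits.memory=1Gi,
# LimitRange defaults of cpu=1 / memory=512Mi.
HARD = Resources(cpu_millicores=2000, memory_mib=1024)
DEFAULT_LIMIT = Resources(cpu_millicores=1000, memory_mib=512)


def exceeds_quota(used: Resources, requested: Resources, hard: Resources) -> bool:
    """Return True if admitting `requested` would push usage past `hard`."""
    return (
        used.cpu_millicores + requested.cpu_millicores > hard.cpu_millicores
        or used.memory_mib + requested.memory_mib > hard.memory_mib
    )


def max_concurrent_pods(hard: Resources, per_pod: Resources) -> int:
    """How many pods with identical limits fit under the quota at once."""
    return min(
        hard.cpu_millicores // per_pod.cpu_millicores,
        hard.memory_mib // per_pod.memory_mib,
    )


# The failing case from the event log: usage is already at the quota.
used_in_event = Resources(cpu_millicores=2000, memory_mib=1024)
rejected = exceeds_quota(used_in_event, DEFAULT_LIMIT, HARD)
slots = max_concurrent_pods(HARD, DEFAULT_LIMIT)
print(f"rejected={rejected}, pods that fit at once={slots}")
```

Under these assumptions only two default-sized pods fit at any moment, so the third concurrent pod (deployer, hook, or anything else) gets rejected until one of the others is released.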

I think this is the timeline for the failure mode represented in the attached event log:

1. App build container created
2. App build container created
3. App build container created
4. mysql deployer container created
5. App deployer container fails to create due to quota (but is retryable)
6. mysql app container created
7. App deployer pod container created (due to retry)
8. App deployer pre-hook container fails to create, which is fatal to the deployment (retry still doesn't seem to be supported for pre hooks with the recreate strategy)
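
The difference between steps 5 and 8 can be sketched as a toy model. This is only an illustration of retryable versus non-retryable creation under a two-slot quota; QUOTA_SLOTS, try_create and create_pod are hypothetical names, and the usage figures passed in are assumed, not read from the attached event log.

```python
# Hypothetical model: each pod needs one "slot" and the quota allows two
# (limits.cpu=2 with a default limit of cpu=1 per pod).
QUOTA_SLOTS = 2


def try_create(used_slots: int) -> bool:
    """A single admission attempt: succeeds only if a slot is free."""
    return used_slots + 1 <= QUOTA_SLOTS


def create_pod(usage_per_attempt: list, retryable: bool) -> str:
    """Attempt creation against the usage observed at each attempt.

    usage_per_attempt[i] is the (assumed) number of slots in use when
    attempt i+1 happens. A non-retryable creation only gets attempt 1.
    """
    attempts = usage_per_attempt if retryable else usage_per_attempt[:1]
    for attempt, used in enumerate(attempts, start=1):
        if try_create(used):
            return f"created on attempt {attempt}"
    return "failed"


# Deployer pod: quota is full at first, a slot frees up before the retry.
deployer_result = create_pod([2, 1], retryable=True)

# Pre-hook pod: same usage pattern, but the first failure is fatal.
hook_result = create_pod([2, 1], retryable=False)

print("deployer:", deployer_result)
print("pre-hook:", hook_result)
```

With identical quota pressure the deployer pod eventually gets in, while the pre-hook pod aborts the rollout on its first rejection.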

My sense is that the problem lies in the balance of cluster quota, limit range defaulting, possibly ClusterResourceOverrideConfig, and the app template itself, which together need to be tuned to provide a consistent user experience.

Is there some evidence of a core product bug I have missed?

Comment 3 Dan Mace 2018-01-05 20:17:10 UTC
I'd like for Abhishek to review my assessment and consider reassigning this to an Online component.

Comment 5 Abhishek Gupta 2018-01-23 23:29:43 UTC
Dan Mace and I had briefly discussed this on IRC and I agree with his assessment. I don't think the scheduler is doing something different for this template (3.7 vs 3.8). 

I would like to know if this is actually a regression. Can QE please confirm if this template was consistently being deployed on first attempt with OCP 3.7 and is now failing (at least 50% of the time) to deploy successfully at the first attempt with OCP 3.8 / 3.9?

Comment 8 yufchang 2018-08-29 03:38:39 UTC
This also exists in free-int:

OpenShift Master:    v3.11.0-0.21.0 
Kubernetes Master:    v1.11.0+d4cacc0 
OpenShift Web Console:     v3.11.0-0.21.0