Created attachment 1370290
events

Description:
Deployment failed because the pre-hook pod could not be created: exceeded quota.

Version-Release number of selected component (if applicable):
free-int v3.8.18 (online version 3.6.0.83)

How reproducible:
Always

Steps to Reproduce:
1. oc new-app --template=cakephp-mysql-persistent
2. Check the deployment

Actual results:
Deployment failed:

2017-12-20 12:54:32 +0800 CST  2017-12-20 12:54:32 +0800 CST  1  cakephp-mysql-persistent.1501e7ab308bb005  DeploymentConfig  Warning  Failed  cakephp-mysql-persistent-2-deploy  The pre-hook failed: couldn't create lifecycle pod for cakephp-mysql-persistent-2: pods "cakephp-mysql-persistent-2-hook-pre" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi, aborting rollout of cindytest/cakephp-mysql-persistent-2

Expected results:
Deployment succeeds.

Additional info:
oc get events:

cakephp-mysql-persistent  Deployment Config  Warning  Failed Create  Error creating deployer pod: pods "cakephp-mysql-persistent-1-deploy" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi  (6 times in the last 11 minutes)
I'm not seeing any evidence of a product bug here (e.g. in the quota system). Here is what I am seeing:

* The cakephp-mysql-persistent template creates a build and two deployments via new-app
* The NotTerminating resource quota configured in free-int restricts limits.cpu=2 and limits.memory=1Gi
* The default limits established by the LimitRange deployed to free-int are cpu=1 and memory=512Mi

There are a lot of transient containers in flight: 3 for the app build, 1 for the app deployment, 1 for the mysql deployment. There is 1 persistent container to compete with (mysql). The namespace seems to constantly skirt the quota limits. Sometimes the timing of pod Terminating transitions and quota recalculation works out and the whole process succeeds; sometimes it doesn't.

I think this is the timeline for the failure mode represented in the attached event log:

1. App build container created
2. App build container created
3. App build container created
4. mysql deployer container created
5. App deployer container fails to create due to quota (but is retryable)
6. mysql app container created
7. App deployer container created (due to retry)
8. App deployer pre-hook container fails to create, which is fatal to the deployment (retry still doesn't seem to be supported for pre-hooks with the Recreate strategy)

My sense is that the problem lies in how cluster quota, LimitRange defaulting, possibly ClusterResourceOverrideConfig, and the app template itself are balanced against each other to provide a consistent user experience, rather than in the product itself.

Is there some evidence of a core product bug I have missed?
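To make the arithmetic behind the rejection concrete, here is a minimal sketch of the check implied by the event message (used + requested must not exceed limited). It is a simplification for illustration only, not the actual Kubernetes or OpenShift quota admission code; the names Resources, admits, and max_concurrent_pods are hypothetical. The figures are taken from the events above, with cpu expressed in millicores and memory in MiB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for the two limits tracked in the events."""
    cpu_millicores: int
    memory_mib: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.cpu_millicores + other.cpu_millicores,
            self.memory_mib + other.memory_mib,
        )

    def fits_within(self, limit: "Resources") -> bool:
        return (
            self.cpu_millicores <= limit.cpu_millicores
            and self.memory_mib <= limit.memory_mib
        )


def admits(used: Resources, requested: Resources, limited: Resources) -> bool:
    """Simplified check: a pod is admitted only if used + requested <= limited."""
    return (used + requested).fits_within(limited)


def max_concurrent_pods(limited: Resources, per_pod: Resources) -> int:
    """How many pods with identical limits fit under the quota at once."""
    return min(
        limited.cpu_millicores // per_pod.cpu_millicores,
        limited.memory_mib // per_pod.memory_mib,
    )


# Values from the event log: limited: limits.cpu=2, limits.memory=1Gi
LIMITED = Resources(cpu_millicores=2000, memory_mib=1024)
# LimitRange defaults applied to each pod: cpu=1, memory=512Mi
POD_DEFAULT = Resources(cpu_millicores=1000, memory_mib=512)

if __name__ == "__main__":
    # State at the time of the pre-hook failure: used: limits.cpu=2, limits.memory=1Gi
    used_at_failure = Resources(cpu_millicores=2000, memory_mib=1024)
    print(admits(used_at_failure, POD_DEFAULT, LIMITED))  # False
    # With only one defaulted pod counted against the quota, a second one fits
    print(admits(POD_DEFAULT, POD_DEFAULT, LIMITED))      # True
    print(max_concurrent_pods(LIMITED, POD_DEFAULT))      # 2
```

Under these assumptions only two defaulted pods fit under the quota at any moment, which is consistent with the rollout depending on whether earlier pods have been released from the quota before the next one is created.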
I'd like for Abhishek to review my assessment and consider reassigning this to an Online component.
Dan Mace and I had briefly discussed this on IRC and I agree with his assessment. I don't think the scheduler is doing something different for this template (3.7 vs 3.8). I would like to know if this is actually a regression. Can QE please confirm if this template was consistently being deployed on first attempt with OCP 3.7 and is now failing (at least 50% of the time) to deploy successfully at the first attempt with OCP 3.8 / 3.9?
Also exists in free-int:

OpenShift Master: v3.11.0-0.21.0
Kubernetes Master: v1.11.0+d4cacc0
OpenShift Web Console: v3.11.0-0.21.0