Bug 1262883
Summary: | build status when out of pods due to quota is not sufficiently explicit | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Erik M Jacobs <ejacobs> |
Component: | Build | Assignee: | Cesar Wong <cewong> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Wenjing Zheng <wzheng> |
Severity: | low | Docs Contact: | |
Priority: | unspecified | | |
Version: | 3.0.0 | CC: | aos-bugs, bparees, decarr, ejacobs, haowang, rcarvalh, xiuwang |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-01-29 20:58:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Erik M Jacobs
2015-09-14 14:33:48 UTC
Does it ultimately get created once your quota is available? I have no idea. I assume so?

Assuming so, I don't know what the bug is here. Your build is in the New state because there are no resources available to run it. Describing the build tells you why it's not running. If you don't want to free up resources or wait for resources to free up, you can cancel the build. Failing the build immediately is not the right approach.

As a user, how would I ever know to describe the build? Are we expecting a user to say "gee, it's been several minutes and my build hasn't started, I should go look"? This is currently a sub-optimal user experience. We provide no feedback to the user when they either start the build (new-build) or when they ask about the build (get-build) that indicates there is a problem with the quota.

Additionally, any external automation tool is going to have a hard time figuring out what's going on, because the tool would have to use an unexpected API call (describe -- is this even an API call?) to look at the event history and see that a particular event occurred (forbidden).

While you may not like the approach of failing the build, this is still a bug - the build's status ("can never be scheduled due to quota") is not accurately reflected in the "get" command. "New" may be true but is insufficient information.

I've changed the subject of the bug to more accurately reflect what's going on -- the information provided to the user with "get" is insufficient. We are expecting too much of the user to determine what's going on in this particular case.

If/when this is resolved, we should solve the entirety of this issue: https://github.com/openshift/origin/issues/3847

This PR should fix this BZ: https://github.com/openshift/origin/pull/4872

I've put some comments in the BZ.
@Erik, the error info has finally made it into the output of `oc get builds` as well: https://github.com/openshift/origin/pull/4909 So now build.status.Message and build.status.Reason reflect the error condition (including the one reported here), and both can be seen in `oc describe build` and `oc get build`. Please take a look.

Yeah, that PR looks pretty good. I think it will address this BZ.

The build status is marked Pending with the message (CannotCreateBuildPod) after hitting the project's pod limit. But after I free up resources, the pending build stays Pending with (CannotCreateBuildPod). IMO, that's not acceptable, right?

    # oc get builds
    NAME                  TYPE     FROM   STATUS                           STARTED          DURATION
    ruby-sample-build-2   Source   Git    Failed                           39 minutes ago   2m2s
    ruby-sample-build-3   Source   Git    Complete                         36 minutes ago   3m1s
    ruby-sample-build-4   Source   Git    Pending (CannotCreateBuildPod)

    # oc describe project xiuwangquota
    Name:           xiuwangquota
    Created:        About an hour ago
    Labels:         <none>
    Annotations:    openshift.io/description=
                    openshift.io/display-name=
                    openshift.io/sa.scc.mcs=s0:c13,c12
                    openshift.io/sa.scc.supplemental-groups=1000180000/10000
                    openshift.io/sa.scc.uid-range=1000180000/10000
    Display Name:   <none>
    Description:    <none>
    Status:         Active
    Node Selector:  <none>
    Quota:
        Name:                     quota
        Resource                  Used    Hard
        --------                  ----    ----
        cpu                       600m    1
        memory                    300Mi   750Mi
        pods                      2       4
        replicationcontrollers    2       10
        resourcequotas            1       1
        services                  2       10
    Resource limits:
        Name:       limits
        Type        Resource   Min   Max     Default
        ----        --------   ---   ---     ---
        Pod         cpu        10m   500m    -
        Pod         memory     5Mi   750Mi   -
        Container   memory     5Mi   750Mi   100Mi
        Container   cpu        10m   500m    100m

Did the build ever enter the running state? I would expect the message to remain "Pending (CannotCreateBuildPod)" until the system recognizes resources are available and starts running the pod, at which point the state should change to "Running". If the pod never entered the "Running" state after resources became available, I think that's a scheduler issue.
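
For automation that only sees the `oc get builds` STATUS column, the phase and the reason can be split apart roughly as follows. This is a minimal sketch: the "Phase (Reason)" shape is inferred only from the sample output in this comment, and `parse_status` is a hypothetical helper, not part of `oc` or any OpenShift library.

```python
import re

# Assumed shape, taken from the sample output only: "Phase" or "Phase (Reason)".
STATUS_RE = re.compile(r"^(?P<phase>\w+)(?:\s+\((?P<reason>\w+)\))?$")


def parse_status(status):
    """Split a STATUS cell into (phase, reason); reason is None when absent."""
    match = STATUS_RE.match(status.strip())
    if match is None:
        raise ValueError("unrecognized build status: %r" % status)
    return match.group("phase"), match.group("reason")


if __name__ == "__main__":
    for cell in ("Failed", "Complete", "Pending (CannotCreateBuildPod)"):
        print(cell, "->", parse_status(cell))
```

A tool could then treat any non-empty reason on a build that has not started as a signal to surface the problem to the user, instead of waiting on the build indefinitely.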
The Pending (CannotCreateBuildPod) build never entered the "Running" state after resources became available. But a newly triggered build does run if resources are available.

Derek, who does this bug need to go to? Sounds like a node/scheduling problem.

Ben, I read the bug and it's not clear to me that we have shown that the build pod was ever actually created. Is there output that shows the build pod was in fact created? If so, we could look at events around that pod to know why it may not have been scheduled. Right now, best I can tell from the discussion, no build pod was created for builds that were in this state.

Thanks,
Derek

Thanks Derek, you're right, we have a bug here in which we:
1) update the build phase to Pending
2) hit an error creating the build pod (in this case a quota limit)
3) still end up committing the updated build object, because we're trying to reflect the error from (2), but this also ends up reflecting the phase change from (1).

We need to not update the build phase if an error occurs creating the build pod.

Turns out Cesar already has a pull that should fix this: https://github.com/openshift/origin/pull/5743

Incidentally, the "never schedules" portion of this bug is really a dupe of: https://bugzilla.redhat.com/show_bug.cgi?id=1278232

Dupe bug, already verified in https://bugzilla.redhat.com/show_bug.cgi?id=1278232#c10
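
To make the three-step sequence described above concrete, here is a simplified model of the ordering problem and of the "don't update the phase on error" fix. It is only an illustration under stated assumptions: `Build`, `create_pod`, `handle_build_buggy`, and `handle_build_fixed` are hypothetical names, and none of this is the actual origin controller code or the change in the linked pull request.

```python
from dataclasses import dataclass, replace


class QuotaExceeded(Exception):
    """Stand-in for the 'forbidden' error returned when the pod quota is hit."""


@dataclass(frozen=True)
class Build:
    name: str
    phase: str = "New"
    reason: str = ""
    message: str = ""


def create_pod(build, pods_used, pods_hard):
    # Hypothetical pod creation: refuse when the project is at its pod limit.
    if pods_used >= pods_hard:
        raise QuotaExceeded("pods quota exceeded for %s" % build.name)


def handle_build_buggy(build, pods_used, pods_hard):
    updated = replace(build, phase="Pending")        # (1) phase changed first
    try:
        create_pod(build, pods_used, pods_hard)      # (2) may fail on quota
    except QuotaExceeded as err:
        updated = replace(updated, reason="CannotCreateBuildPod",
                          message=str(err))
    return updated                                   # (3) phase change committed too


def handle_build_fixed(build, pods_used, pods_hard):
    try:
        create_pod(build, pods_used, pods_hard)
    except QuotaExceeded as err:
        # Record why the pod could not be created, but leave the phase alone.
        return replace(build, reason="CannotCreateBuildPod", message=str(err))
    return replace(build, phase="Pending")           # only after the pod exists


if __name__ == "__main__":
    build = Build("ruby-sample-build-4")
    print(handle_build_buggy(build, pods_used=4, pods_hard=4))
    print(handle_build_fixed(build, pods_used=4, pods_hard=4))
```

In the buggy variant the build ends up Pending with no pod behind it, which matches the stuck build reported here; in the fixed variant the build keeps its original phase and only carries the reason and message.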