Bug 1262883

Summary:	build status when out of pods due to quota is not sufficiently explicit
Product:	OpenShift Container Platform	Reporter:	Erik M Jacobs <ejacobs>
Component:	Build	Assignee:	Cesar Wong <cewong>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Wenjing Zheng <wzheng>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	3.0.0	CC:	aos-bugs, bparees, decarr, ejacobs, haowang, rcarvalh, xiuwang
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-01-29 20:58:18 UTC	Type:	Bug

Description Erik M Jacobs 2015-09-14 14:33:48 UTC

Description of problem:
When you try to schedule a build and hit your pod limit, the build status will show "new" forever. The "forbidden" quota error never percolates up to the build status.

Version-Release number of selected component (if applicable):
tuned-profiles-openshift-node-3.0.1.0-1.git.525.eddc479.el7ose.x86_64
openshift-node-3.0.1.0-1.git.525.eddc479.el7ose.x86_64
openshift-3.0.1.0-1.git.525.eddc479.el7ose.x86_64
openshift-master-3.0.1.0-1.git.525.eddc479.el7ose.x86_64
openshift-sdn-ovs-3.0.1.0-1.git.525.eddc479.el7ose.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Set up a project with a quota limit on pods.
2. Try to schedule a build such that the pod limit is exceeded.
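
For step 1, a minimal sketch of a pod quota definition, assuming the standard v1 ResourceQuota object. The quota name `quota` and the limit of 3 pods are illustrative values, and `pod_quota` is a hypothetical helper, not part of any OpenShift tooling.

```python
import json


def pod_quota(name, max_pods):
    """Build a v1 ResourceQuota manifest that caps the number of pods."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name},
        # Quantities are written as strings; "3" means at most three pods.
        "spec": {"hard": {"pods": str(max_pods)}},
    }


# Illustrative values only: a quota named "quota" limited to 3 pods.
manifest = pod_quota("quota", 3)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and loaded into the project with `oc create -f <file>`.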

Actual results:
The build shows "New":

[root@ose3-master ~]# oc get build
NAME             TYPE      STATUS     POD
ruby-example-1   Source    Complete   ruby-example-1-build
ruby-example-3   Source    New        ruby-example-3-build


Expected results:
Build should show failed or some other status.

Additional info:
[root@ose3-master ~]# oc start-build ruby-example; oc describe build ruby-example-3
ruby-example-3
Name:                   ruby-example-3
Created:                1 seconds ago
Labels:                 app=ruby-example,buildconfig=ruby-example
Build Config:           ruby-example
Status:                 New
Duration:               waiting for 1s
Build Pod:              ruby-example-3-build
Strategy:               Source
Image Reference:        DockerImage registry.access.redhat.com/openshift3/ruby-20-rhel7:latest
Source Type:            Git
URL:                    https://github.com/openshift/simple-openshift-sinatra-sti.git
Output to:              ImageStreamTag ruby-example:latest
Push Secret:            builder-dockercfg-5sky8
Events:
  FirstSeen                         LastSeen                          Count   From                  SubobjectPath   Reason         Message
  Mon, 14 Sep 2015 10:31:04 -0400   Mon, 14 Sep 2015 10:31:04 -0400   1       {build-controller }                   failedCreate   Error creating: Pod "ruby-example-3-build" is forbidden: Limited to 3 pods

Comment 2 Ben Parees 2015-09-14 15:19:03 UTC

Does it ultimately get created once your quota is available?

Comment 3 Erik M Jacobs 2015-09-14 15:35:58 UTC

I have no idea. I assume so?

Comment 4 Ben Parees 2015-09-14 15:38:49 UTC

Assuming so, I don't know what the bug is here.

Your build is in the New state because there are no resources available to run it.

Describing the build tells you why it's not running.

If you don't want to free up resources or wait for resources to free up, you can cancel the build.

Failing the build immediately is not the right approach.

Comment 5 Erik M Jacobs 2015-09-14 16:01:05 UTC

As a user, how would I ever know to describe the build? Are we expecting a user to say "gee, it's been several minutes and my build hasn't started, I should go look"?

This is currently a sub-optimal user experience. We provide no feedback to the user, either when they start the build (`oc start-build`) or when they ask about the build (`oc get build`), that indicates there is a problem with the quota.

Additionally, any external automation tool is going to have a hard time figuring out what's going on, because the tool would have to use an unexpected API call (describe -- is this even an API call?) to look at the event history and see that a particular event occurred (forbidden).

While you may not like the approach of failing the build, this is still a bug - the build's status ("can never be scheduled due to quota") is not accurately reflected in the "get" command. "New" may be true but is insufficient information.
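
To illustrate the automation problem, here is a rough sketch of the check a tool would need today. It assumes the events have already been fetched as a list of dicts using the Kubernetes Event field names `involvedObject`, `reason`, and `message`; the function name and the sample record are hypothetical, with the message text taken from the event shown in the description.

```python
def quota_blocked_builds(events):
    """Return names of builds whose pod creation was rejected as forbidden."""
    blocked = []
    for event in events:
        obj = event.get("involvedObject", {})
        if (
            obj.get("kind") == "Build"
            and event.get("reason") == "failedCreate"
            # Matching on free-form message text is fragile by nature.
            and "is forbidden" in event.get("message", "")
        ):
            blocked.append(obj.get("name"))
    return blocked


# Hypothetical event record modeled on the describe output above.
sample_events = [
    {
        "involvedObject": {"kind": "Build", "name": "ruby-example-3"},
        "reason": "failedCreate",
        "message": 'Error creating: Pod "ruby-example-3-build" is forbidden: Limited to 3 pods',
    }
]
print(quota_blocked_builds(sample_events))
```

Having to match on message text like this is the kind of workaround that a more explicit build status would make unnecessary.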

Comment 6 Erik M Jacobs 2015-09-14 16:02:17 UTC

I've changed the subject of the bug to more accurately reflect what's going on -- the information provided to the user with "get" is insufficient. We are expecting too much of the user to determine what's going on in this particular case.

Comment 7 Ben Parees 2015-09-14 17:26:59 UTC

If/when this is resolved, we should solve the entirety of this issue:
https://github.com/openshift/origin/issues/3847

Comment 8 Rodolfo Carvalho 2015-09-30 16:00:48 UTC

This PR should fix this BZ:
https://github.com/openshift/origin/pull/4872

Comment 9 Erik M Jacobs 2015-09-30 16:05:18 UTC

I've put some comments in the BZ.

Comment 10 Rodolfo Carvalho 2015-10-22 09:46:15 UTC

@Erik, finally the error info also got into the output of `oc get builds`:

https://github.com/openshift/origin/pull/4909

So now the build.status.Message and build.status.Reason reflect the error condition (including the one reported here), and those can be seen in `oc describe build` and `oc get build`.

Please take a look.
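
A minimal sketch of how a script might consume the new fields, assuming a build fetched as JSON (for example with `oc get build <name> -o json`) exposes them as `status.phase`, `status.reason`, and `status.message`. The helper name and the sample object are hypothetical.

```python
def build_status_column(build):
    """Render a 'Phase (Reason)' string, or just the phase if no reason is set."""
    status = build.get("status", {})
    phase = status.get("phase", "Unknown")
    reason = status.get("reason")
    return "%s (%s)" % (phase, reason) if reason else phase


# Hypothetical build objects, reduced to the fields used here.
blocked = {"status": {"phase": "Pending", "reason": "CannotCreateBuildPod"}}
finished = {"status": {"phase": "Complete"}}

print(build_status_column(blocked))
print(build_status_column(finished))
```

With the reason on the build object itself, a tool no longer has to go through the event history to find out why a build has not started.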

Comment 11 Erik M Jacobs 2015-10-22 14:11:59 UTC

Yeah that PR looks pretty good. I think it will address this BZ.

Comment 12 XiuJuan Wang 2015-10-26 07:22:37 UTC

The build status is marked Pending with the message (CannotCreateBuildPod) after hitting the project's pod limit.
But after I free up resources, the pending build stays Pending with (CannotCreateBuildPod).
IMO that's not acceptable, right?

# oc get builds
NAME                  TYPE      FROM      STATUS                           STARTED          DURATION
ruby-sample-build-2   Source    Git       Failed                           39 minutes ago   2m2s
ruby-sample-build-3   Source    Git       Complete                         36 minutes ago   3m1s
ruby-sample-build-4   Source    Git       Pending (CannotCreateBuildPod)        

# oc describe  project  xiuwangquota
Name:		xiuwangquota
Created:	About an hour ago
Labels:		<none>
Annotations:	openshift.io/description=
		openshift.io/display-name=
		openshift.io/sa.scc.mcs=s0:c13,c12
		openshift.io/sa.scc.supplemental-groups=1000180000/10000
		openshift.io/sa.scc.uid-range=1000180000/10000
Display Name:	<none>
Description:	<none>
Status:		Active
Node Selector:	<none>
Quota:
	Name:			quota
	Resource		Used	Hard
	--------		----	----
	cpu			600m	1
	memory			300Mi	750Mi
	pods			2	4
	replicationcontrollers	2	10
	resourcequotas		1	1
	services		2	10
Resource limits:
	Name:		limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		10m	500m	-
	Pod		memory		5Mi	750Mi	-
	Container	memory		5Mi	750Mi	100Mi
	Container	cpu		10m	500m	100m

Comment 13 Ben Parees 2015-10-26 13:42:44 UTC

Did the build ever enter the Running state? I would expect the message to remain "Pending (CannotCreateBuildPod)" until the system recognizes that resources are available and starts running the pod, at which point the state should change to "Running".

If the pod never entered the "Running" state after resources became available, I think that's a scheduler issue.

Comment 14 XiuJuan Wang 2015-10-27 06:56:10 UTC

The Pending (CannotCreateBuildPod) build never entered the "Running" state after resources became available.
But a newly triggered build does run when resources are available.

Comment 15 Ben Parees 2015-10-27 12:39:56 UTC

Derek, who does this bug need to go to?  Sounds like a node/scheduling problem.

Comment 16 Derek Carr 2015-11-16 14:36:31 UTC

Ben,

I read the bug and it's not clear to me that we have shown that the build pod was ever actually created.  Is there output that shows the build pod was in fact created?  If so, we could look at events around that pod to know why it may not have been scheduled.  Right now, best I can tell in the discussion is that no build pod was created for builds that were in this state.

Thanks,
Derek

Comment 17 Ben Parees 2015-11-16 14:56:00 UTC

Thanks Derek, you're right. We have a bug here in which we:
1) update the build phase to Pending
2) hit an error creating the build pod (in this case a quota limit)
3) still end up committing the updated build object because we're trying to reflect the error from (2) but this also ends up reflecting the phase change from (1).

We need to not update the build phase if an error occurs creating the build pod.
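
The three steps above can be modeled with a toy sketch. This is not the actual origin build controller code: the function names, the dict-based build object, and the `QuotaExceeded` exception are all hypothetical stand-ins used only to show the ordering problem and the proposed change.

```python
class QuotaExceeded(Exception):
    """Stand-in for the 'forbidden' error returned when the pod quota is hit."""


def create_pod_over_quota(build):
    # Hypothetical pod creator that always fails, as in the quota case.
    raise QuotaExceeded('Pod "%s-build" is forbidden' % build["name"])


def handle_build_buggy(build, create_pod):
    updated = dict(build)
    updated["phase"] = "Pending"      # (1) phase is changed up front
    try:
        create_pod(updated)           # (2) pod creation fails on quota
    except QuotaExceeded as err:
        updated["reason"] = "CannotCreateBuildPod"
        updated["message"] = str(err)
    return updated                    # (3) the commit carries the phase change too


def handle_build_fixed(build, create_pod):
    updated = dict(build)
    try:
        create_pod(updated)
    except QuotaExceeded as err:
        # Record the error but leave the phase untouched.
        updated["reason"] = "CannotCreateBuildPod"
        updated["message"] = str(err)
        return updated
    updated["phase"] = "Pending"      # only advance once the pod exists
    return updated


new_build = {"name": "ruby-example-3", "phase": "New"}
print(handle_build_buggy(new_build, create_pod_over_quota)["phase"])
print(handle_build_fixed(new_build, create_pod_over_quota)["phase"])
```

In the buggy variant the build ends up Pending even though no pod exists; in the fixed variant it stays New while still carrying the reason and message.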

Comment 18 Ben Parees 2015-11-16 15:00:02 UTC

Turns out Cesar already has a pull that should fix this:
https://github.com/openshift/origin/pull/5743

Comment 19 Ben Parees 2015-11-16 15:01:50 UTC

Incidentally, the "never schedules" portion of this bug is really a dupe of:
https://bugzilla.redhat.com/show_bug.cgi?id=1278232

Comment 20 Wang Haoran 2016-01-20 03:20:57 UTC

Dupe bug, already verified in https://bugzilla.redhat.com/show_bug.cgi?id=1278232#c10