Bug 1829447

Summary:	Build pods falling under NotTerminating quota scope
Product:	OpenShift Container Platform	Reporter:	Sergio G. <sgarciam>
Component:	Build	Assignee:	Gabe Montero <gmontero>
Status:	CLOSED ERRATA	QA Contact:	wewang <wewang>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.3.z	CC:	aos-bugs, gmontero, mfojtik, sbhavsar, scuppett, wzheng
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: with the RunOnceDuration admission plugin disabled in 4.x, there is no longer automatic applying of activeDeadlineSeconds to build pods Consequence: Incorrect behavior with quotas can occur, namely quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope. Fix: the build controller is updated to provide an acceptable default for activeDeadlineSeconds given its design of being a run once pod that always terminates. Result: NotTerminating scope with quotas are handled properly for builds, namely the quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope.	Story Points:	---
Clone Of:
Clones:	1835913 (view as bug list)		Environment:
Last Closed:	2020-07-13 17:32:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1835913

Description Sergio G. 2020-04-29 14:39:03 UTC

Description of problem:
According to documentation [1]:
 - NotTerminating scope in quotas matches pods where spec.activeDeadlineSeconds is nil.
 - Build pods will fall under NotTerminating unless the RestartNever policy is applied.


Version-Release number of selected component (if applicable):
4.3


How reproducible:
Always


Steps to Reproduce:
$ oc new-project test-quota

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-memory
spec:
  hard:
    limits.memory: 2Gi
  scopes:
  - NotTerminating
EOF

$ oc new-app https://github.com/openshift/nodejs-ex


Actual results:
$ oc get builds
NAME          TYPE     FROM   STATUS                       STARTED   DURATION
nodejs-ex-1   Source   Git    New (CannotCreateBuildPod)

$ oc describe build/nodejs-ex-1
Events:
  Type		Reason		Age			From			Message
  ----		------		----			----			-------
  Warning	FailedCreate	16s (x13 over 36s)	build-controller	Error creating build pod: pods "nodejs-ex-4-build" is forbidden: failed quota: limit-memory: must specify limits.memory


Expected results:
The quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope.


References:
[1] https://docs.openshift.com/container-platform/4.3/applications/quotas/quotas-setting-per-project.html

Comment 1 Stephen Cuppett 2020-04-29 15:26:40 UTC

Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Adam Kaplan 2020-04-29 18:07:31 UTC

I believe ResourceQuota is behaving as expected, though I do not fully understand what "the RestartNever policy is applied" means. I do not believe this refers to the standard restart policy that is assigned to pods.

To work around this issue, you can set a cluster-wide BuildDefault that sets the memory limit to 2Gi [1]. Unfortunately this is a cluster-wide setting - for a specific namespace you can use LimitRanges [2].

[1] https://docs.openshift.com/container-platform/4.3/builds/build-configuration.html#builds-configuration-parameters_build-configuration
[2] https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/

Comment 4 Sergio G. 2020-04-29 19:11:17 UTC

Hi Adam.
As far as I understand, the "the RestartNever policy is applied" means a RestartPolicy=Never in the pods. 

Following the breadcrumbs in our documentation, I found that there was a plugin named RunOnceDuration [1] which was used to include the default active deadline for run-once pods, like the build pods. Thats why the documentation was probably back in the time but it's not correct nowadays.

What I'd like to know is: is this a bug in the product or in the documentation?

If it's a bug in the documentation and we do require a limit in the buildConfigs, then let's do it. If it's a bug in the product which is not applying an activeDeadlineSeconds to build pods, then let's fix it.


[1] https://docs.openshift.com/container-platform/3.6/admin_guide/managing_pods.html#configuring-the-run-once-duration-plug-in

Comment 5 Michal Fojtik 2020-05-05 07:03:03 UTC

RunOnceDuration configuration is disabled since 4.0, the "RestartNever" is standard container restart policy (Always, Never, etc.).

Moving back to devex team to decide whether they want to ship the RunOnceDuration admission as webhook admission plugin, which will involve creating new operator.

Comment 6 Sergio G. 2020-05-05 12:30:40 UTC

>> RunOnceDuration configuration is disabled since 4.0, the "RestartNever" is standard container restart policy (Always, Never, etc.).

This means that, until the RunOnceDuration admission is back (or any other mean which makes the buildPods have a very long activeDeadlineSeconds) the documentation is wrong as build pods are falling under NotTerminating.

I'll probably document it in a KCS until you guys make a decission:
 - implement any change to have such activeDeadlineSeconds in build pods
 - decide that build pods aren't special at all and they must have a limit/quota as other pods, which can be set per build or cluster-wide using the build overrides.

Comment 7 Sergio G. 2020-05-05 13:17:51 UTC

Adding KCS for future cases https://access.redhat.com/solutions/5049021

Comment 8 Adam Kaplan 2020-05-06 13:53:52 UTC

Adding a long activeDeadlineSeconds makes the most sense IMO - build pods are designed to terminate. I propose 1 week as a fixed active deadline (604800 seconds).

Comment 9 Gabe Montero 2020-05-11 20:29:50 UTC

talked with Corey ... I'm going to take this one

Comment 13 wewang 2020-05-14 07:39:38 UTC

Verified in version:
4.5.0-0.nightly-2020-05-13-202437

Steps:
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-memory
spec:
  hard:
    limits.memory: 2Gi
  scopes:
  - NotTerminating
EOF

$ oc new-app https://github.com/openshift/nodejs-ex
$ oc get builds
NAME          TYPE     FROM          STATUS     STARTED              DURATION
nodejs-ex-1   Source   Git@a096bd2   Complete   About a minute ago   1m4s

Comment 14 Sergio G. 2020-05-14 15:49:08 UTC

Will the fix backported to 4.4 and/or 4.3 ?

Comment 17 errata-xmlrpc 2020-07-13 17:32:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409