Bug 1829447 - Build pods falling under NotTerminating quota scope
Summary: Build pods falling under NotTerminating quota scope
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.5.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks: 1835913
TreeView+ depends on / blocked
 
Reported: 2020-04-29 14:39 UTC by Sergio G.
Modified: 2020-07-13 17:33 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: with the RunOnceDuration admission plugin disabled in 4.x, there is no longer automatic applying of activeDeadlineSeconds to build pods Consequence: Incorrect behavior with quotas can occur, namely quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope. Fix: the build controller is updated to provide an acceptable default for activeDeadlineSeconds given its design of being a run once pod that always terminates. Result: NotTerminating scope with quotas are handled properly for builds, namely the quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope.
Clone Of:
: 1835913 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:32:51 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift openshift-controller-manager pull 105 None closed Bug 1829447: ensure build pod activeDeadlineSeconds always set 2020-09-01 11:45:42 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:33:14 UTC

Description Sergio G. 2020-04-29 14:39:03 UTC
Description of problem:
According to documentation [1]:
 - NotTerminating scope in quotas matches pods where spec.activeDeadlineSeconds is nil.
 - Build pods will fall under NotTerminating unless the RestartNever policy is applied.


Version-Release number of selected component (if applicable):
4.3


How reproducible:
Always


Steps to Reproduce:
$ oc new-project test-quota

$ cat <<EOF | oc create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-memory
spec:
  hard:
    limits.memory: 2Gi
  scopes:
  - NotTerminating
EOF

$ oc new-app https://github.com/openshift/nodejs-ex


Actual results:
$ oc get builds
NAME          TYPE     FROM   STATUS                       STARTED   DURATION
nodejs-ex-1   Source   Git    New (CannotCreateBuildPod)

$ oc describe build/nodejs-ex-1
Events:
  Type		Reason		Age			From			Message
  ----		------		----			----			-------
  Warning	FailedCreate	16s (x13 over 36s)	build-controller	Error creating build pod: pods "nodejs-ex-4-build" is forbidden: failed quota: limit-memory: must specify limits.memory


Expected results:
The quota shouldn't be required as build pods have the RestartPolicy to Never and that makes them fall outside the NotTerminating scope.


References:
[1] https://docs.openshift.com/container-platform/4.3/applications/quotas/quotas-setting-per-project.html

Comment 1 Stephen Cuppett 2020-04-29 15:26:40 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Adam Kaplan 2020-04-29 18:07:31 UTC
I believe ResourceQuota is behaving as expected, though I do not fully understand what "the RestartNever policy is applied" means. I do not believe this refers to the standard restart policy that is assigned to pods.

To work around this issue, you can set a cluster-wide BuildDefault that sets the memory limit to 2Gi [1]. Unfortunately this is a cluster-wide setting - for a specific namespace you can use LimitRanges [2].

[1] https://docs.openshift.com/container-platform/4.3/builds/build-configuration.html#builds-configuration-parameters_build-configuration
[2] https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/

Comment 4 Sergio G. 2020-04-29 19:11:17 UTC
Hi Adam.
As far as I understand, the "the RestartNever policy is applied" means a RestartPolicy=Never in the pods. 

Following the breadcrumbs in our documentation, I found that there was a plugin named RunOnceDuration [1] which was used to include the default active deadline for run-once pods, like the build pods. Thats why the documentation was probably back in the time but it's not correct nowadays.

What I'd like to know is: is this a bug in the product or in the documentation?

If it's a bug in the documentation and we do require a limit in the buildConfigs, then let's do it. If it's a bug in the product which is not applying an activeDeadlineSeconds to build pods, then let's fix it.


[1] https://docs.openshift.com/container-platform/3.6/admin_guide/managing_pods.html#configuring-the-run-once-duration-plug-in

Comment 5 Michal Fojtik 2020-05-05 07:03:03 UTC
RunOnceDuration configuration is disabled since 4.0, the "RestartNever" is standard container restart policy (Always, Never, etc.).

Moving back to devex team to decide whether they want to ship the RunOnceDuration admission as webhook admission plugin, which will involve creating new operator.

Comment 6 Sergio G. 2020-05-05 12:30:40 UTC
>> RunOnceDuration configuration is disabled since 4.0, the "RestartNever" is standard container restart policy (Always, Never, etc.).

This means that, until the RunOnceDuration admission is back (or any other mean which makes the buildPods have a very long activeDeadlineSeconds) the documentation is wrong as build pods are falling under NotTerminating.

I'll probably document it in a KCS until you guys make a decission:
 - implement any change to have such activeDeadlineSeconds in build pods
 - decide that build pods aren't special at all and they must have a limit/quota as other pods, which can be set per build or cluster-wide using the build overrides.

Comment 7 Sergio G. 2020-05-05 13:17:51 UTC
Adding KCS for future cases https://access.redhat.com/solutions/5049021

Comment 8 Adam Kaplan 2020-05-06 13:53:52 UTC
Adding a long activeDeadlineSeconds makes the most sense IMO - build pods are designed to terminate. I propose 1 week as a fixed active deadline (604800 seconds).

Comment 9 Gabe Montero 2020-05-11 20:29:50 UTC
talked with Corey ... I'm going to take this one

Comment 13 wewang 2020-05-14 07:39:38 UTC
Verified in version:
4.5.0-0.nightly-2020-05-13-202437

Steps:
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-memory
spec:
  hard:
    limits.memory: 2Gi
  scopes:
  - NotTerminating
EOF

$ oc new-app https://github.com/openshift/nodejs-ex
$ oc get builds
NAME          TYPE     FROM          STATUS     STARTED              DURATION
nodejs-ex-1   Source   Git@a096bd2   Complete   About a minute ago   1m4s

Comment 14 Sergio G. 2020-05-14 15:49:08 UTC
Will the fix backported to 4.4 and/or 4.3 ?

Comment 17 errata-xmlrpc 2020-07-13 17:32:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.