
Bug 1489181

Summary: Support for quotas on number of jobs
Product: OpenShift Container Platform
Reporter: Vítor Corrêa <vcorrea>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: high
Docs Contact:
Priority: unspecified    
Version: 3.4.0
CC: aos-bugs, jokerman, maszulik, mchappel, mfojtik, mmccomas
Target Milestone: ---   
Target Release: 3.6.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: The feature is available in 3.6. No need for doc update.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-25 13:06:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vítor Corrêa 2017-09-06 21:39:30 UTC
Description of problem:


An OpenShift instance spawned a massive number of scheduled jobs. Initially these failed because they couldn't access any service in the DC (the services were down). Because they had failed, a large number of pending Jobs then backed up, and these in turn all failed because there was not enough quota to start their pods. The result was lots of deploy pods starting up and failing in quick succession, exhausting various resources.

Version-Release number of selected component (if applicable): ocp 3.4

How reproducible:


Actual results:


Expected results:

Comment 2 Maciej Szulik 2017-09-14 14:40:35 UTC
Unfortunately, this is a limitation of CronJobs and one of the reasons they are still a technology preview. In the meantime, several mechanisms have been added to address these problems:
1. successfulJobsHistoryLimit and failedJobsHistoryLimit in the CronJob spec, starting from version 3.6 of origin.
2. A job failure policy added in k8s 1.8, which will land in origin 3.8.

Comment 4 Maciej Szulik 2017-09-14 14:47:17 UTC
As noted in my previous comment, successfulJobsHistoryLimit and failedJobsHistoryLimit are available in origin 3.6. These values limit the number of old jobs kept around on the server.
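
For illustration, a minimal sketch of a CronJob manifest that sets both history limits, built as a plain Python dict and printed as JSON. The field names successfulJobsHistoryLimit and failedJobsHistoryLimit come from the comments above; everything else is an assumption for the example: the helper name cronjob_manifest, the job name, schedule and image are hypothetical, and batch/v2alpha1 is assumed to be the CronJob API group/version served by a 3.6 cluster.

```python
import json


def cronjob_manifest(name, schedule, image, keep_successful=3, keep_failed=1):
    """Build a CronJob manifest (as a dict) with both history limits set.

    Hypothetical helper for illustration only; it does not talk to a cluster.
    """
    return {
        # Assumption: CronJob is served as batch/v2alpha1 on this release.
        "apiVersion": "batch/v2alpha1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Number of finished Jobs of each kind kept on the server.
            "successfulJobsHistoryLimit": keep_successful,
            "failedJobsHistoryLimit": keep_failed,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            "restartPolicy": "Never",
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values; replace with a real name, schedule and image.
    manifest = cronjob_manifest("report", "*/5 * * * *", "example/report:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and passed to `oc create -f`. Note that these limits only prune finished Jobs; they do not cap how many Jobs are running or pending at once.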

Comment 6 Mark Chappell 2017-10-25 07:42:18 UTC
@Maciej,

Being able to apply a quota to limit the number of running/pending Jobs would still be very helpful for the original use case.  Is there an upstream tracker for that?

(As the original customer, we're not too worried about this being back-ported, we're aware CronJobs are tech preview and are being fairly aggressive about updates)

Comment 7 Wang Haoran 2017-10-25 07:55:44 UTC
We have a Trello card tracking generic object count quota:
https://trello.com/c/ayBfZlz8/1033-5-upstream-generic-object-count-quota
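
For illustration, a sketch of what a per-namespace limit on Job objects can look like once generic object count quota is available. This is not taken from the thread or the Trello card: it assumes the `count/<resource>.<group>` key syntax that later upstream Kubernetes releases accept in a ResourceQuota, and the helper name job_count_quota, the quota name and the limits are hypothetical. Which OpenShift release supports this is not stated here.

```python
import json


def job_count_quota(name, max_jobs, max_cronjobs):
    """Build a ResourceQuota manifest (as a dict) capping Job and CronJob objects.

    Hypothetical helper for illustration only; it does not talk to a cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name},
        "spec": {
            "hard": {
                # Assumption: the cluster supports generic object count quota
                # keys of the form count/<resource>.<group>.
                "count/jobs.batch": str(max_jobs),
                "count/cronjobs.batch": str(max_cronjobs),
            }
        },
    }


if __name__ == "__main__":
    # Placeholder name and limits.
    print(json.dumps(job_count_quota("job-count", 20, 5), indent=2))
```

An object count quota of this kind counts every Job object that exists in the namespace, finished ones included, so it caps the total rather than only running or pending Jobs.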

Comment 8 Mark Chappell 2017-10-25 07:57:43 UTC
Fantastic, thanks Wang

Comment 9 errata-xmlrpc 2017-10-25 13:06:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049