1396404 – Sometimes "activeDeadlineSeconds" in ScheduledJob doesn't take effect

Summary: Sometimes "activeDeadlineSeconds" in ScheduledJob doesn't take effect

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	3.3.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.8.0
Assignee:	Maciej Szulik
QA Contact:	Chuan Yu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-11-18 09:20 UTC by Bing Li
Modified:	2018-03-28 14:05 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When there is no activity, the job controller touches each job only when performing a full resync of all jobs. Consequence: Some jobs significantly exceed a short activeDeadlineSeconds. Fix: Enqueue jobs that have a short activeDeadlineSeconds set so that they are resynced more frequently. Result: A short activeDeadlineSeconds is applied correctly.
Clone Of:
Environment:
Last Closed:	2018-03-28 14:05:01 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:0489	0	None	None	None	2018-03-28 14:05:44 UTC

Description Bing Li 2016-11-18 09:20:05 UTC

Description of problem:
When scheduling jobs with "activeDeadlineSeconds", the job sometimes keeps running even after the deadline has passed.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.3.1.3
Kubernetes Master: v1.3.0+52492b4

How reproducible:
Always

Steps to Reproduce:
1. Create a scheduledjob with "activeDeadlineSeconds":
$ cat sj.yaml 
apiVersion: batch/v2alpha1
kind: ScheduledJob
metadata:
  labels:
    run: sj
  name: sj
spec:
  jobTemplate:
    metadata:
    spec:
      completions: 1
      parallelism: 1
      activeDeadlineSeconds: 10
      template:
        metadata:
          labels:
            run: sj
        spec:
          containers:
          - args:
            - sleep
            - "90"
            image: busybox
            imagePullPolicy: Always
            name: sj
            resources: {}
          restartPolicy: Never
  schedule: '* * * * *'
  suspend: false
$ oc create -f sj.yaml

2. Check if the job's pod would be terminated when reaching "activeDeadlineSeconds".

Actual results:
2. Sometimes the pod is not terminated at the deadline and runs until it completes:
$ oc get job
NAME            DESIRED   SUCCESSFUL   AGE
sj-1552910142   1         0            2m
sj-1628735297   1         1            3m
sj-1628866369   1         0            30s
sj-1704691524   1         0            1m
$ oc get pod
NAME                  READY     STATUS      RESTARTS   AGE
sj-1628735297-97di6   0/1       Completed   0          3m
sj-1628866369-ima1n   1/1       Running     0          35s
$ oc describe job sj-1552910142
...
Events:
  FirstSeen        LastSeen        Count        From                        SubobjectPath        Type                Reason                        Message
  ---------        --------        -----        ----                        -------------        --------        ------                        -------
  18m                18m                1        {job-controller }                        Normal                SuccessfulCreate        Created pod: sj-1552910142-uughm
  17m                17m                1        {job-controller }                        Normal                SuccessfulDelete        Deleted pod: sj-1552910142-uughm
  17m                17m                2        {job-controller }                        Normal                DeadlineExceeded        Job was active longer than specified deadline
$ oc get pod sj-1552910142-uughm
No resources found.
Error from server: pods "sj-1552910142-uughm" not found
$ oc describe job sj-1628735297
...
Events:
  FirstSeen        LastSeen        Count        From                        SubobjectPath        Type                Reason                        Message
  ---------        --------        -----        ----                        -------------        --------        ------                        -------
  20m                20m                1        {job-controller }                        Normal                SuccessfulCreate        Created pod: sj-1628735297-97di6
  18m                18m                1        {job-controller }                        Normal                DeadlineExceeded        Job was active longer than specified deadline
$ oc get pod sj-1628735297-97di6 -o yaml
...
        finishedAt: 2016-11-18T08:10:40Z
        reason: Completed
        startedAt: 2016-11-18T08:09:10Z

Expected results:
The job should finish when it reaches "activeDeadlineSeconds" and delete the pod, instead of waiting until the pod completes.
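
For reference, the overrun in the actual results can be checked from the pod's timestamps. The following is a minimal Python sketch; the helper name `overrun_seconds` is hypothetical, the values are copied from the `oc get pod sj-1628735297-97di6 -o yaml` output and the job spec above, and the job start time is approximated by the container start time.

```python
from datetime import datetime

# Timestamps from the pod status above; deadline from the job spec.
STARTED_AT = "2016-11-18T08:09:10Z"
FINISHED_AT = "2016-11-18T08:10:40Z"
ACTIVE_DEADLINE_SECONDS = 10


def overrun_seconds(started_at: str, finished_at: str, deadline: int) -> int:
    """Return how many seconds the pod ran beyond the deadline (0 if none)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    ran = datetime.strptime(finished_at, fmt) - datetime.strptime(started_at, fmt)
    return max(0, int(ran.total_seconds()) - deadline)


if __name__ == "__main__":
    # The pod ran for the full 90 s sleep, i.e. 80 s past the 10 s deadline.
    print(overrun_seconds(STARTED_AT, FINISHED_AT, ACTIVE_DEADLINE_SECONDS))
```

With the values above, the pod ran for 90 seconds against a 10-second deadline.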

Comment 2 Maciej Szulik 2016-11-23 09:59:52 UTC

This is a known problem with activeDeadlineSeconds in jobs, which should be supported. See https://github.com/kubernetes/kubernetes/issues/32149 for more details. I don't have an ETA for it yet. It only affects a short ADS, where short means less than 10 minutes, which is the full resync period of the job controller.
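
To illustrate why a short deadline can be missed, here is a simplified Python model; it is not the job controller's actual code. It assumes the deadline is evaluated only when the controller syncs a job, and that syncs happen on pod events or on the periodic full resync. The function names and the extra sync scheduled at the deadline are hypothetical.

```python
from typing import Iterable, List, Optional

FULL_RESYNC_SECONDS = 600  # the 10-minute full resync mentioned above


def deadline_detected_at(sync_times: Iterable[float], deadline: float) -> Optional[float]:
    """Return the first sync time at or after the deadline, or None.

    Times are seconds since the job started. The model assumes the
    deadline is only evaluated during a sync.
    """
    candidates = [t for t in sync_times if t >= deadline]
    return min(candidates) if candidates else None


def with_delayed_sync(sync_times: Iterable[float], deadline: float) -> List[float]:
    """Add one extra sync scheduled exactly at the deadline (hypothetical)."""
    return sorted(set(sync_times) | {deadline})


if __name__ == "__main__":
    # Job from this report: deadline 10 s, pod sleeps 90 s.
    # Assumed syncs: pod created (0 s), pod completed (90 s), full resync.
    syncs = [0, 90, FULL_RESYNC_SECONDS]
    print(deadline_detected_at(syncs, 10))                         # 90
    print(deadline_detected_at(with_delayed_sync(syncs, 10), 10))  # 10
```

Under these assumed sync times, the 10-second deadline is noticed only at 90 seconds, after the pod has already completed, which is consistent with the events shown in the description.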

Comment 3 Michal Fojtik 2017-03-27 12:26:49 UTC

The upstream issue is not resolved yet and I don't think it will make it into 1.6, so I'm setting the target release to 3.7, which better reflects reality.

Comment 4 Maciej Szulik 2017-08-25 09:01:56 UTC

Looking at the upstream issue, it'll be part of the 3.8 release at the earliest. Moving the target accordingly.

Comment 5 Maciej Szulik 2017-11-03 12:31:05 UTC

This is waiting for https://github.com/openshift/origin/pull/17115. I doubt this will happen this sprint, so I'm adding the UpcomingSprint keyword.

Comment 6 Wang Haoran 2017-11-29 03:35:11 UTC

It works fine with:
oc v3.8.0-alpha.0+fe6445a-249
kubernetes v1.8.1+0d5291c
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.8.0-alpha.0+e6b20e1
kubernetes v1.7.6+a08f5eeb6

Comment 9 errata-xmlrpc 2018-03-28 14:05:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
