Description of problem:
When scheduling jobs with "activeDeadlineSeconds", sometimes the job is still running after the deadline has passed.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.3.1.3
Kubernetes Master: v1.3.0+52492b4

How reproducible:
Always

Steps to Reproduce:
1. Create a scheduledjob with "activeDeadlineSeconds":

$ cat sj.yaml
apiVersion: batch/v2alpha1
kind: ScheduledJob
metadata:
  labels:
    run: sj
  name: sj
spec:
  jobTemplate:
    metadata:
    spec:
      completions: 1
      parallelism: 1
      activeDeadlineSeconds: 10
      template:
        metadata:
          labels:
            run: sj
        spec:
          containers:
          - args:
            - sleep
            - "90"
            image: busybox
            imagePullPolicy: Always
            name: sj
            resources: {}
          restartPolicy: Never
  schedule: '* * * * *'
  suspend: false

$ oc create -f sj.yaml

2. Check whether the job's pod is terminated when "activeDeadlineSeconds" is reached.

Actual results:
2. Sometimes:

$ oc get job
NAME            DESIRED   SUCCESSFUL   AGE
sj-1552910142   1         0            2m
sj-1628735297   1         1            3m
sj-1628866369   1         0            30s
sj-1704691524   1         0            1m

$ oc get pod
NAME                  READY     STATUS      RESTARTS   AGE
sj-1628735297-97di6   0/1       Completed   0          3m
sj-1628866369-ima1n   1/1       Running     0          35s

$ oc describe job sj-1552910142
...
Events:
  FirstSeen  LastSeen  Count  From               SubobjectPath  Type    Reason            Message
  ---------  --------  -----  ----               -------------  ------  ------            -------
  18m        18m       1      {job-controller }                 Normal  SuccessfulCreate  Created pod: sj-1552910142-uughm
  17m        17m       1      {job-controller }                 Normal  SuccessfulDelete  Deleted pod: sj-1552910142-uughm
  17m        17m       2      {job-controller }                 Normal  DeadlineExceeded  Job was active longer than specified deadline

$ oc get pod sj-1552910142-uughm
No resources found.
Error from server: pods "sj-1552910142-uughm" not found

$ oc describe job sj-1628735297
...
Events:
  FirstSeen  LastSeen  Count  From               SubobjectPath  Type    Reason            Message
  ---------  --------  -----  ----               -------------  ------  ------            -------
  20m        20m       1      {job-controller }                 Normal  SuccessfulCreate  Created pod: sj-1628735297-97di6
  18m        18m       1      {job-controller }                 Normal  DeadlineExceeded  Job was active longer than specified deadline

$ oc get pod sj-1628735297-97di6 -o yaml
...
      finishedAt: 2016-11-18T08:10:40Z
      reason: Completed
      startedAt: 2016-11-18T08:09:10Z

Expected results:
The job should finish when "activeDeadlineSeconds" is reached and delete the pod, instead of waiting until the pod completes.
This is a known problem with activeDeadlineSeconds in jobs, which should be supported. See https://github.com/kubernetes/kubernetes/issues/32149 for more details. At this point I don't have an ETA for it yet. This only affects short ADS values, where "short" means less than 10 minutes, which is the full resync period of the job controller.
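To make the resync point concrete, here is a rough timing model, not the actual job controller code. It assumes the deadline is only evaluated when the controller syncs the job, that a sync happens on pod events and on the periodic full resync, and that resyncs are counted from the job's start. The function name and its parameters are hypothetical and exist only for illustration.

```python
import math


def first_check_after_deadline(deadline_s, event_sync_times_s, resync_period_s):
    """Earliest time (seconds since job start) at which a sync could notice
    that the deadline has passed, under the simplifying assumptions above."""
    # Assumption: periodic resyncs fire at multiples of the resync period,
    # measured from the job's start.
    next_resync = math.ceil(deadline_s / resync_period_s) * resync_period_s
    # Event-driven syncs (for example pod created, pod finished) only help
    # if they happen at or after the deadline.
    late_events = [t for t in event_sync_times_s if t >= deadline_s]
    return min(late_events + [next_resync])


# Values from this report: activeDeadlineSeconds 10, pod sleeps 90 s,
# full resync every 10 minutes (600 s).
print(first_check_after_deadline(10, [0, 90], 600))  # 90: pod completion triggers the sync
print(first_check_after_deadline(10, [0], 600))      # 600: nothing happens until the resync
```

Under this model a 10 second deadline is first checked when the pod finishes on its own at 90 seconds, which would be consistent with sj-1628735297 showing both a Completed pod and a DeadlineExceeded event. It does not explain every case in the report; see the upstream issue for the real behavior.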
The upstream issue is not resolved yet and I don't think it will make it into 1.6, so I'm setting the target release to 3.7, which better reflects reality.
Looking at the upstream issue, this will be part of the 3.8 release at the soonest. Moving the target accordingly.
This is waiting for https://github.com/openshift/origin/pull/17115. I doubt it will happen this sprint, so I'm adding the UpcomingSprint keyword.
It works fine with:

oc v3.8.0-alpha.0+fe6445a-249
kubernetes v1.8.1+0d5291c
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.8.0-alpha.0+e6b20e1
kubernetes v1.7.6+a08f5eeb6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489