Bug 1702543 - Job Controller does not work correctly as backoffLimit configured
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Maciej Szulik
QA Contact: zhou ying
Depends On:
Reported: 2019-04-24 05:32 UTC by Daein Park
Modified: 2019-06-04 10:48 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:47:56 UTC
Target Upstream Version:

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:48:03 UTC

Description Daein Park 2019-04-24 05:32:00 UTC
Description of problem:

The Job controller does not honor the configured "backoffLimit".
I found that the job controller has a known issue[0] around "backoffLimit", so I configured "restartPolicy: Never" before creating the job.
But even with this workaround configured, the job frequently stops trying to create Pods before reaching the "backoffLimit".

As a result, "backoffLimit" appears to be ignored.

[0] Pod Backoff failure policy
    Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, 
    the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.

Version-Release number of selected component (if applicable):

openshift v3.11.98
kubernetes v1.11.0+d4cacc0

How reproducible:

You can reproduce this issue by creating the following job resource 2 to 3 times.
After each run, you can check how many pods the job tried to create to process the workload with "oc describe job".

1> create testing project
# oc new-project job-test

2> create the following job for test

# oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
      - name: test-job
        image: perl
        command: ["perl",  "-e", 'sleep 10; print "working\n"; exit 89']
      restartPolicy: Never
EOF

3> verify the failed count. Look at the "Pods Statuses" section: the failed count is not "6" even though "backoffLimit" is configured as "6".
   Sometimes it might be "6" if you create the job again.

# oc describe job test-job 
Name:           test-job
Namespace:      job-test
Parallelism:    1
Completions:    1
Start Time:     Tue, 23 Apr 2019 05:18:38 -0400
Pods Statuses:  0 Running / 0 Succeeded / 5 Failed

Repeat steps "2>" and "3>" as needed to reproduce this issue.
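
The create-and-check cycle in steps "2>" and "3>" could be scripted roughly as follows. This is a minimal sketch, not part of the original report: it assumes the job manifest from step "2>" has been saved as test-job.yaml (a hypothetical file name), that "oc" is already logged in to the job-test project, and that polling the job's Failed condition is an acceptable way to wait. The helper names are illustrative only.

```shell
#!/bin/bash
# Sketch only: helper names and test-job.yaml are assumptions, not from the report.

# Print the number of failed pods recorded in the job status (0 if unset).
job_failed_count() {
  local count
  count=$(oc get job "$1" -o jsonpath='{.status.failed}')
  echo "${count:-0}"
}

# Succeed once the job carries a Failed condition whose status is "True".
job_has_failed() {
  local status
  status=$(oc get job "$1" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
  [ "$status" = "True" ]
}

# Create the job, wait until it is marked Failed, print the failed-pod
# count, then delete the job so the next run starts clean.
# The test job always exits 89, so the wait loop is expected to end.
run_once() {
  local manifest=$1 job=$2
  oc create -f "$manifest" >/dev/null
  until job_has_failed "$job"; do sleep 10; done
  job_failed_count "$job"
  oc delete job "$job" >/dev/null
}
```

A possible invocation is "for i in 1 2 3; do run_once test-job.yaml test-job; done", which prints one failed-pod count per run so that runs stopping short of the expected count stand out.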

Steps to Reproduce:

Actual results:

Although "backoffLimit" is "6", the job frequently stops creating pods before it has tried "6" times.
It looks like "backoffLimit" is ignored.

Expected results:

If the job fails, it should always keep creating pods until the configured "backoffLimit" is reached.
"backoffLimit" should not be ignored while the job is failing.

Additional info:

There are known issues around "backoffLimit", as follows.

* Pod Backoff failure policy
  Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, 
  the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.

  [1] Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure 

* Backoff policy and failed pod limit

Comment 1 Maciej Szulik 2019-04-24 08:35:41 UTC
There's https://github.com/kubernetes/kubernetes/pull/67859 which should solve the issue, but that will be available in 4.1,
since it's part of k8s 1.13. Based on that I'm setting target release 4.1 and moving to qa.

Comment 5 zhou ying 2019-04-25 01:28:25 UTC
I can still reproduce the issue with this OCP version:
[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"74c534b60", GitTreeState:"", BuildDate:"2019-04-21T21:13:18Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+81fc896", GitCommit:"81fc896", GitTreeState:"clean", BuildDate:"2019-04-21T23:18:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Payload: 4.1.0-0.nightly-2019-04-22-192604

[root@dhcp-140-138 ~]# oc get job test-job -o yaml 
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2019-04-25T01:11:28Z"
  labels:
    controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
    job-name: test-job
  name: test-job
  namespace: test6
  resourceVersion: "789545"
  selfLink: /apis/batch/v1/namespaces/test6/jobs/test-job
  uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
        job-name: test-job
    spec:
      containers:
      - command:
        - perl
        - -e
        - sleep 10; print "working\n"; exit 89
        image: perl
        imagePullPolicy: Always
        name: test-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2019-04-25T01:15:51Z"
    lastTransitionTime: "2019-04-25T01:15:51Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5
  startTime: "2019-04-25T01:11:28Z"
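
The "failed: 5" value in the job status can be cross-checked against the pods themselves, since the pod template carries the "job-name" label shown above. The following is a sketch under assumptions: the function name is hypothetical, and it relies on the job's failed pods still existing in the current project when it runs.

```shell
#!/bin/bash
# Sketch only: count_failed_pods is an illustrative name, not from the report.

# Count the pods that belong to the given job and are in phase Failed.
count_failed_pods() {
  oc get pods -l "job-name=$1" --field-selector=status.phase=Failed -o name \
    | wc -l | tr -d ' '
}
```

Calling "count_failed_pods test-job" prints a single number that can be compared with both "failed" in the job status and the configured "backoffLimit".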

Comment 7 Maciej Szulik 2019-04-25 13:15:22 UTC
The mechanism underneath is not a strong guarantee but rather a best effort, so the number of failures may sometimes not reach
the desired value, as mentioned in https://github.com/kubernetes/kubernetes/issues/64787 and https://github.com/kubernetes/kubernetes/issues/70251.
In my tests I reached the number every time and never had issues, but that's just author's luck. 

Please retest and report failure rate and accept it if it's ~80%.

Comment 8 zhou ying 2019-04-26 03:04:26 UTC
Retested, and the failure rate is less than 20%:

The first time, I created 7 jobs and only 1 failed;
the second time, I created 7 jobs and all succeeded. So I will accept it.

[zhouying@dhcp-140-138 ~]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"44e89e525", GitTreeState:"clean", BuildDate:"2019-04-25T22:42:17Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5c41ab6", GitCommit:"5c41ab6", GitTreeState:"clean", BuildDate:"2019-04-24T20:42:36Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
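
For reference, the retest tally above works out as follows. This is an illustrative sketch only: the function name is hypothetical, a "failure" is counted the way the tally counts it (1 of 7 jobs in the first batch, 0 of 7 in the second), and integer division rounds down.

```shell
#!/bin/bash
# Sketch only: failure_rate_pct is an illustrative helper, not from the report.

# Print the integer percentage of failed jobs out of the total created.
failure_rate_pct() {
  local failed=$1 total=$2
  echo $(( failed * 100 / total ))
}

failure_rate_pct 1 7    # first batch alone: prints 14
failure_rate_pct 1 14   # both batches combined: prints 7
```

Either way of counting stays below the 20% figure quoted in the retest.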

Comment 10 errata-xmlrpc 2019-06-04 10:47:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

