Bug 1702543

Summary: Job Controller does not work correctly as backoffLimit configured
Product: OpenShift Container Platform
Reporter: Daein Park <dapark>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: medium
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, jokerman, maszulik, mmccomas, yinzhou
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2019-06-04 10:47:56 UTC
Type: Bug

Description Daein Park 2019-04-24 05:32:00 UTC
Description of problem:

The Job controller does not behave as the configured "backoffLimit" specifies.
The Job controller has a known issue[0] around "backoffLimit", so I set "restartPolicy: Never" before creating the job.
However, even with this workaround configured, the job frequently stops creating new pods before reaching the "backoffLimit".

As a result, "backoffLimit" appears to be ignored.


[0] Pod Backoff failure policy
    [https://v1-11.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/]
    ~~~
    Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, 
    the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.
    ~~~

Version-Release number of selected component (if applicable):

openshift v3.11.98
kubernetes v1.11.0+d4cacc0

How reproducible:

You can reproduce this issue by creating the following Job resource two or three times.
Then, from "oc describe job", you can check how many pods each job tried to create to process the workload.

1> Create a test project
# oc new-project job-test

2> Create the following job for the test

# oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
      - name: test-job
        image: perl
        command: ["perl",  "-e", 'sleep 10; print "working\n"; exit 89']
      restartPolicy: Never
EOF

3> Verify the failed count. In the "Pods Statuses" section, the failed count is not "6" even though "backoffLimit" is set to "6".
   Sometimes it may reach "6" if you create the job again.

# oc describe job test-job 
Name:           test-job
Namespace:      job-test
...snip...
Parallelism:    1
Completions:    1
Start Time:     Tue, 23 Apr 2019 05:18:38 -0400
Pods Statuses:  0 Running / 0 Succeeded / 5 Failed
...snip...

You may need to repeat steps "2>" and "3>" a few times to reproduce this issue; the sketch below automates one such cycle.
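
A rough, untested script along these lines could be used, assuming the Job manifest from step "2>" is saved as test-job.yaml and relying on the pod always exiting 89, so the job eventually reports a Failed condition:

#!/bin/bash
# delete any previous run and recreate the job
oc delete job test-job --ignore-not-found
oc create -f test-job.yaml
# poll until the job reports a terminal Failed condition
while [ "$(oc get job test-job -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')" != "True" ]; do
  sleep 10
done
# print how many pods actually failed before the controller gave up
echo "failed pods: $(oc get job test-job -o jsonpath='{.status.failed}')"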


Steps to Reproduce:
See the "How reproducible" section above.

Actual results:

Although "backoffLimit" is "6", frequently the pod does not create a pod to process job until "6" times.
It looks like "backoffLimit" ignored

Expected results:

When a pod fails, the job should keep creating replacement pods until the configured "backoffLimit" is reached.
"backoffLimit" should not be ignored while the job is failing.

Additional info:

There are existing upstream references around "backoffLimit", as follows.

* Pod Backoff failure policy
  [https://v1-11.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/]
  ~~~
  Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, 
  the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.
  ~~~

  [1] Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure 
      [https://github.com/kubernetes/kubernetes/issues/54870]

* Backoff policy and failed pod limit
  [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/job.md#backoff-policy-and-failed-pod-limit]

Comment 1 Maciej Szulik 2019-04-24 08:35:41 UTC
There's https://github.com/kubernetes/kubernetes/pull/67859 which should solve the issue, but that will only be available in 4.1,
since it's part of Kubernetes 1.13. Based on that I'm setting the target release to 4.1 and moving this to QA.

Comment 5 zhou ying 2019-04-25 01:28:25 UTC
I can still reproduce the issue with this OCP version:
[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"74c534b60", GitTreeState:"", BuildDate:"2019-04-21T21:13:18Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+81fc896", GitCommit:"81fc896", GitTreeState:"clean", BuildDate:"2019-04-21T23:18:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Payload: 4.1.0-0.nightly-2019-04-22-192604

[root@dhcp-140-138 ~]# oc get job test-job -o yaml 
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2019-04-25T01:11:28Z"
  labels:
    controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
    job-name: test-job
  name: test-job
  namespace: test6
  resourceVersion: "789545"
  selfLink: /apis/batch/v1/namespaces/test6/jobs/test-job
  uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
        job-name: test-job
    spec:
      containers:
      - command:
        - perl
        - -e
        - sleep 10; print "working\n"; exit 89
        image: perl
        imagePullPolicy: Always
        name: test-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2019-04-25T01:15:51Z"
    lastTransitionTime: "2019-04-25T01:15:51Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5
  startTime: "2019-04-25T01:11:28Z"

Comment 7 Maciej Szulik 2019-04-25 13:15:22 UTC
The mechanism underneath is not a strong guarantee but rather a best effort, so the failure count may sometimes not reach
the desired number, as mentioned in https://github.com/kubernetes/kubernetes/issues/64787 and https://github.com/kubernetes/kubernetes/issues/70251.
In my tests I reached the number every time and never had issues, but that's just the author's luck.

Please retest and report the failure rate; accept the fix if the job reaches the backoff limit roughly 80% of the time or more, for example using something like the sketch below.
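
Untested sketch for measuring that rate, assuming the Job manifest from the description is saved as test-job.yaml and that every run ends with the Failed condition (the pod always exits 89):

#!/bin/bash
RUNS=10; HIT=0
for i in $(seq 1 "$RUNS"); do
  # recreate the job from scratch for each run
  oc delete job test-job --ignore-not-found
  sleep 5   # give the deletion a moment to complete before recreating
  oc create -f test-job.yaml
  # wait for the terminal Failed condition
  while [ "$(oc get job test-job -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')" != "True" ]; do
    sleep 10
  done
  failed=$(oc get job test-job -o jsonpath='{.status.failed}')
  echo "run $i: failed=$failed"
  # count runs that actually reached the configured backoffLimit of 6
  [ "$failed" -ge 6 ] && HIT=$((HIT + 1))
done
echo "reached the backoff limit in $HIT out of $RUNS runs"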

Comment 8 zhou ying 2019-04-26 03:04:26 UTC
Retested and the failure rate is less than 20%:

The first time I created 7 jobs, only 1 failed to reach the backoff limit;
the second time I created 7 jobs, all of them reached it. So I will accept it.

[zhouying@dhcp-140-138 ~]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"44e89e525", GitTreeState:"clean", BuildDate:"2019-04-25T22:42:17Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5c41ab6", GitCommit:"5c41ab6", GitTreeState:"clean", BuildDate:"2019-04-24T20:42:36Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Comment 10 errata-xmlrpc 2019-06-04 10:47:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758