Description of problem:
The Job controller does not honor the configured backoffLimit. There is a known issue [0] around backoffLimit when restartPolicy is "OnFailure", so I set "restartPolicy: Never" before creating the job, as the documentation recommends. Even with that workaround, the job frequently stops creating pods before reaching the backoffLimit, so backoffLimit appears to be ignored.

[0] Pod Backoff failure policy
https://v1-11.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
~~~
Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.
~~~

Version-Release number of selected component (if applicable):
openshift v3.11.98
kubernetes v1.11.0+d4cacc0

How reproducible:
Reproducible by creating the job resource below two or three times; "oc describe job" then shows how many pods the job actually tried to create for the workload.

1> Create a test project

# oc new-project job-test

2> Create the following job

# oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
      - name: test-job
        image: perl
        command: ["perl", "-e", 'sleep 10; print "working\n"; exit 89']
      restartPolicy: Never
EOF

3> Verify the failed count. In the "Pods Statuses" section the failed count is not "6" even though backoffLimit is set to "6". It is sometimes "6" if you create the job again.

# oc describe job test-job
Name:           test-job
Namespace:      job-test
...snip...
Parallelism:    1
Completions:    1
Start Time:     Tue, 23 Apr 2019 05:18:38 -0400
Pods Statuses:  0 Running / 0 Succeeded / 5 Failed
...snip...

Repeat steps 2> and 3> to reproduce the issue.

Steps to Reproduce:
1.
2.
3.

Actual results:
Although backoffLimit is "6", the job frequently stops creating pods before it has tried "6" times. backoffLimit appears to be ignored.

Expected results:
When the job's pods fail, the job should keep creating pods up to the configured backoffLimit. backoffLimit should not be ignored while the job is failing.

Additional info:
There are existing issues around backoffLimit:

* Pod Backoff failure policy
https://v1-11.docs.kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
~~~
Note: Due to a known issue #54870[1], when the .spec.template.spec.restartPolicy field is set to “OnFailure”, the back-off limit may be ineffective. As a short-term workaround, set the restart policy for the embedded template to “Never”.
~~~

[1] Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure
https://github.com/kubernetes/kubernetes/issues/54870

* Backoff policy and failed pod limit
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/job.md#backoff-policy-and-failed-pod-limit
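For step 3> above, a quick way to count what the controller actually created is sketched below. This is not part of the original report; it assumes the failed pods have not been cleaned up yet and relies on the standard job-name label the Job controller adds to its pods.

~~~
# Each retry with restartPolicy: Never is a new pod, so per the report's
# expectation both counts should eventually reach the backoffLimit of 6.
oc get pods -l job-name=test-job --no-headers | wc -l
oc get job test-job -o jsonpath='{.status.failed}{"\n"}'
~~~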
There's https://github.com/kubernetes/kubernetes/pull/67859 which should solve the issue, but that will only be available in 4.1, since it's part of k8s 1.13. Based on that I'm setting the target release to 4.1 and moving to QA.
I can still reproduce the issue with this OCP version:

[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"74c534b60", GitTreeState:"", BuildDate:"2019-04-21T21:13:18Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+81fc896", GitCommit:"81fc896", GitTreeState:"clean", BuildDate:"2019-04-21T23:18:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Payload: 4.1.0-0.nightly-2019-04-22-192604

[root@dhcp-140-138 ~]# oc get job test-job -o yaml
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2019-04-25T01:11:28Z"
  labels:
    controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
    job-name: test-job
  name: test-job
  namespace: test6
  resourceVersion: "789545"
  selfLink: /apis/batch/v1/namespaces/test6/jobs/test-job
  uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 0d819ac4-66f7-11e9-ad9c-068b1b1786bc
        job-name: test-job
    spec:
      containers:
      - command:
        - perl
        - -e
        - sleep 10; print "working\n"; exit 89
        image: perl
        imagePullPolicy: Always
        name: test-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2019-04-25T01:15:51Z"
    lastTransitionTime: "2019-04-25T01:15:51Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5
  startTime: "2019-04-25T01:11:28Z"
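For reference, the failed count and terminal condition shown in the status above can be pulled out directly; a small sketch (field names taken from the YAML, namespace test6 as in the output):

~~~
# Print the failed-pod count and the reason the Job was marked Failed
# (BackoffLimitExceeded in the status above).
oc get job test-job -n test6 \
  -o jsonpath='failed={.status.failed} reason={.status.conditions[?(@.type=="Failed")].reason}{"\n"}'
~~~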
The mechanism underneath is not a strong guarantee but rather a best effort, so the failed count may sometimes stop short of the desired number, as mentioned in https://github.com/kubernetes/kubernetes/issues/64787 and https://github.com/kubernetes/kubernetes/issues/70251. In my tests I reached the number every time and never had an issue, but that may just be luck. Please retest and report the failure rate; accept if the success rate is around 80% or better.
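If it helps with the retest, a rough loop to measure the rate across several runs is sketched below. This is only a sketch, not part of the verification steps; it assumes the Job manifest from the description is saved as test-job.yaml (filename for illustration only) and that 300 seconds is long enough for the controller to exhaust its back-off retries.

~~~
#!/bin/bash
# Recreate the test Job several times and count how often the controller
# stops short of the expected number of failed pods.
runs=7
short=0
for i in $(seq 1 "$runs"); do
  oc delete job test-job --ignore-not-found >/dev/null
  oc create -f test-job.yaml >/dev/null
  sleep 300
  failed=$(oc get job test-job -o jsonpath='{.status.failed}')
  failed=${failed:-0}
  echo "run $i: failed=$failed"
  # per the report's expectation, each run should record 6 failed pods
  if [ "$failed" -lt 6 ]; then
    short=$((short + 1))
  fi
done
echo "runs stopping short of backoffLimit: $short/$runs"
~~~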
Retested and the failure rate is less than 20%: the first time I created 7 jobs and only 1 failed; the second time I created 7 jobs and all succeeded. So I will accept it.

[zhouying@dhcp-140-138 ~]$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"44e89e525", GitTreeState:"clean", BuildDate:"2019-04-25T22:42:17Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5c41ab6", GitCommit:"5c41ab6", GitTreeState:"clean", BuildDate:"2019-04-24T20:42:36Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758