Bug 1565048

Summary: failedJobsHistoryLimit field does not work as expected in a cron job
Product: OpenShift Container Platform
Reporter: Fatima <fshaikh>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED NOTABUG
QA Contact: Wang Haoran <haowang>
Severity: medium
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, byount, jokerman, maszulik, mfojtik, mmccomas, sgaikwad
Target Milestone: ---
Keywords: Reopened
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2018-05-07 13:45:08 UTC
Type: Bug

Description Fatima 2018-04-09 08:59:25 UTC
Description of problem:
The failedJobsHistoryLimit field does not work as expected. The pods created by the cron job show ContainerCannotRun.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:

1. Create a cron job using the following cronjob.yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure

  Note: we are deliberately using perl5 (an invalid command) so that at least 2 failed jobs are retained.

2. oc create -f cronjob.yaml

3. oc get jobs -w
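
As a hedged aside, the retained jobs and the pods they created can be inspected with standard oc commands (<job-name> below is a placeholder for one of the generated job names):

# watch the jobs the cron job keeps creating
oc get jobs -w

# inspect the pods and the recorded failure reason for one of the jobs
oc get pods
oc describe job <job-name>

# view the job references the cron job currently retains in its status
oc get cronjob cron-job-test -o yaml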

Actual results:

Failed jobs keep accumulating instead of being capped at 2 (as specified in the YAML definition).

Expected results:

2 failed jobs should be maintained.

Additional info:

The customer is testing this feature and came across this.

Comment 2 Maciej Szulik 2018-04-11 15:14:00 UTC
This is working as expected. When you set the restartPolicy to OnFailure, the kubelet is responsible for restarting the pod. In other words, it will retry the pod about 6 times, each with a longer delay (10s, 20s, etc.); see [1] for details.
This means it takes longer for the pod to actually fail (the controller does not treat CrashLoopBackOff as a failure). But once the pod reaches the failed state (RunContainerError), the controller ensures there are no more than 2 failed jobs.

When you set the restart policy to Never, you can tweak the backoffLimit parameter instead; there are more details on the topic in [2].


[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
[2] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures
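
As a hedged illustration of the advice above (not part of the original report), the reporter's manifest could be adjusted to restartPolicy: Never with a backoffLimit so the jobs fail fast and failedJobsHistoryLimit can be observed directly; the backoffLimit value of 1 is an assumption:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      backoffLimit: 1             # allow at most one retry so the job reaches the Failed state quickly
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5", "-Mbignum=bpi", "-wle", "print bpi(2000)"]   # intentionally wrong binary, as in the report
          restartPolicy: Never    # let the job controller count pod failures against backoffLimit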

Comment 9 Maciej Szulik 2018-05-10 09:58:42 UTC
Here's a sample failing cronjob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  failedJobsHistoryLimit: 1
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: hello
          labels:
            job: test
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh",  "-c", "exit 1"]
          restartPolicy: Never

Notice two elements:
- backoffLimit, which allows the job to fail fast
- the command, which causes the container to fail
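
To try it, a hedged usage sketch (the file name hello-cronjob.yaml is an assumption):

# create the cron job and watch the jobs it spawns; after a few minutes
# only 1 failed job should remain, matching failedJobsHistoryLimit above
oc create -f hello-cronjob.yaml
oc get jobs -w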