Bug 1565048 - failedJobsHistoryLimit field does not work as expected in a cron job
Summary: failedJobsHistoryLimit field does not work as expected in a cron job
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Maciej Szulik
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-09 08:59 UTC by Fatima
Modified: 2018-05-10 09:58 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-07 13:45:08 UTC
Target Upstream Version:


Attachments

Description Fatima 2018-04-09 08:59:25 UTC
Description of problem:
The failedJobsHistoryLimit field does not work as expected. The pods created by the cron job show ContainerCannotRun.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:

1. Create a cron job via cronjob.yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure

  Note: we are intentionally using perl5 (a wrong binary name) so that the jobs fail and at least 2 failed jobs are produced.

2. oc create -f cronjob.yaml

3. oc get jobs -w

Actual results:

Failed jobs keep accumulating instead of only 2 being retained (as specified in the YAML definition).

Expected results:

Only 2 failed jobs should be retained.

Additional info:

The customer came across this while testing the feature.

Comment 2 Maciej Szulik 2018-04-11 15:14:00 UTC
This is working as expected. When you set the restartPolicy to OnFailure, the kubelet is responsible for restarting the pod. In other words, it will retry the pod about 6 times, each time with a longer delay (10s, 20s, etc.); see [1] for details.
This means it will take longer for the pod to actually fail (the controller does not treat CrashLoopBackOff as a failure). But once the pod reaches the failed state (RunContainerError), the controller ensures there are no more than 2 failed jobs.
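
For reference, a quick way to observe this backoff behaviour (assuming the reproducer from the description) is to watch the pods created by the cron job; the STATUS column will cycle through CrashLoopBackOff until the pod finally fails:

oc get pods -w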

When you set the restart policy to Never, you can tweak the backoffLimit parameter; there are more details about the topic in [2].
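
As a minimal sketch (backoffLimit and restartPolicy are real Job API fields; the values here are only illustrative), the relevant part of the jobTemplate would look like:

jobTemplate:
  spec:
    backoffLimit: 1           # fail the Job quickly instead of retrying up to the default 6 times
    template:
      spec:
        restartPolicy: Never  # let the Job controller, not the kubelet, handle the failures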


[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
[2] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures

Comment 9 Maciej Szulik 2018-05-10 09:58:42 UTC
Here's a sample failing cronjob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  failedJobsHistoryLimit: 1
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: hello
          labels:
            job: test
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh",  "-c", "exit 1"]
          restartPolicy: Never

Notice two elements:
- backoffLimit, which allows the job to fail fast
- the command, which causes the container to fail
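
To exercise it, the same commands from the original report apply; after a few minutes only one failed Job should remain, matching failedJobsHistoryLimit: 1:

oc create -f cronjob.yaml
oc get jobs -w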

