Bug 1565048

Summary: failedJobsHistoryLimit field does not work as expected in a cron job
Product: OpenShift Container Platform
Reporter: Fatima <fshaikh>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED NOTABUG
QA Contact: Wang Haoran <haowang>
Severity: medium
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, byount, jokerman, maszulik, mfojtik, mmccomas, sgaikwad
Target Milestone: ---
Keywords: Reopened
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2018-05-07 13:45:08 UTC
Type: Bug

Description Fatima 2018-04-09 08:59:25 UTC
Description of problem:
The failedJobsHistoryLimit field does not work as expected. The pods created by the cron job show ContainerCannotRun.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:

1. Create a cron job using the following cronjob.yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure

  Note: we are deliberately using perl5 (an invalid command) so that at least 2 failed jobs are retained.

2. oc create -f cronjob.yaml

3. oc get jobs -w
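
As a hedged aside, the retained jobs and the pods they created can be inspected with standard oc commands (<job-name> below is a placeholder for one of the generated job names):

# watch the jobs the cron job keeps creating
oc get jobs -w

# inspect the pods and the recorded failure reason for one of the jobs
oc get pods
oc describe job <job-name>

# view the job references the cron job currently retains in its status
oc get cronjob cron-job-test -o yaml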

Actual results:

Failed jobs keep accumulating instead of being capped at 2 (as specified in the YAML definition).

Expected results:

2 failed jobs should be maintained.

Additional info:

The customer is testing this feature and came across this.

Comment 2 Maciej Szulik 2018-04-11 15:14:00 UTC
This is working as expected. When you set the restartPolicy to OnFailure, the kubelet is responsible for restarting the pod. In other words, it will retry the pod about 6 times, each with a longer delay (10s, 20s, etc.); see [1] for details.
This means it takes longer for the pod to actually fail (the controller does not treat CrashLoopBackOff as a failure). But once the pod reaches the failed state (RunContainerError), the controller ensures there are no more than 2 failed jobs.

When you set the restart policy to Never, you can tweak the backoffLimit parameter instead; there are more details on the topic in [2].


[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
[2] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures
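
As a hedged illustration of the advice above (not part of the original report), the reporter's manifest could be adjusted to restartPolicy: Never with a backoffLimit so the jobs fail fast and failedJobsHistoryLimit can be observed directly; the backoffLimit value of 1 is an assumption:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      backoffLimit: 1             # allow at most one retry so the job reaches the Failed state quickly
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5", "-Mbignum=bpi", "-wle", "print bpi(2000)"]   # intentionally wrong binary, as in the report
          restartPolicy: Never    # let the job controller count pod failures against backoffLimit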

Comment 9 Maciej Szulik 2018-05-10 09:58:42 UTC
Here's a sample failing cronjob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  failedJobsHistoryLimit: 1
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: hello
          labels:
            job: test
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh",  "-c", "exit 1"]
          restartPolicy: Never

Notice two elements:
- backoffLimit, which allows the job to fail fast
- the command, which causes the container to fail
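
To try it, a hedged usage sketch (the file name hello-cronjob.yaml is an assumption):

# create the cron job and watch the jobs it spawns; after a few minutes
# only 1 failed job should remain, matching failedJobsHistoryLimit above
oc create -f hello-cronjob.yaml
oc get jobs -w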