Bug 1565048 - failedJobsHistoryLimit field does not work as expected in a cron job
Summary: failedJobsHistoryLimit field does not work as expected in a cron job
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Maciej Szulik
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-09 08:59 UTC by Fatima
Modified: 2018-05-10 09:58 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-07 13:45:08 UTC
Target Upstream Version:


Attachments

Description Fatima 2018-04-09 08:59:25 UTC
Description of problem:
The failedJobsHistoryLimit field does not work as expected. The pods created by the cron job show ContainerCannotRun.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:

1. Create a cron job via cronjob.yaml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-job-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 4
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cron-job-test
            image: perl
            command: ["perl5",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure

  Note: we are intentionally using perl5 (a wrong binary name) so that the jobs fail and at least 2 failed jobs are produced.

2. oc create -f cronjob.yaml

3. oc get jobs -w

Actual results:

Failed jobs keep accumulating instead of only 2 being retained (as specified in the YAML definition).

Expected results:

Only 2 failed jobs should be retained.

Additional info:

The customer came across this while testing the feature.

Comment 2 Maciej Szulik 2018-04-11 15:14:00 UTC
This is working as expected. When you set the restartPolicy to OnFailure, the kubelet is responsible for restarting the pod. In other words, it will retry the pod about 6 times, each time with a longer delay (10s, 20s, etc.); see [1] for details.
This means it will take longer for the pod to actually fail (the controller does not treat CrashLoopBackOff as a failure). But once the pod reaches the failed state (RunContainerError), the controller ensures there are no more than 2 failed jobs.
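
For reference, a quick way to observe this backoff behaviour (assuming the reproducer from the description) is to watch the pods created by the cron job; the STATUS column will cycle through CrashLoopBackOff until the pod finally fails:

oc get pods -w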

When you set the restart policy to Never, you can tweak the backoffLimit parameter; there are more details about the topic in [2].
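
As a minimal sketch (backoffLimit and restartPolicy are real Job API fields; the values here are only illustrative), the relevant part of the jobTemplate would look like:

jobTemplate:
  spec:
    backoffLimit: 1           # fail the Job quickly instead of retrying up to the default 6 times
    template:
      spec:
        restartPolicy: Never  # let the Job controller, not the kubelet, handle the failures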


[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
[2] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#handling-pod-and-container-failures

Comment 9 Maciej Szulik 2018-05-10 09:58:42 UTC
Here's a sample failing cronjob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  failedJobsHistoryLimit: 1
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: hello
          labels:
            job: test
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh",  "-c", "exit 1"]
          restartPolicy: Never

Notice two elements:
- backoffLimit, which allows the job to fail fast
- the command, which causes the container to fail
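
To exercise it, the same commands from the original report apply; after a few minutes only one failed Job should remain, matching failedJobsHistoryLimit: 1:

oc create -f cronjob.yaml
oc get jobs -w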

