Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1920182

Summary: Job pods orphaned after Job deleted
Product: OpenShift Container Platform
Reporter: brad.williams
Component: kube-controller-manager
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: medium
Docs Contact:
Priority: low
Version: 4.6
CC: aos-bugs, jchaloup, jupierce, maszulik, mfojtik
Target Milestone: ---
Target Release: 4.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-23 18:29:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                 Flags
Release Creation Job definition             none
Release Creation Job Pod definition         none
Release Verification Job definition         none
Release Verification Job Pod definition     none

Description brad.williams 2021-01-25 17:31:34 UTC
Description of problem:
We currently see a large number of "Completed" pods, created by jobs, accumulating after the parent Job has been deleted.

Version-Release number of selected component (if applicable):
$ app get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.13    True        False         29h     Cluster version is 4.6.13


How reproducible:
There were a large number (hundreds) of orphaned pods when we initially noticed the issue. While we were gathering enough information to open this BZ, something purged the system. Since then, the number of orphaned pods has been steadily increasing.


Steps to Reproduce:
1. Create a release creation/verification job
2. Job eventually gets deleted
3. Observe orphaned pod

Actual results:
Pods are orphaned when their parent Job is deleted.

Expected results:
Pods created by a Job should be deleted when the Job is deleted.

Additional information:
The pods do have an "ownerRef" specified that points to the corresponding Job that created them.

Comment 1 brad.williams 2021-01-25 17:33:36 UTC
Created attachment 1750643 [details]
Release Creation Job definition

Comment 2 brad.williams 2021-01-25 17:34:03 UTC
Created attachment 1750644 [details]
Release Creation Job Pod definition

Comment 3 brad.williams 2021-01-25 17:34:40 UTC
Created attachment 1750645 [details]
Release Verification Job definition

Comment 4 brad.williams 2021-01-25 17:35:22 UTC
Created attachment 1750646 [details]
Release Verification Job Pod definition

Comment 7 Maciej Szulik 2021-02-05 14:27:11 UTC
I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level.

Comment 8 Michal Fojtik 2021-03-07 15:26:52 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 9 brad.williams 2021-03-10 22:49:55 UTC
I still see evidence of this issue occurring on our CI cluster:

$ oc --context app.ci get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        False         2d6h    Cluster version is 4.7.1

$ oc --context app.ci -n ci-release get pods 4.9.0-0.ci-2021-03-09-043953-ln5t8 -o json | jq '.metadata.labels["job-name"]'
"4.9.0-0.ci-2021-03-09-043953"

$ oc --context app.ci -n ci-release get jobs 4.9.0-0.ci-2021-03-09-043953
Error from server (NotFound): jobs.batch "4.9.0-0.ci-2021-03-09-043953" not found
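
The manual check above (look up a pod's Job, find that the Job is gone) could be applied to a whole namespace at once. The following is only a minimal sketch: it assumes the input is the parsed output of `oc get pods -o json` and `oc get jobs -o json` for the same namespace, and the function name `find_orphaned_job_pods`, the UIDs, and the second pod are hypothetical sample values, not data from this cluster.

```python
def find_orphaned_job_pods(pod_list, job_list):
    """Return names of pods whose owning Job no longer exists.

    Both arguments are parsed `-o json` List objects (dicts with an "items" key).
    """
    # UIDs of the Jobs that still exist in the namespace.
    live_job_uids = {job["metadata"]["uid"] for job in job_list.get("items", [])}

    orphaned = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        for ref in meta.get("ownerReferences", []):
            # Only owners of kind Job matter here; the pod counts as orphaned
            # when that owner's UID is not among the live Jobs.
            if ref.get("kind") == "Job" and ref.get("uid") not in live_job_uids:
                orphaned.append(meta["name"])
                break
    return orphaned


# Hypothetical sample data shaped like `oc get ... -o json` output.
pods = {
    "items": [
        {
            "metadata": {
                "name": "4.9.0-0.ci-2021-03-09-043953-ln5t8",
                "ownerReferences": [{"kind": "Job", "uid": "uid-of-deleted-job"}],
            }
        },
        {
            "metadata": {
                "name": "still-owned-abcde",
                "ownerReferences": [{"kind": "Job", "uid": "uid-of-live-job"}],
            }
        },
    ]
}
jobs = {"items": [{"metadata": {"name": "still-owned", "uid": "uid-of-live-job"}}]}

print(find_orphaned_job_pods(pods, jobs))
```

Matching on the owner UID rather than on the `job-name` label avoids false negatives when a new Job reuses the name of a deleted one.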

Comment 10 Michal Fojtik 2021-03-10 23:07:26 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 11 ravig 2021-04-09 12:22:19 UTC
Spent some time looking into this bug this sprint. I was initially waiting on someone to increase log levels for kube-controller-manager. Finally, I was given cluster-admin privileges this week. I'll look into it more next sprint.

Comment 12 Michal Fojtik 2021-05-09 13:14:28 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 13 Maciej Szulik 2021-05-19 13:39:02 UTC
This was accidentally marked as blocker+ and popped up today while reviewing 4.8 blocker bugs. I'm dropping blocker+ from this bug, since it is neither a regression nor blocking the operation of a cluster. We'll work on it as soon as possible, but it doesn't deserve blocker status per se.

Comment 14 Michal Fojtik 2021-06-18 14:29:45 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 16 Michal Fojtik 2021-08-17 15:53:28 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 17 Michal Fojtik 2021-09-16 16:00:54 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 19 Michal Fojtik 2021-11-24 15:09:35 UTC
The LifecycleStale keyword was removed because the bug moved to QE.
The bug assignee was notified.

Comment 23 Maciej Szulik 2022-07-27 14:04:51 UTC
Jan, https://github.com/kubernetes/enhancements/issues/592 is what was discussed in the past. It allows job authors to set a field,
`.spec.ttlSecondsAfterFinished`, which cleans up the job a given time after its completion. This went GA in k8s 1.23, i.e. OCP 4.10, and that's the correct way
to clean up the jobs.
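
As an illustration of the approach described here, a Job manifest that opts into TTL-based cleanup might be generated as below. This is a sketch under assumptions: the helper `job_with_ttl`, the Job name, the image, the command, and the one-hour TTL are all hypothetical and are not taken from the attached release job definitions. Only `spec.ttlSecondsAfterFinished` is the field under discussion.

```python
import json


def job_with_ttl(name, image, command, ttl_seconds):
    """Build a batch/v1 Job manifest that is cleaned up after it finishes."""
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Seconds after the Job finishes (Complete or Failed) before it
            # becomes eligible for automatic deletion.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical values, for illustration only.
manifest = job_with_ttl(
    "example-release-job",
    "registry.example.com/tools:latest",
    ["/bin/true"],
    3600,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON should be usable as input to `oc apply -f -`, since `oc` accepts JSON manifests as well as YAML.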

Comment 24 Jan Chaloupka 2022-08-02 06:56:17 UTC
Thank you, Maciej, for the confirmation.

> From the godoc, of `ttlSecondsAfterFinished`, it sounds like we need to leave this unset to continue performing our GC this way.

Brad, would it help to set `ttlSecondsAfterFinished` to a value higher than the GC's cleaning period?
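
To reason about that question, it may help to compute when a finished Job would become eligible for TTL cleanup and compare that with the schedule of the custom GC. The sketch below assumes the finish time can be read from the `lastTransitionTime` of the Job's `Complete` or `Failed` condition, and that timestamps have whole-second precision in UTC. The helper names and the sample Job are hypothetical, and the custom GC's period is not stated in this bug.

```python
from datetime import datetime, timedelta, timezone


def finish_time(job):
    """Return when the Job finished, or None if it has not finished yet."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("type") in ("Complete", "Failed") and cond.get("status") == "True":
            # Timestamps are assumed to look like 2022-08-08T06:00:00Z.
            parsed = datetime.strptime(cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


def ttl_expiry(job):
    """Return when the Job becomes eligible for TTL cleanup, or None."""
    ttl = job.get("spec", {}).get("ttlSecondsAfterFinished")
    finished = finish_time(job)
    if ttl is None or finished is None:
        return None
    return finished + timedelta(seconds=ttl)


# Hypothetical finished Job with a two-hour TTL.
job = {
    "spec": {"ttlSecondsAfterFinished": 7200},
    "status": {
        "conditions": [
            {
                "type": "Complete",
                "status": "True",
                "lastTransitionTime": "2022-08-08T06:00:00Z",
            }
        ]
    },
}

print(ttl_expiry(job).isoformat())
```

If the TTL is longer than the custom GC's period, the custom GC gets at least one pass over the finished Job before the TTL cleanup applies, and the TTL then acts as a backstop.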

Comment 28 zhou ying 2022-08-08 06:08:38 UTC
Tested a job with the .spec.ttlSecondsAfterFinished field set: after the job completed, the job and its pods were all deleted.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.26   True        False         4h1m    Cluster version is 4.10.26

Comment 29 brad.williams 2022-08-08 15:20:42 UTC
I have also verified that setting .spec.ttlSecondsAfterFinished resolves the issue.
Thanks!

Comment 34 errata-xmlrpc 2022-08-23 18:29:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.28 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6095