Bug 1920182
| Summary: | Job pods orphaned after Job deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | brad.williams |
| Component: | kube-controller-manager | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.6 | CC: | aos-bugs, jchaloup, jupierce, maszulik, mfojtik |
| Target Milestone: | --- | | |
| Target Release: | 4.10.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-23 18:29:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
brad.williams
2021-01-25 17:31:34 UTC
Created attachment 1750643 [details]
Release Creation Job definition
Created attachment 1750644 [details]
Release Creation Job Pod definition
Created attachment 1750645 [details]
Release Verification Job definition
Created attachment 1750646 [details]
Release Verification Job Pod definition
I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

I still see evidence of this issue occurring on our CI cluster:

$ oc --context app.ci get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        False         2d6h    Cluster version is 4.7.1
$ oc --context -n ci-release get pods 4.9.0-0.ci-2021-03-09-043953-ln5t8 -o json | jq '.metadata.labels["job-name"]'
"4.9.0-0.ci-2021-03-09-043953"
$ oc --context -n ci-release get jobs 4.9.0-0.ci-2021-03-09-043953
Error from server (NotFound): jobs.batch "4.9.0-0.ci-2021-03-09-043953" not found

The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently. The bug assignee was notified.

Spent some time looking into this bug this sprint. I was initially waiting on someone to increase the log level for kube-controller-manager. Finally, I was given cluster-admin privileges this week. I'll look more into it next sprint.

This bug hasn't had any activity in the last 30 days.
Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

This was accidentally marked as blocker+ and popped up today while reviewing 4.8 blocker bugs. I'm dropping blocker+ from this bug, since it is neither a regression nor blocking any cluster from working. We'll work on it as soon as possible, but it doesn't deserve blocker status per se.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

The LifecycleStale keyword was removed because the bug moved to QE. The bug assignee was notified.

Jan, https://github.com/kubernetes/enhancements/issues/592 is what was discussed in the past. It allows job authors to set a field (`ttlSecondsAfterFinished`) which then cleans up the job a given time after its completion. This went GA in k8s 1.23, so OCP 4.10, and that's the correct way to clean up the jobs.

Thank you, Maciej, for the confirmation.
> From the godoc, of `ttlSecondsAfterFinished`, it sounds like we need to leave this unset to continue performing our GC this way.
Brad, would it help to set `ttlSecondsAfterFinished` to a value higher than the GC's cleaning period?
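
For readers following along, here is a minimal sketch of a Job manifest that sets `ttlSecondsAfterFinished`, as suggested above. The helper name `job_with_ttl`, the Job name, the image, and the one-hour TTL are hypothetical placeholders and are not taken from this bug's attachments; only the `batch/v1` field names come from the Kubernetes Job API.

```python
import json


def job_with_ttl(name, image, command, ttl_seconds):
    """Build a batch/v1 Job manifest (as a dict) with a TTL after completion.

    With ttlSecondsAfterFinished set, the TTL-after-finished controller
    deletes the Job, and its dependent pods along with it, once the Job
    has been finished for ttl_seconds.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Seconds to keep the finished Job before it becomes eligible for cleanup.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    manifest = job_with_ttl(
        "example-job", "registry.example.com/tools:latest", ["true"], 3600
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `oc apply -f -`. Whether a given TTL interacts well with an external garbage collector, as asked above, depends on that collector's schedule, which this sketch does not model.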
Tested a job with the .spec.ttlSecondsAfterFinished field: when the job completed, the job and its pods were all deleted.

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.26   True        False         4h1m    Cluster version is 4.10.26

I have also verified that setting .spec.ttlSecondsAfterFinished resolves the issue. Thanks!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.28 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6095
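
To look for leftovers of the kind originally reported in this bug (a pod whose `job-name` label names a Job that no longer exists), a small script along these lines may help. It is only a sketch: the function name `find_orphaned_job_pods` is hypothetical, and it assumes, as the manual check in this thread did, that the `job-name` label is a good enough proxy for ownership. It works on the JSON list output of `oc get pods -o json` and `oc get jobs -o json`.

```python
import json


def find_orphaned_job_pods(pods_json, jobs_json):
    """Return (namespace, pod, job_name) for pods whose Job is missing.

    Both arguments are JSON strings in the list format printed by
    `oc get <resource> -o json` (an object with an "items" array).
    """
    pods = json.loads(pods_json)["items"]
    jobs = json.loads(jobs_json)["items"]

    # Jobs that still exist, keyed by (namespace, name).
    existing = {
        (job["metadata"].get("namespace"), job["metadata"]["name"]) for job in jobs
    }

    orphans = []
    for pod in pods:
        meta = pod["metadata"]
        job_name = (meta.get("labels") or {}).get("job-name")
        if job_name is None:
            continue  # Pod does not carry the job-name label; skip it.
        if (meta.get("namespace"), job_name) not in existing:
            orphans.append((meta.get("namespace"), meta["name"], job_name))
    return orphans
```

A possible way to use it is to save the output of `oc -n ci-release get pods -o json` and `oc -n ci-release get jobs -o json` to files and pass their contents to the function; an empty result suggests no orphaned Job pods remain in that namespace.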