Bug 2057378

Summary:	Pods in Complete state are not removed
Product:	OpenShift Container Platform	Reporter:	Joel Rosental R. <jrosenta>
Component:	kube-controller-manager	Assignee:	Maciej Szulik <maszulik>
Status:	CLOSED DUPLICATE	QA Contact:	zhou ying <yinzhou>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.9	CC:	aos-bugs, mfojtik
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	All
Last Closed:	2022-03-04 15:57:53 UTC	Type:	Bug

Description Joel Rosental R. 2022-02-23 10:03:14 UTC

Description of problem:

After upgrading from 4.8 to 4.9.11, pods in the "Completed" state are no longer deleted; e.g., CronJobs seem to ignore the successfulJobsHistoryLimit value.

The kube-controller-manager (KCM) logs contain several occurrences of these lines:

~~~
2022-01-10T09:00:08.627571256Z E0110 09:00:08.627527       1 shared_informer.go:243] unable to sync caches for garbage collector
2022-01-10T09:00:08.627571256Z E0110 09:00:08.627541       1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 5559)
~~~
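To gauge how long the garbage collector has been failing, the attempt counter can be pulled out of the KCM log. The following is only a minimal sketch: it assumes the message format is exactly as quoted above, and `gc_sync_attempts` is a hypothetical helper name, not part of any OpenShift tooling.

```python
import re
from typing import List

# Matches the GC sync timeout message in the form quoted in this report.
GC_SYNC_RE = re.compile(
    r"garbagecollector\.go:\d+\] timed out waiting for dependency graph "
    r"builder sync during GC sync \(attempt (\d+)\)"
)


def gc_sync_attempts(log_text: str) -> List[int]:
    """Return the attempt numbers of all GC sync timeouts found in log_text."""
    return [int(m.group(1)) for m in GC_SYNC_RE.finditer(log_text)]


# The two log lines from this report, joined as they would appear in a log file.
sample = (
    "2022-01-10T09:00:08.627571256Z E0110 09:00:08.627527       1 "
    "shared_informer.go:243] unable to sync caches for garbage collector\n"
    "2022-01-10T09:00:08.627571256Z E0110 09:00:08.627541       1 "
    "garbagecollector.go:242] timed out waiting for dependency graph builder "
    "sync during GC sync (attempt 5559)\n"
)

print(gc_sync_attempts(sample))  # [5559]
```

An attempt number in the thousands suggests the sync has been failing repeatedly for some time rather than as a one-off.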

No webhooks appear to be blocking the garbage collector (GC) from running.

Version-Release number of selected component (if applicable):
OCP 4.9.11

How reproducible:
Always (in customer env)

Steps to Reproduce:
1. Create any object that produces pods in the "Completed" state, e.g. a CronJob, and set its "successfulJobsHistoryLimit" parameter.


Actual results:

Pods in "Completed" status are never removed, e.g.:

~~~
# oc get cronjob cronjob-ldap-group-sync -o yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: "2021-08-02T12:09:55Z"
  generation: 1
  labels:
    template: cronjob-ldap-group-sync-secure
    template.openshift.io/template-instance-owner: c09af01c-4c10-11ea-810b-0a580a80002e
  name: cronjob-ldap-group-sync
  namespace: oe930-cron
  resourceVersion: "476752196"
  uid: 7c11ecf3-86f3-4a3f-bccc-47a37a2f9764
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      backoffLimit: 0
      template:
        metadata:
          creationTimestamp: null
        spec:
          activeDeadlineSeconds: 500
          containers:
          - command:
            - /bin/bash
            - -c
            - oc adm groups sync --confirm --sync-config=/ldap-sync/config/ldap-group-sync.yaml
              $([ -s /ldap-sync/config/whitelist.txt ] && echo --whitelist=/ldap-sync/config/whitelist.txt)
            image: registry.redhat.io/openshift4/ose-cli:latest
            imagePullPolicy: IfNotPresent
            name: cronjob-ldap-group-sync
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /ldap-sync/config
              name: ldap-sync-config
            - mountPath: /ldap-sync/ca
              name: ldap-sync-ca
            - mountPath: /ldap-sync/secrets
              name: ldap-bind-password
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: ldap-group-syncer
          serviceAccountName: ldap-group-syncer
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: ldap-group-sync
            name: ldap-sync-config
          - configMap:
              defaultMode: 420
              name: ldap-group-sync-ca
            name: ldap-sync-ca
          - name: ldap-bind-password
            secret:
              defaultMode: 420
              secretName: ldap-bind-password
  schedule: '@hourly'
  successfulJobsHistoryLimit: 0
  suspend: false
status:
  lastScheduleTime: "2022-01-11T09:00:00Z"
  lastSuccessfulTime: "2022-01-10T15:02:26Z"
# oc get pods
NAME                                        READY   STATUS      RESTARTS   AGE
cronjob-ldap-group-sync-27363900--1-gns4s   0/1     Completed   0          16h
cronjob-ldap-group-sync-27363960--1-mcrpq   0/1     Completed   0          15h
cronjob-ldap-group-sync-27364020--1-ckprd   0/1     Completed   0          14h
cronjob-ldap-group-sync-27364080--1-mt4sn   0/1     Completed   0          13h
cronjob-ldap-group-sync-27364140--1-q29wf   0/1     Completed   0          12h
cronjob-ldap-group-sync-27364200--1-n6hcg   0/1     Completed   0          11h
cronjob-ldap-group-sync-27364260--1-4bqm4   0/1     Completed   0          10h
cronjob-ldap-group-sync-27364320--1-kbv8b   0/1     Completed   0          9h
cronjob-ldap-group-sync-27364380--1-gwd9b   0/1     Completed   0          8h
cronjob-ldap-group-sync-27364440--1-n4jrp   0/1     Completed   0          7h15m
cronjob-ldap-group-sync-27364500--1-v458d   0/1     Completed   0          6h15m
cronjob-ldap-group-sync-27364560--1-cp29v   0/1     Completed   0          5h15m
cronjob-ldap-group-sync-27364620--1-6cnxn   0/1     Completed   0          4h15m
cronjob-ldap-group-sync-27364680--1-h5f4n   0/1     Completed   0          3h15m
cronjob-ldap-group-sync-27364740--1-zwmsn   0/1     Completed   0          135m
cronjob-ldap-group-sync-27364800--1-2zkvf   0/1     Completed   0          75m
cronjob-ldap-group-sync-27364860--1-4fnv2   0/1     Completed   0          15m
~~~
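The pod names above can be mapped back to their scheduled run. The sketch below assumes that the numeric part of the name is the scheduled time in minutes since the Unix epoch and that pod names follow the `<cronjob>-<minutes>--1-<suffix>` pattern seen in this listing; `scheduled_time` is a hypothetical helper. The assumption is consistent with the data here: the newest pod decodes to the `lastScheduleTime` reported in the CronJob status.

```python
import re
from datetime import datetime, timezone

# Pattern taken from the pod names in this report; an assumption, not a stable API.
POD_RE = re.compile(r"^(?P<cronjob>.+)-(?P<minutes>\d+)--1-(?P<suffix>[a-z0-9]+)$")


def scheduled_time(pod_name):
    """Decode the scheduled run time (UTC) from a pod name, or None if it does not match."""
    m = POD_RE.match(pod_name)
    if m is None:
        return None
    # Assumed encoding: minutes since the Unix epoch.
    return datetime.fromtimestamp(int(m.group("minutes")) * 60, tz=timezone.utc)


print(scheduled_time("cronjob-ldap-group-sync-27364860--1-4fnv2").isoformat())
# 2022-01-11T09:00:00+00:00
```

Decoding the oldest and newest names this way shows one leftover pod per hourly run, i.e. none of the successful runs in this window was cleaned up.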

Expected results:

Completed pods should be cleaned up by the GC after a while; in particular, the successfulJobsHistoryLimit set on a CronJob should be honoured.
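For reference, the expected retention behaviour can be sketched as follows. This is an illustrative simplification, not the controller's actual code: `jobs_to_delete` and the `(name, start_time)` tuple layout are hypothetical, and start times are plain integers for brevity.

```python
from typing import List, Tuple


def jobs_to_delete(finished: List[Tuple[str, int]], history_limit: int) -> List[str]:
    """Return the names of the oldest successful jobs that exceed history_limit.

    finished: (job_name, start_time) pairs for successfully finished jobs.
    """
    if history_limit < 0:
        raise ValueError("history_limit must be >= 0")
    ordered = sorted(finished, key=lambda job: job[1])  # oldest first
    excess = len(ordered) - history_limit
    return [name for name, _ in ordered[:max(excess, 0)]]


# Three successful runs, using the numeric job-name suffix as the start time.
jobs = [
    ("cronjob-ldap-group-sync-27364860", 27364860),
    ("cronjob-ldap-group-sync-27364740", 27364740),
    ("cronjob-ldap-group-sync-27364800", 27364800),
]

print(jobs_to_delete(jobs, 0))  # all three jobs, oldest first
print(jobs_to_delete(jobs, 2))  # only the oldest job
```

With the limit set to 0, as in the CronJob above, every successful job counts as excess, so no "Completed" pods should remain once the jobs are deleted and the GC removes their dependent pods.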

Additional info:

Comment 6 Maciej Szulik 2022-03-04 15:57:53 UTC


*** This bug has been marked as a duplicate of bug 2050912 ***