Description of problem:

The ip-reconciler pods scheduled by the cronjob complete successfully, but they are not deleted even after 7 days. Because of how OVN works, their IPs are not released until the pods are deleted.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure

Steps to Reproduce:
1. Check the number of succeeded jobs in the openshift-multus project

Actual results:
Pods that completed 7 days ago are still present in the project.

Expected results:
Pods that completed a few days back should be deleted automatically.

Additional info:
OVN discussion regarding the IP release issue: https://bugzilla.redhat.com/show_bug.cgi?id=2026461
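To check for the leftover pods, a minimal sketch using standard oc flags (the exact output will vary per cluster):

  $ oc get pods -n openshift-multus --field-selector=status.phase=Succeeded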
Do you mind pasting the spec of the ip-reconciler cronjob? AFAIU, this should not happen, since we're setting `successfulJobsHistoryLimit` to 0: https://github.com/openshift/cluster-network-operator/blob/release-4.9/bindata/network/multus/multus.yaml#L471

Furthermore, it's very weird to see that many instances of the reconciler. Even with the default setting, it should preserve only 3; quoting from the Kubernetes API reference [0]:

"""
successfulJobsHistoryLimit
The number of successful finished jobs to retain. Value must be non-negative integer. Defaults to 3.
"""

[0] - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch
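For reference, a sketch of how to grab that spec (assuming the cronjob lives in openshift-multus, as in the manifest linked above):

  $ oc get cronjob ip-reconciler -n openshift-multus -o yaml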
Please find the requested spec below:

apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: "2022-03-01T07:38:51Z"
  generation: 1
  labels:
    app: whereabouts
    tier: node
  name: ip-reconciler
  namespace: openshift-multus
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: dae6ae8e-db47-448e-bc95-aaedaa72fa0c
  resourceVersion: "302184365"
  uid: 44788a07-2e56-435e-a398-9e95c35c4f1e
spec:
  concurrencyPolicy: Replace
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /ip-reconciler
            - -log-level=verbose
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0a22b43fc7350228e762993bbecd9416d5e8aa97b5579c63a66bd9df7965f857
            imagePullPolicy: IfNotPresent
            name: whereabouts
            resources:
              requests:
                cpu: 25m
                memory: 25Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          dnsPolicy: ClusterFirst
          priorityClassName: system-cluster-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: multus
          serviceAccountName: multus
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /etc/kubernetes/cni/net.d
              type: ""
            name: cni-net-dir
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 0
  suspend: false
status:
  lastScheduleTime: "2022-03-30T05:30:00Z"
  lastSuccessfulTime: "2022-03-30T00:30:05Z"

If you need to see the must gather, I have linked the case and it should be available there.
This shouldn't be happening: `successfulJobsHistoryLimit` is configured to 0, but that setting is *not* being honored. I think this is a core kubernetes bug, maybe? @
I did not finish writing comment #3; @dosmith, I think this is a core kubernetes bug. Not sure if we should update the component.
Miguel -- I agree. Let's get a look from the kubelet side to see if there's a reason why the pod wasn't removed (please re-assign if kubelet isn't the appropriate component for this, e.g. if the problem is with the API object rather than the pod on the node).
Hello team, is there any update that I can share with the customer? This issue is causing the IP address pool to deplete, which in turn causes issues with new pods. Thanks.
@sdalela This might be related to https://github.com/kubernetes/kubernetes/pull/104799. They could use 4.10, which already has the patch.
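In the meantime, as a possible interim workaround (my own suggestion, not something confirmed in this thread), the completed pods could be deleted manually, which per the description above should also release their IPs:

  $ oc delete pods -n openshift-multus --field-selector=status.phase=Succeeded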
Moving for verification by QA, since https://github.com/openshift/kubernetes/pull/1223, which includes https://github.com/kubernetes/kubernetes/pull/104799, has merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1694
I assume the customer is still on an older version of OCP, since this merged and was released only recently. I'd suggest the customer upgrade the cluster and report back if this is still happening.
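A sketch of the upgrade step (assuming the cluster's update channel already offers a fixed release such as 4.9.32; run `oc adm upgrade` with no arguments first to list the versions actually available):

  $ oc adm upgrade --to=4.9.32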
The qe_test_coverage flag here is set to '-' since, based on the above comments, it looks like this is not a bug.