Bug 2069527
| Summary: | ip reconciler pods are not getting deleted and their IP addresses not released | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Swapnil Dalela <sdalela> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.9 | CC: | aos-bugs, dosmith, knarra, maszulik, mfojtik, qiwan, rphillips, yinzhou, zhouyingfu |
| Target Milestone: | --- | Flags: | zhouyingfu: needinfo- |
| Target Release: | 4.9.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-12 20:40:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Swapnil Dalela
2022-03-29 05:35:31 UTC
Do you mind pasting the spec of the ip-reconciler cronjob? AFAIU, this should not happen, since we're setting the `successfulJobsHistoryLimit` to 0: https://github.com/openshift/cluster-network-operator/blob/release-4.9/bindata/network/multus/multus.yaml#L471

Furthermore, it's very weird to see that many instances of the reconciler. According to the docs, it should preserve 3 by default; quoting from the Kubernetes API reference [0]:

> successfulJobsHistoryLimit
> The number of successful finished jobs to retain. Value must be non-negative integer. Defaults to 3.

[0] - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch

---

Please find the requested spec below:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: "2022-03-01T07:38:51Z"
  generation: 1
  labels:
    app: whereabouts
    tier: node
  name: ip-reconciler
  namespace: openshift-multus
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: dae6ae8e-db47-448e-bc95-aaedaa72fa0c
  resourceVersion: "302184365"
  uid: 44788a07-2e56-435e-a398-9e95c35c4f1e
spec:
  concurrencyPolicy: Replace
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /ip-reconciler
            - -log-level=verbose
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0a22b43fc7350228e762993bbecd9416d5e8aa97b5579c63a66bd9df7965f857
            imagePullPolicy: IfNotPresent
            name: whereabouts
            resources:
              requests:
                cpu: 25m
                memory: 25Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          dnsPolicy: ClusterFirst
          priorityClassName: system-cluster-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: multus
          serviceAccountName: multus
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /etc/kubernetes/cni/net.d
              type: ""
            name: cni-net-dir
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 0
  suspend: false
status:
  lastScheduleTime: "2022-03-30T05:30:00Z"
  lastSuccessfulTime: "2022-03-30T00:30:05Z"
```

If you need to see the must-gather, I have linked the case and it should be available there.

---

This shouldn't be happening: `successfulJobsHistoryLimit` is configured to 0, and that setting is *not* being honored. I think this is a core Kubernetes bug, maybe?

---

I did not finish writing comment#3; @dosmith I think this is a core Kubernetes bug, not sure if we should update the component.

---

Miguel -- I agree. Let's get a look from the kubelet side to see if there's a reason why the pod wasn't removed (please re-assign if kubelet isn't the appropriate component for this, e.g. if it's the API object and not the pod on the node).

---

Hello team,

Is there any update that I can share with the customer? This issue is causing the IP address pool to deplete, which causes issues with new pods.

Thanks.

---

@sdalela This might be related to https://github.com/kubernetes/kubernetes/pull/104799. They could use 4.10, which already has the patch.

---

Moving for verification by QA, since https://github.com/openshift/kubernetes/pull/1223, which includes https://github.com/kubernetes/kubernetes/pull/104799, merged.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1694

---

I assume the customer is still on an older version of OCP, since this merged and was released recently. I'd suggest the customer upgrade the cluster and report back if this is still happening.

---

The qe_test_coverage flag here is set to '-' as, based on the above comments, it looks like this is not a bug.
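For context on the `successfulJobsHistoryLimit` semantics discussed above, here is a minimal Python model of the documented pruning behavior (retain at most N finished jobs per outcome, deleting the oldest first). This is an illustrative sketch only; the `Job` class and `prune_finished_jobs` helper are hypothetical names and this is not the actual kube-controller-manager code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    """Minimal stand-in for a finished batch/v1 Job (hypothetical type)."""
    name: str
    finished_at: datetime
    succeeded: bool = True


def prune_finished_jobs(jobs, successful_limit, failed_limit):
    """Model of the documented history-limit semantics: keep at most
    `successful_limit` successful and `failed_limit` failed finished jobs,
    oldest deleted first. Simplified for illustration, not the real
    controller implementation."""
    kept = []
    for succeeded, limit in ((True, successful_limit), (False, failed_limit)):
        bucket = sorted(
            (j for j in jobs if j.succeeded == succeeded),
            key=lambda j: j.finished_at,
        )
        # Keep only the `limit` most recent finished jobs in this bucket.
        kept.extend(bucket[max(0, len(bucket) - limit):])
    return kept


t0 = datetime(2022, 3, 30)
runs = [Job(f"ip-reconciler-{i}", t0 + timedelta(minutes=15 * i)) for i in range(5)]

# With the spec above (successfulJobsHistoryLimit: 0, failedJobsHistoryLimit: 1),
# no successful job should survive pruning -- which is why the lingering
# ip-reconciler pods looked like a limit-not-honored bug.
print([j.name for j in prune_finished_jobs(runs, successful_limit=0, failed_limit=1)])
# → []
```

Under these semantics, the reported behavior (many retained ip-reconciler pods despite a limit of 0) points at the controller or garbage collection, not at the CronJob spec itself.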