Bug 2069527
| Summary: | ip-reconciler pods are not getting deleted and their IP addresses are not released | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Swapnil Dalela <sdalela> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.9 | CC: | aos-bugs, dosmith, knarra, maszulik, mfojtik, qiwan, rphillips, yinzhou, zhouyingfu |
| Target Milestone: | --- | Flags: | zhouyingfu: needinfo- |
| Target Release: | 4.9.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-12 20:40:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Swapnil Dalela
2022-03-29 05:35:31 UTC
Do you mind pasting the spec of the ip-reconciler cronjob? AFAIU, this should not happen, since we're setting the `successfulJobsHistoryLimit` to 0: https://github.com/openshift/cluster-network-operator/blob/release-4.9/bindata/network/multus/multus.yaml#L471

Furthermore, it's very weird to see that many instances of the reconciler. According to the docs, it should preserve 3 by default; quoting from the Kubernetes API reference [0]:

"""
successfulJobsHistoryLimit
The number of successful finished jobs to retain. Value must be non-negative integer. Defaults to 3.
"""

[0] - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch

Please find the requested spec below:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: "2022-03-01T07:38:51Z"
  generation: 1
  labels:
    app: whereabouts
    tier: node
  name: ip-reconciler
  namespace: openshift-multus
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: dae6ae8e-db47-448e-bc95-aaedaa72fa0c
  resourceVersion: "302184365"
  uid: 44788a07-2e56-435e-a398-9e95c35c4f1e
spec:
  concurrencyPolicy: Replace
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /ip-reconciler
            - -log-level=verbose
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0a22b43fc7350228e762993bbecd9416d5e8aa97b5579c63a66bd9df7965f857
            imagePullPolicy: IfNotPresent
            name: whereabouts
            resources:
              requests:
                cpu: 25m
                memory: 25Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          dnsPolicy: ClusterFirst
          priorityClassName: system-cluster-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: multus
          serviceAccountName: multus
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /etc/kubernetes/cni/net.d
              type: ""
            name: cni-net-dir
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 0
  suspend: false
status:
  lastScheduleTime: "2022-03-30T05:30:00Z"
  lastSuccessfulTime: "2022-03-30T00:30:05Z"
```
If you need to see the must-gather, I have linked the case and it should be available there.
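Before digging further, it may help to confirm what the CronJob is actually configured to keep and how many finished reconciler pods are lingering. Below is a minimal sketch using standard `oc` commands; it is illustrative only and not taken from the attached must-gather.

```sh
# Show the configured history limits on the ip-reconciler CronJob
oc get cronjob ip-reconciler -n openshift-multus \
  -o jsonpath='{.spec.successfulJobsHistoryLimit}{"\n"}{.spec.failedJobsHistoryLimit}{"\n"}'

# List finished ip-reconciler pods that should already have been cleaned up
oc get pods -n openshift-multus --field-selector=status.phase=Succeeded | grep ip-reconciler

# List the Jobs created by the CronJob (each leftover pod belongs to one of these)
oc get jobs -n openshift-multus | grep ip-reconciler
```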
This shouldn't be happening, the `successfulJobsHistoryLimit` is configured to 0. That setting is *not* being honored. I think this is a core kubernetes bug, maybe?

I did not finish writing comment #3; @dosmith I think this is a core kubernetes bug, not sure if we should update the component.

Miguel -- I agree. Let's get a look from the kubelet side to see if there's a reason why the pod wasn't removed (please re-assign if kubelet isn't the appropriate component for this, e.g. if it's the API object and not the pod on the node).

Hello team, is there any update that I can share with the customer? This issue is causing the IP address pool to deplete, which causes issues with new pods. Thanks.

@sdalela This might be related to https://github.com/kubernetes/kubernetes/pull/104799. They could use 4.10, which already has the patch.

Moving for verification by QA, since https://github.com/openshift/kubernetes/pull/1223, which includes https://github.com/kubernetes/kubernetes/pull/104799, has merged.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1694

I assume the customer is still on an older version of OCP, since this merged and was released recently. I'd suggest the customer upgrade the cluster and report back if this is still happening.

The qe_test_coverage flag here is set to '-' as, based on the above comments, it looks like this is not a bug.
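The thread also notes that the leaking pods deplete the whereabouts IP pool until the cluster is upgraded to a build carrying the fix. As a hedged sketch only (none of this is prescribed in the bug, and the whereabouts CRD names are assumptions based on the upstream project), an admin waiting on the upgrade could run the reconciler by hand and clear the lingering completed pods:

```sh
# Trigger an ad-hoc run of the reconciler from the existing CronJob
oc create job ip-reconciler-manual -n openshift-multus --from=cronjob/ip-reconciler

# Remove the finished reconciler pods that the history limit failed to prune
oc delete pods -n openshift-multus --field-selector=status.phase=Succeeded

# Inspect the whereabouts IP reservations (CRD names assumed from the upstream project)
oc get ippools.whereabouts.cni.cncf.io -A
oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
```

Whether removing the completed pods alone releases the reserved addresses was not confirmed in this bug; upgrading to a release that includes kubernetes#104799 remains the actual fix.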