Bug 2069527 - ip reconciler pods are not getting deleted and their IP addresses not released
Summary: ip reconciler pods are not getting deleted and their IP addresses not released
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Maciej Szulik
QA Contact: zhou ying
 
Reported: 2022-03-29 05:35 UTC by Swapnil Dalela
Modified: 2022-11-21 16:40 UTC
CC: 9 users

Last Closed: 2022-05-12 20:40:46 UTC



Links:
Red Hat Product Errata RHBA-2022:1694 (last updated 2022-05-12 20:41:01 UTC)

Description Swapnil Dalela 2022-03-29 05:35:31 UTC
Description of problem:

The ip-reconciler pods scheduled by the cronjob complete successfully, but they are not deleted even after 7 days. Because of how OVN works, their IP addresses are not released until the pods are deleted.


Version-Release number of selected component (if applicable):


How reproducible:

Not sure

Steps to Reproduce:
1. Check the number of succeeded pods in the openshift-multus project (see the example commands below)
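
For example, the lingering pods can be listed and counted with standard oc commands (a sketch; this assumes the completed reconciler pods are the only Succeeded pods in the namespace):

  # list completed (Succeeded) pods in the project
  $ oc get pods -n openshift-multus --field-selector=status.phase=Succeeded
  # or just the count
  $ oc get pods -n openshift-multus --field-selector=status.phase=Succeeded -o name | wc -l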

Actual results:

Pods that completed 7 days ago are still present in the project.

Expected results:

Pods that completed a few days ago should be deleted automatically.

Additional info:

OVN discussion regarding the IP release issue: https://bugzilla.redhat.com/show_bug.cgi?id=2026461

Comment 1 Miguel Duarte Barroso 2022-03-29 15:42:21 UTC
Do you mind pasting the spec of the ip-reconciler cronjob?

AFAIU, this should not happen, since we're setting the `successfulJobsHistoryLimit` to 0: 
https://github.com/openshift/cluster-network-operator/blob/release-4.9/bindata/network/multus/multus.yaml#L471

Furthermore, it's very weird to see that many instances of the reconciler. According to the docs, it should preserve 3 by default; quoting from the Kubernetes API reference [0]: 
"""
successfulJobsHistoryLimit    The number of successful finished jobs to retain. Value must be non-negative integer. Defaults to 3.
"""

[0] - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#cronjobspec-v1-batch
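
To see what is actually deployed, the cronjob spec can be dumped with standard oc usage (object and namespace names as in this bug):

  $ oc get cronjob ip-reconciler -n openshift-multus -o yaml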

Comment 2 Swapnil Dalela 2022-03-30 12:31:36 UTC
Please find the requested spec below:

apiVersion: batch/v1
kind: CronJob
metadata:
  creationTimestamp: "2022-03-01T07:38:51Z"
  generation: 1
  labels:
    app: whereabouts
    tier: node
  name: ip-reconciler
  namespace: openshift-multus
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: dae6ae8e-db47-448e-bc95-aaedaa72fa0c
  resourceVersion: "302184365"
  uid: 44788a07-2e56-435e-a398-9e95c35c4f1e
spec:
  concurrencyPolicy: Replace
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /ip-reconciler
            - -log-level=verbose
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0a22b43fc7350228e762993bbecd9416d5e8aa97b5579c63a66bd9df7965f857
            imagePullPolicy: IfNotPresent
            name: whereabouts
            resources:
              requests:
                cpu: 25m
                memory: 25Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
          dnsPolicy: ClusterFirst
          priorityClassName: system-cluster-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: multus
          serviceAccountName: multus
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /etc/kubernetes/cni/net.d
              type: ""
            name: cni-net-dir
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 0
  suspend: false
status:
  lastScheduleTime: "2022-03-30T05:30:00Z"
  lastSuccessfulTime: "2022-03-30T00:30:05Z"

If you need to see the must-gather, I have linked the case and it should be available there.

Comment 3 Miguel Duarte Barroso 2022-03-30 12:50:58 UTC
This shouldn't be happening; the `successfulJobsHistoryLimit` is configured to 0, yet that setting is *not* being honored. 

I think this is a core kubernetes bug, maybe? @

Comment 4 Miguel Duarte Barroso 2022-03-30 12:53:32 UTC
I did not finish writing comment#3;

@dosmith I think this is a core kubernetes bug, not sure if we should update the component.
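
For reference, the configured limit can be read back directly with a standard jsonpath query (expected output per the spec in comment 2):

  $ oc get cronjob ip-reconciler -n openshift-multus -o jsonpath='{.spec.successfulJobsHistoryLimit}'
  0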

Comment 5 Douglas Smith 2022-03-30 15:26:06 UTC
Miguel -- I agree. Let's get a look from the kubelet side to see if there's a reason why the pod wasn't removed (please re-assign if kubelet isn't the appropriate component for this, e.g. if it's the API object and not the pod on the node).

Comment 6 Swapnil Dalela 2022-04-06 09:53:15 UTC
Hello team, is there any update that I can share with the customer? This issue is causing the IP address pool to deplete, which causes issues with new pods. Thanks.

Comment 7 Qi Wang 2022-04-06 16:36:56 UTC
@sdalela This might be related to https://github.com/kubernetes/kubernetes/pull/104799. They could use 4.10, which already has the patch.
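
Until an upgrade is possible, a possible interim workaround is to prune the completed pods by hand so OVN releases their IPs (again assuming the reconciler pods are the only Succeeded pods in the namespace):

  $ oc delete pods -n openshift-multus --field-selector=status.phase=Succeeded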

Comment 14 Maciej Szulik 2022-04-19 11:02:44 UTC
Moving for verification by QA, since https://github.com/openshift/kubernetes/pull/1223 (which includes https://github.com/kubernetes/kubernetes/pull/104799) has merged.
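
One way to verify (a sketch, given the */15 schedule in the spec above): on a build that contains the patch, wait a few schedule intervals and confirm that successful jobs and their pods no longer accumulate:

  $ oc get jobs -n openshift-multus
  $ oc get pods -n openshift-multus --field-selector=status.phase=Succeeded

With successfulJobsHistoryLimit: 0 honored, both lists should stay empty between runs.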

Comment 28 errata-xmlrpc 2022-05-12 20:40:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.32 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1694

Comment 29 Maciej Szulik 2022-05-13 14:18:57 UTC
I assume the customer is still on an older version of OCP, since this merged and was released recently. I'd suggest the customer upgrade the cluster and report back if this is still happening.

Comment 31 RamaKasturi 2022-11-21 16:40:26 UTC
The qe_test_coverage flag here is set to '-' since, based on the above comments, it looks like this is not a bug.

