Bug 2058671 - whereabouts IPAM CNI ip-reconciler cronjob specification requires hostnetwork, api-int lb usage & proper backoff
Summary: whereabouts IPAM CNI ip-reconciler cronjob specification requires hostnetwork...
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks: 2058672
Reported: 2022-02-25 15:14 UTC by Douglas Smith
Modified: 2022-07-20 20:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The IP reconciliation cronjob for Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts.
Consequence: The cronjob could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs.
Fix: The job now uses the api-internal server address, and the API timeouts were extended.
Result: It was determined that this cronjob should only be launched on clusters that actively use Whereabouts IPAM CNI.
Clone Of:
Clones: 2058672
Environment:
Last Closed:
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1318 | 0 | None | Merged | Bug 2058671: ip reconciler: auto clean failed jobs | 2022-06-23 23:45:33 UTC
Red Hat Knowledge Base (Solution) | 6870061 | 0 | None | None | None | 2022-03-31 20:13:19 UTC

Description Douglas Smith 2022-02-25 15:14:52 UTC
Description of problem: A number of changes related to the ip-reconciler need to be properly implemented. The specific fixes are listed at the end of this description.

Impact: Without the proper backoff and replacement policies, many failed jobs can build up. Additionally, without host networking and use of the api-int load balancer, network connectivity problems can cause errors.

Note: A set of changes to the ip-reconciler itself

Fixes to include in this bug (and subsequent backports):

* auto clean failed jobs (https://github.com/openshift/cluster-network-operator/pull/1318)
* Use host network and api-int (https://github.com/openshift/cluster-network-operator/pull/1302)
* Disable retries on failure (https://github.com/openshift/cluster-network-operator/pull/1290)

Comment 3 Douglas Smith 2022-03-03 21:06:36 UTC
To verify:

run: oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"

which should result in:

  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
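
The same check can be scripted so that a missing setting is reported explicitly instead of being spotted by eye. The following is a minimal sketch, not part of the fix itself: the function name check_ip_reconciler is hypothetical, and it assumes the CronJob YAML is supplied on standard input and that the expected lines are exactly the five shown above.

```shell
#!/bin/sh
# check_ip_reconciler (hypothetical helper name): reads CronJob YAML on stdin
# and verifies that each expected setting appears on a line of its own.
# Returns 0 when all settings are present, 1 otherwise.
check_ip_reconciler() {
    yaml=$(cat)
    status=0
    for pattern in \
        'failedJobsHistoryLimit: 1' \
        'backoffLimit: 0' \
        'hostNetwork: true' \
        'name: KUBERNETES_SERVICE_PORT' \
        'name: KUBERNETES_SERVICE_HOST'
    do
        # Allow leading indentation and a YAML list dash before the key,
        # and require the value to end the line (so "1" does not match "10").
        if ! printf '%s\n' "$yaml" | grep -Eq "^[[:space:]-]*${pattern}[[:space:]]*\$"; then
            echo "missing: ${pattern}"
            status=1
        fi
    done
    return "$status"
}
```

Assuming the function has been defined in the current shell, it could be used as: oc get cronjob ip-reconciler -o yaml -n openshift-multus | check_ip_reconciler && echo "all settings present"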

Thank you!

Comment 4 Weibin Liang 2022-03-03 21:09:22 UTC
[weliang@weliang openshift-tests-private]$ oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"
  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-27-122819   True        False         6h18m   Cluster version is 4.11.0-0.nightly-2022-02-27-122819
[weliang@weliang openshift-tests-private]$

