Bug 2058671

Summary: whereabouts IPAM CNI ip-reconciler cronjob specification requires hostnetwork, api-int lb usage & proper backoff
Product: OpenShift Container Platform Reporter: Douglas Smith <dosmith>
Component: Networking Assignee: Douglas Smith <dosmith>
Networking sub component: multus QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: akanekar, bmehra, jdee, kgordeev, lmohanty, morgan.peterman, wking
Version: 4.10 Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The IP reconciliation cronjob for Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts. Consequence: The cronjob could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs. Fix: The job now uses the api-internal server address and extended API timeouts. Result: Determined that this cronjob should only be launched on clusters that actively use Whereabouts IPAM CNI.
Story Points: ---
Clone Of:
: 2058672 (view as bug list) Environment:
Last Closed: 2022-08-10 10:51:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2058672    

Description Douglas Smith 2022-02-25 15:14:52 UTC
Description of problem: A number of changes related to the ip-reconciler need to be properly implemented; they are listed under the fixes below.

Impact: Without the proper backoff and replacement policies, many failed jobs can build up. Additionally, without host networking and use of the api-int load balancer, network connectivity problems can cause errors.

Note: A set of changes to the ip-reconciler itself

Fixes to include in this bug (and subsequent backports):

* auto clean failed jobs (https://github.com/openshift/cluster-network-operator/pull/1318)
* Use host network and api-int (https://github.com/openshift/cluster-network-operator/pull/1302)
* Disable retries on failure (https://github.com/openshift/cluster-network-operator/pull/1290)
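
As a rough sketch, the three fixes above appear to correspond to three fields of the CronJob spec (failedJobsHistoryLimit, backoffLimit, and hostNetwork, judging by the verification output in this bug). The function below queries each field with jsonpath. The function name show_reconciler_settings and the OC override variable are hypothetical, introduced only for illustration; the field paths follow the standard Kubernetes CronJob schema.

```shell
#!/bin/sh
# Hypothetical helper: print the CronJob fields that the fixes listed above
# seem to touch. OC can be overridden (assumption for illustration); it
# defaults to the real `oc` client.
OC=${OC:-oc}

show_reconciler_settings() {
    for path in \
        '.spec.failedJobsHistoryLimit' \
        '.spec.jobTemplate.spec.backoffLimit' \
        '.spec.jobTemplate.spec.template.spec.hostNetwork'
    do
        # Print the field path, then the value reported by the cluster.
        printf '%s = ' "$path"
        "$OC" get cronjob ip-reconciler -n openshift-multus -o "jsonpath={$path}"
        echo
    done
}
```

Calling show_reconciler_settings against a cluster with the fixes applied would be expected to report 1, 0, and true respectively, matching the values in the verification comment.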

Comment 3 Douglas Smith 2022-03-03 21:06:36 UTC
To verify:

run: oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"

which should result in:

  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
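
The check above can also be scripted so that a missing setting is reported explicitly instead of being spotted by eye. This is a minimal sketch, not part of the fix: check_reconciler_spec is a hypothetical helper name, and it only looks for the five lines shown in the expected output, reading the manifest from stdin.

```shell
#!/bin/sh
# Hypothetical helper: read a CronJob manifest on stdin and report whether
# each of the five expected lines is present. Returns non-zero if any is missing.
check_reconciler_spec() {
    spec=$(cat)
    status=0
    for pattern in \
        'failedJobsHistoryLimit: 1' \
        'backoffLimit: 0' \
        'name: KUBERNETES_SERVICE_PORT' \
        'name: KUBERNETES_SERVICE_HOST' \
        'hostNetwork: true'
    do
        # Match the whole line, allowing leading indentation and a list dash,
        # so that e.g. "backoffLimit: 0" does not match "backoffLimit: 06".
        if printf '%s\n' "$spec" | grep -Eq -- "^[[:space:]-]*${pattern}[[:space:]]*$"; then
            echo "ok: $pattern"
        else
            echo "MISSING: $pattern"
            status=1
        fi
    done
    return "$status"
}
```

Usage, assuming the function has been sourced into the current shell:

oc get cronjob ip-reconciler -o yaml -n openshift-multus | check_reconciler_spec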

Thank you!

Comment 4 Weibin Liang 2022-03-03 21:09:22 UTC
[weliang@weliang openshift-tests-private]$ oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"
  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-27-122819   True        False         6h18m   Cluster version is 4.11.0-0.nightly-2022-02-27-122819
[weliang@weliang openshift-tests-private]$

Comment 8 errata-xmlrpc 2022-08-10 10:51:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069