Bug 2058671

Summary:	whereabouts IPAM CNI ip-reconciler cronjob specification requires hostnetwork, api-int lb usage & proper backoff
Product:	OpenShift Container Platform	Reporter:	Douglas Smith <dosmith>
Component:	Networking	Assignee:	Douglas Smith <dosmith>
Networking sub component:	multus	QA Contact:	Weibin Liang <weliang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	akanekar, bmehra, jdee, kgordeev, lmohanty, morgan.peterman, wking
Version:	4.10	Keywords:	Upgrades
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The IP reconciliation cronjob for Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts. Consequence: The cronjob could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs. Fix: Set the job to use the api-internal server address and extended the API timeouts. Result: Determined that this cronjob should only be launched on clusters that actively use Whereabouts IPAM CNI.	Story Points:	---
Clone Of:
Clones:	2058672 (view as bug list)		Environment:
Last Closed:	2022-08-10 10:51:23 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2058672

Description Douglas Smith 2022-02-25 15:14:52 UTC

Description of problem: A number of changes related to the ip-reconciler need to be properly implemented.

Impact: Without the proper backoff and replacement policies, many failed jobs can build up. Additionally, without host networking and use of the api-int load balancer, network connectivity problems can cause errors.

Note: A set of changes to the ip-reconciler itself

Fixes to include in this bug (and subsequent backports):

* auto clean failed jobs (https://github.com/openshift/cluster-network-operator/pull/1318)
* Use host network and api-int (https://github.com/openshift/cluster-network-operator/pull/1302)
* Disable retries on failure (https://github.com/openshift/cluster-network-operator/pull/1290)

Comment 3 Douglas Smith 2022-03-03 21:06:36 UTC

To verify:

run: oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"

which should result in:

  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
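
For scripted verification, the same check can be sketched as a small shell function that fails when any of the expected settings is missing. This is only a sketch: `check_ip_reconciler` is a hypothetical helper name (not part of `oc` or the operator), and it assumes the cronjob YAML is supplied on stdin.

```shell
# Sketch: verify that a cronjob YAML dump (read from stdin) contains the
# settings listed above. check_ip_reconciler is a hypothetical helper name.
check_ip_reconciler() {
  _yaml="$(cat)"
  _missing=0
  for _pattern in \
    'failedJobsHistoryLimit:[[:space:]]*1$' \
    'backoffLimit:[[:space:]]*0$' \
    'name:[[:space:]]*KUBERNETES_SERVICE_PORT$' \
    'name:[[:space:]]*KUBERNETES_SERVICE_HOST$' \
    'hostNetwork:[[:space:]]*true$'; do
    # Each pattern must match at least one line of the YAML.
    if ! printf '%s\n' "$_yaml" | grep -Eq "$_pattern"; then
      echo "missing: $_pattern" >&2
      _missing=1
    fi
  done
  return "$_missing"
}

# Usage (assumes a logged-in oc session):
#   oc get cronjob ip-reconciler -o yaml -n openshift-multus | check_ip_reconciler && echo OK
```

Unlike the plain grep, this returns a non-zero status when a field is absent or has a different value, so it can gate a test step.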

Thank you!

Comment 4 Weibin Liang 2022-03-03 21:09:22 UTC

[weliang@weliang openshift-tests-private]$ oc get cronjob ip-reconciler -o yaml -n openshift-multus | grep -Pi "KUBERNETES_SERVICE_PORT|KUBERNETES_SERVICE_HOST|failedJobsHistoryLimit|backoffLimit|hostNetwork"
  failedJobsHistoryLimit: 1
      backoffLimit: 0
            - name: KUBERNETES_SERVICE_PORT
            - name: KUBERNETES_SERVICE_HOST
          hostNetwork: true
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-27-122819   True        False         6h18m   Cluster version is 4.11.0-0.nightly-2022-02-27-122819
[weliang@weliang openshift-tests-private]$

Comment 8 errata-xmlrpc 2022-08-10 10:51:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069