2065488 – ip-reconciler job does not complete, halts node drain

Summary: ip-reconciler job does not complete, halts node drain

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Douglas Smith
QA Contact:	Weibin Liang
Docs Contact:
URL:
Whiteboard:
Depends On:	2065785
Blocks:

Reported:	2022-03-18 00:26 UTC by Matt Bargenquast
Modified:	2022-04-21 13:16 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2065785
Environment:
Last Closed:	2022-04-21 13:16:01 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift whereabouts-cni pull 88	0	None	Merged	Bug 2065488: Sync upstream for context improvements for reconciler [backport 4.10]	2022-04-14 23:35:12 UTC
Red Hat Product Errata	RHSA-2022:1356	0	None	None	None	2022-04-21 13:16:19 UTC

Description Matt Bargenquast 2022-03-18 00:26:31 UTC

Description of problem:

The cluster needed to drain and replace a node, which had the ip-reconciler job pod running on it. The drain was stalled with the ip-reconciler pod stuck in a Terminating state.

The logs of the ip-reconciler pod seemed to indicate that it never successfully ran:

---
2022-03-17T21:16:18Z [error] failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
2022-03-17T21:16:18Z [error] failed to create the reconcile looper: failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
---


Version-Release number of selected component (if applicable):

4.10.3

Actual results:


Expected results:
The ip-reconciler job should not put the cluster into a state that blocks the node drain.

Additional info:

Comment 2 Douglas Smith 2022-03-18 18:20:34 UTC

Thanks for the report. There are a couple of parts to this, as we're in the midst of backporting a number of fixes for this cronjob.

I've gone ahead and added this to our known errors so that, hopefully, the process exits cleanly in the future. It's also been noted that some of the other pending backports may help in this situation.

The PR to watch is: https://github.com/openshift/whereabouts-cni/pull/88 for 4.10.
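
As a rough illustration of the "exits cleanly" goal, and not the actual whereabouts change: a job that bounds its own runtime ends with a non-zero exit status instead of hanging. The sketch below uses GNU coreutils `timeout` as a stand-in for the reconciler's request timeout; the `run_bounded` helper and the `sleep` workload are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command, but give up after a fixed number of seconds.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit.
run_bounded() {
  limit_seconds=$1
  shift
  timeout "$limit_seconds" "$@"
}

# Stand-in for a call that would otherwise block (for example, an unreachable API).
status=0
run_bounded 1 sleep 5 || status=$?
echo "exit status: $status"
```

A pod whose process terminates like this finishes in an Error state, as in the verification run for this bug, rather than staying alive and holding up a drain.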

Comment 8 Weibin Liang 2022-04-14 14:20:14 UTC

Following the verification steps from https://gist.github.com/dougbtv/b84c5dec4953f4b85048d16ddcf72c15, testing passes in 4.10.10:

[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
Error from server (NotFound): cronjobs.batch "ip-reconciler" not found
[weliang@weliang ~]$ oc project openshift-multus
Now using project "openshift-multus" on server "https://api.weliang-4142.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang ~]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang ~]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang ~]$ oc get pods | grep testrun
testrun-ip-reconciler-pmzs6           0/1     Error     0          6s
[weliang@weliang ~]$ oc logs testrun-ip-reconciler-pmzs6
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
    	the path to the Kubernetes configuration file
  -log-level ip-reconciler
    	the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
    	the value for a request timeout in seconds. (default 30)
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.10   True        False         18m     Cluster version is 4.10.10
[weliang@weliang ~]$
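
For readers following the pipeline above: the two `sed` stages rename the cronjob and append an invalid `-timeout` argument right after the `-log-level=verbose` line, which is what makes the test pod fail fast with a flag parse error. A minimal sketch of that text transformation, assuming GNU sed; the manifest fragment and the `inject_invalid_timeout` name are made up for illustration and are not taken from the cluster.

```shell
#!/bin/sh
# Hypothetical helper: applies the same two sed edits used in the verification run.
# Assumes GNU sed (the one-line form of the `a` command is a GNU extension).
inject_invalid_timeout() {
  sed 's/name: ip-reconciler/name: test-reconciler/' |
    sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid'
}

# Made-up fragment standing in for the real cronjob manifest.
sample='metadata:
  name: ip-reconciler
spec:
            - -log-level=verbose'

result=$(printf '%s\n' "$sample" | inject_invalid_timeout)
printf '%s\n' "$result"
```

The renamed copy can then be created alongside the original cronjob without a name clash, which is why the transcript creates `test-reconciler` rather than editing `ip-reconciler` in place.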

Comment 10 errata-xmlrpc 2022-04-21 13:16:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1356
