Bug 2065488

Summary: ip-reconciler job does not complete, halts node drain
Product: OpenShift Container Platform Reporter: Matt Bargenquast <mbargenq>
Component: Networking Assignee: Douglas Smith <dosmith>
Networking sub component: multus QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: wking
Version: 4.10 Keywords: ServiceDeliveryImpact
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2065785 (view as bug list) Environment:
Last Closed: 2022-04-21 13:16:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2065785    
Bug Blocks:    

Description Matt Bargenquast 2022-03-18 00:26:31 UTC
Description of problem:

The cluster needed to drain and replace a node, which had the ip-reconciler job pod running on it. The drain was stalled with the ip-reconciler pod stuck in a Terminating state.

The logs of the ip-reconciler pod seemed to indicate that it never successfully ran:

---
2022-03-17T21:16:18Z [error] failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
2022-03-17T21:16:18Z [error] failed to create the reconcile looper: failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
---
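
One way to confirm which pods are holding up a drain is to filter a pod listing for the Terminating status. The following is a minimal sketch, not part of the original report: the function name `terminating_pods` is hypothetical, and it assumes the default tabular output of `oc get pods -A`, where STATUS is the fourth column.

```shell
# Print "namespace/name" for every pod whose STATUS column is Terminating.
# Expects the tabular output of `oc get pods -A` on stdin
# (assumed columns: NAMESPACE NAME READY STATUS ...).
# terminating_pods is a hypothetical helper name, not an oc subcommand.
terminating_pods() {
  # NR > 1 skips the header row; $4 is the STATUS column when -A is used.
  awk 'NR > 1 && $4 == "Terminating" { print $1 "/" $2 }'
}
```

With the node name in a variable, it could be used as `oc get pods -A --field-selector spec.nodeName="$NODE" | terminating_pods`.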


Version-Release number of selected component (if applicable):

4.10.3

Actual results:


Expected results:
The ip-reconciler job should not leave the cluster in a state that blocks the node drain.

Additional info:

Comment 2 Douglas Smith 2022-03-18 18:20:34 UTC
Thanks for the report. There are a couple of parts to this, as we're in the midst of backporting a number of fixes for this cronjob.

I've gone ahead and added this to our known errors so that hopefully the process exits cleanly in the future. It's also been noted that some of the other pending backports may help in this situation.

The PR to watch is: https://github.com/openshift/whereabouts-cni/pull/88 for 4.10.

Comment 8 Weibin Liang 2022-04-14 14:20:14 UTC
Following the verification steps from https://gist.github.com/dougbtv/b84c5dec4953f4b85048d16ddcf72c15, testing passes in 4.10.10:

[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
Error from server (NotFound): cronjobs.batch "ip-reconciler" not found
[weliang@weliang ~]$ oc project openshift-multus
Now using project "openshift-multus" on server "https://api.weliang-4142.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/            - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang ~]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang ~]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang ~]$ oc get pods | grep testrun
testrun-ip-reconciler-pmzs6           0/1     Error     0          6s
[weliang@weliang ~]$ oc logs testrun-ip-reconciler-pmzs6
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
    	the path to the Kubernetes configuration file
  -log-level ip-reconciler
    	the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
    	the value for a request timeout in seconds. (default 30)
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.10   True        False         18m     Cluster version is 4.10.10
[weliang@weliang ~]$
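
The long pipeline in the session above does three things: it drops server-managed metadata lines from the CronJob YAML, renames the CronJob to test-reconciler, and appends an invalid `-timeout=invalid` argument after `-log-level=verbose` so that the binary fails at flag parsing. The sketch below restates those steps as a function for readability. It is an assumption-laden rewrite rather than the command that was actually run: `make_test_cronjob` is a hypothetical name, and awk is swapped in for the GNU sed `a` command in the append step.

```shell
# Read the ip-reconciler CronJob YAML on stdin and write a test copy to stdout.
# make_test_cronjob is a hypothetical helper name used only for illustration.
make_test_cronjob() {
  # 1. Drop server-managed metadata lines (same patterns as the original command).
  grep -vE 'creationTimestamp|- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind: Network|name: cluster|uid:|resourceVersion' |
    # 2. Rename the CronJob so it does not clash with the managed one.
    sed 's/name: ip-reconciler/name: test-reconciler/' |
    # 3. Append an invalid -timeout argument right after -log-level=verbose.
    awk '{ print }
         /^ *- -log-level=verbose$/ {
           indent = $0
           sub(/-.*/, "", indent)   # keep only the leading whitespace of the matched line
           print indent "- -timeout=invalid"
         }'
}
```

It could be used as `oc get cronjob ip-reconciler -n openshift-multus -o yaml | make_test_cronjob > /tmp/reconcile.yml`, after which the `oc create` steps proceed as in the session above.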

Comment 10 errata-xmlrpc 2022-04-21 13:16:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1356