Description of problem:
The cluster needed to drain and replace a node that had the ip-reconciler job pod running on it. The drain stalled with the ip-reconciler pod stuck in a Terminating state. The logs of the ip-reconciler pod seemed to indicate that it never ran successfully:

---
2022-03-17T21:16:18Z [error] failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
2022-03-17T21:16:18Z [error] failed to create the reconcile looper: failed to instantiate the Kubernetes client: Get "https://api-int.zen-aws.rqyj.p1.openshiftapps.com:6443/api?timeout=32s": dial tcp 10.0.203.87:6443: connect: connection refused
---

Version-Release number of selected component (if applicable):
4.10.3

Actual results:

Expected results:
The ip-reconciler job should not result in the cluster getting into a state that blocks the node drain.

Additional info:
Thanks for the report. There are a couple of parts to this, as we're in the midst of backporting a number of fixes for this cronjob. I've gone ahead and added this to our known errors so that hopefully the process exits cleanly in the future. It's also been noted that some of the other pending backports may aid in this situation as well. The PR to watch is: https://github.com/openshift/whereabouts-cni/pull/88 for 4.10.
Following the verification steps from https://gist.github.com/dougbtv/b84c5dec4953f4b85048d16ddcf72c15, testing passes in 4.10.10:

[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/ - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
Error from server (NotFound): cronjobs.batch "ip-reconciler" not found
[weliang@weliang ~]$ oc project openshift-multus
Now using project "openshift-multus" on server "https://api.weliang-4142.qe.gcp.devcluster.openshift.com:6443".
[weliang@weliang ~]$ oc get cronjob ip-reconciler -o yaml | grep -vP "creationTimestamp|\- apiVersion|ownerReferences|blockOwnerDeletion|controller|kind\: Network|name\: cluster|uid\:|resourceVersion" | sed 's/name: ip-reconciler/name: test-reconciler/' | sed '/ - -log-level=verbose/a \ \ \ \ \ \ \ \ \ \ \ \ - -timeout=invalid' > /tmp/reconcile.yml
[weliang@weliang ~]$ oc create -f /tmp/reconcile.yml
cronjob.batch/test-reconciler created
[weliang@weliang ~]$ oc create job --from=cronjob/test-reconciler -n openshift-multus testrun-ip-reconciler
job.batch/testrun-ip-reconciler created
[weliang@weliang ~]$ oc get pods | grep testrun
testrun-ip-reconciler-pmzs6   0/1   Error   0   6s
[weliang@weliang ~]$ oc logs testrun-ip-reconciler-pmzs6
invalid value "invalid" for flag -timeout: parse error
Usage of /ip-reconciler:
  -kubeconfig string
        the path to the Kubernetes configuration file
  -log-level ip-reconciler
        the logging level for the ip-reconciler app. Valid values are: "debug", "verbose", "error", and "panic". (default "error")
  -timeout int
        the value for a request timeout in seconds. (default 30)
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.10   True        False         18m     Cluster version is 4.10.10
[weliang@weliang ~]$
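As a side note on why the test pod lands in Error so quickly: the message and usage text in the log look like the output of Go's standard `flag` package. Assuming ip-reconciler does use that package (the thread does not confirm it), the minimal sketch below reproduces the failure. `parseTimeout` is a hypothetical helper; only the flag name, default, and help text come from the log.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseTimeout parses ip-reconciler-style arguments with a private FlagSet.
// ContinueOnError makes Parse return the error instead of exiting, which
// keeps the sketch easy to inspect.
func parseTimeout(args []string) (int, error) {
	fs := flag.NewFlagSet("ip-reconciler", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress the usage text in this example
	timeout := fs.Int("timeout", 30, "the value for a request timeout in seconds.")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}

func main() {
	// The invalid value injected by the verification steps.
	if _, err := parseTimeout([]string{"-timeout=invalid"}); err != nil {
		fmt.Println("parse failed:", err)
	}

	// A valid value for comparison.
	if t, err := parseTimeout([]string{"-timeout=45"}); err == nil {
		fmt.Println("timeout:", t)
	}
}
```

With the package's default `flag.ExitOnError` behavior the process prints the usage text and exits with a non-zero status right away, which is consistent with the pod showing Error after a few seconds.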
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1356