Bug 2048575
Summary: IP reconciler cron job failing on single node
Product: OpenShift Container Platform
Component: Networking
Networking sub component: multus
Reporter: Paul Maidment <pmaidmen>
Assignee: Nikhil Simha <nsimha>
QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: acapriot, adjire, anbhat, bfurtado, calfonso, dgautam, dosmith, etroitskiy, evadla, jgato, jocolema, leandro.rebosio, mduarted, mrobson, mtarsel, musman, ngirard, nsimha, ocasalsa, pmaidmen, pupadhya, ramon.gordillo, rgiguere, rhowe, rpalstra, satripat, sbai, sninganu, swasthan
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SingleNode
Doc Type: Bug Fix
Doc Text: |
Cause: The IP reconciliation cron job for the Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts.
Consequence: The cron job could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs.
Fix: The job was changed to use the internal API server address, and the API timeouts were extended.
Result: It was also determined that this cron job should only be launched on clusters that actively use the Whereabouts IPAM CNI.
|
Cloned As: 2051639 (view as bug list)
Last Closed: 2022-08-10 10:45:53 UTC
Type: Bug
Bug Blocks: 2051639, 2054791
Description
Paul Maidment
2022-01-31 14:31:37 UTC
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once. The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                          READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
```

This appears to be an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).

Same here in a 3-node cluster on 4.9.18 after an upgrade:

```
> oc logs ip-reconciler-27408120--1-69r4g -n openshift-multus
I0210 10:05:55.088061       1 request.go:655] Throttling request took 1.1858412s, request: GET:https://172.30.0.1:443/apis/submariner.io/v1?timeout=32s
I0210 10:06:05.283595       1 request.go:655] Throttling request took 11.381687945s, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0210 10:06:15.480855       1 request.go:655] Throttling request took 7.59640097s, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1alpha1?timeout=32s
I0210 10:06:25.679363       1 request.go:655] Throttling request took 17.79476399s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
I0210 10:06:35.868398       1 request.go:655] Throttling request took 27.983689144s, request: GET:https://172.30.0.1:443/apis/v2v.kubevirt.io/v1beta1?timeout=32s
2022-02-10T10:06:41Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-10T10:06:41Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Tested and verified in an AWS SNO cluster:

```
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-02-11-140421   True        False         28m     Cluster version is 4.11.0-0.nightly-arm64-2022-02-11-140421
[weliang@weliang openshift-tests-private]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS              RESTARTS   AGE
ip-reconciler-27409965-m54mw          0/1     ContainerCreating   0          0s
multus-additional-cni-plugins-8p8vp   1/1     Running             0          45m
multus-admission-controller-c9xxc     2/2     Running             0          44m
multus-l9ht5                          1/1     Running             0          45m
network-metrics-daemon-9l4td          2/2     Running             0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          46m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          46m
network-metrics-daemon-9l4td          2/2     Running   0          46m
```

3-node bare metal private cluster:

```
Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf
I0314 23:45:12.983946       1 request.go:665] Waited for 11.360768514s due to client-side throttling, not priority and fairness, request: GET:https://api-int.ocp4.labnet.terra-trekker.com:6443/apis/controlplane.operator.openshift.io/v1alpha1?timeout=32s
2022-03-14T23:45:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-03-14T23:45:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
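The Doc Text above states that the fix pointed the job at the internal API server address and extended the API timeouts. A minimal sketch of what such a change can look like on a CronJob follows; this is not the actual shipped manifest, and the image, entrypoint, and host names are hypothetical placeholders. It relies on the general client-go behavior that `KUBERNETES_SERVICE_HOST`/`KUBERNETES_SERVICE_PORT` control which API endpoint in-cluster clients dial:

```yaml
# Illustrative sketch only -- image, command, and domain are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ip-reconciler
  namespace: openshift-multus
spec:
  schedule: "*/15 * * * *"        # matches the schedule seen in the QA output above
  concurrencyPolicy: Forbid       # avoid overlapping reconcile runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: whereabouts
            image: example.invalid/whereabouts:latest   # placeholder image
            command: ["/ip-reconciler"]                 # hypothetical entrypoint
            env:
            # Point client-go at the internal API endpoint (api-int) instead of
            # the in-cluster service VIP, so the job is less exposed to external
            # load-balancer or DNS hiccups during the cluster lifecycle.
            - name: KUBERNETES_SERVICE_HOST
              value: api-int.example.cluster.local      # hypothetical domain
            - name: KUBERNETES_SERVICE_PORT
              value: "6443"
```

How the timeout extension is wired up (flag or environment variable) depends on the reconciler binary itself; the point of the fix is to avoid the combination of client-side discovery throttling and the 32s request timeout visible in the logs above.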