Bug 2048575 - IP reconciler cron job failing on single node
Summary: IP reconciler cron job failing on single node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Nikhil Simha
QA Contact: Weibin Liang
URL:
Whiteboard: SingleNode
Depends On:
Blocks: 2051639 2054791
 
Reported: 2022-01-31 14:31 UTC by Paul Maidment
Modified: 2022-11-10 14:08 UTC
CC List: 29 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The IP reconciliation cron job for the Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts.
Consequence: The cron job could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs.
Fix: The job was pointed at the api-int (internal load balancer) server address, and the API timeouts were extended.
Result: It was further determined that this cron job should only be launched on clusters that actively use the Whereabouts IPAM CNI.
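A quick way to confirm a given cluster carries the fix is to inspect the rendered CronJob. This is only a sketch: it assumes the api-int address is visible somewhere in the manifest, which may not hold for every release.

```
# Sketch: check whether the rendered ip-reconciler CronJob references the
# internal API load balancer (api-int), per the linked CNO pull 1302.
# Assumes the address appears in the manifest; adjust the grep if the
# endpoint is injected some other way (e.g. via a mounted kubeconfig).
oc get cronjob ip-reconciler -n openshift-multus -o yaml | grep -i api-int
```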
Clone Of:
Clones: 2051639
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:
Embargoed:


Links:
  Github openshift/cluster-network-operator pull 1302 (Merged): "Bug 2048575: The Whereabouts ip-reconciler should use api-int load balancer" - last updated 2022-02-15 13:41:45 UTC
  Red Hat Knowledge Base (Solution) 6869281 - last updated 2022-03-31 14:02:37 UTC
  Red Hat Product Errata RHSA-2022:5069 - last updated 2022-08-10 10:46:22 UTC

Description Paul Maidment 2022-01-31 14:31:37 UTC
Description of problem:

This occurs on single-node OpenShift (SNO) on AWS.
The Prow job fails because an alert fires for the failing ip-reconciler Job (as in a Kubernetes Job).

Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

Please see the following Prow job for more information:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

The failure occurred before any of our test code ran, indicating that the issue was during installation. 


How reproducible:

This is intermittent but frequent; the majority of single-node Prow jobs fail.

Steps to Reproduce:
1. Run the e2e job for single node.
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.

Here is an example:

```
oc logs -n openshift-multus ip-reconciler-27389955-cvl7z 
/home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880       1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Actual results:
The IP reconciliation fails with the error:

"failed to retrieve all IP pools: context deadline exceeded"


Expected results:
The job should not fail.

Additional info:

Comment 1 Douglas Smith 2022-01-31 22:11:50 UTC
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.
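Rather than waiting out the schedule manually, a watch like the following also works (a minimal sketch; the cronjob fires every 15 minutes per its `*/15 * * * *` schedule):

```
# Sketch: watch Jobs in openshift-multus until the next scheduled
# ip-reconciler run appears (every 15 minutes), then pull its logs.
oc get jobs -n openshift-multus -w
```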

The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                                  READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc           0/1     Completed   0          102s
```

This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).
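A rough way to test that theory (a sketch, not something from the original report) is to time a full API discovery sweep, which is the same per-group call pattern the reconciler performs before listing IP pools:

```
# Sketch: a full discovery pass touches every registered API group, just
# like the reconciler's client does. If this drags on for tens of seconds
# early in the cluster's life, the reconciler's deadline will be exceeded.
time oc api-resources > /dev/null
```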

Comment 4 Ramon Gordillo 2022-02-10 10:26:48 UTC
Same here in a 3-node cluster on 4.9.18 after an upgrade:

```
$ oc logs ip-reconciler-27408120--1-69r4g -n openshift-multus
I0210 10:05:55.088061       1 request.go:655] Throttling request took 1.1858412s, request: GET:https://172.30.0.1:443/apis/submariner.io/v1?timeout=32s
I0210 10:06:05.283595       1 request.go:655] Throttling request took 11.381687945s, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0210 10:06:15.480855       1 request.go:655] Throttling request took 7.59640097s, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1alpha1?timeout=32s
I0210 10:06:25.679363       1 request.go:655] Throttling request took 17.79476399s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
I0210 10:06:35.868398       1 request.go:655] Throttling request took 27.983689144s, request: GET:https://172.30.0.1:443/apis/v2v.kubevirt.io/v1beta1?timeout=32s
2022-02-10T10:06:41Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-10T10:06:41Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
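The throttled GETs above are discovery calls, one per API group, so clusters carrying many aggregated APIs (Submariner, KubeVirt, ACM, and so on) take proportionally longer to enumerate. A quick sketch to gauge that surface:

```
# Sketch: count registered APIServices; each one adds a discovery
# round-trip that client-side throttling spaces out, eating into the
# reconciler's context deadline.
oc get apiservices --no-headers | wc -l
```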

Comment 6 Weibin Liang 2022-02-11 16:47:26 UTC
Tested and verified on an AWS SNO cluster:

[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-02-11-140421   True        False         28m     Cluster version is 4.11.0-0.nightly-arm64-2022-02-11-140421
[weliang@weliang openshift-tests-private]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS              RESTARTS   AGE
ip-reconciler-27409965-m54mw          0/1     ContainerCreating   0          0s
multus-additional-cni-plugins-8p8vp   1/1     Running             0          45m
multus-admission-controller-c9xxc     2/2     Running             0          44m
multus-l9ht5                          1/1     Running             0          45m
network-metrics-daemon-9l4td          2/2     Running             0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          46m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          46m
network-metrics-daemon-9l4td          2/2     Running   0          46m

Comment 8 Ray Giguere 2022-03-15 15:19:13 UTC
3-node bare-metal private cluster:

Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf


```
I0314 23:45:12.983946 1 request.go:665] Waited for 11.360768514s due to client-side throttling, not priority and fairness, request: GET:https://api-int.ocp4.labnet.terra-trekker.com:6443/apis/controlplane.operator.openshift.io/v1alpha1?timeout=32s
2022-03-14T23:45:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-03-14T23:45:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
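For clusters that hit this but do not actively use Whereabouts, a stopgap along these lines has circulated (a sketch only, not an official workaround from this bug; note that the cluster-network-operator manages this object and may revert the change on its next sync):

```
# Sketch: suspend the ip-reconciler CronJob so failed runs stop firing
# alerts. CNO may undo this when it reconciles the object.
oc patch cronjob ip-reconciler -n openshift-multus \
  --type merge -p '{"spec":{"suspend":true}}'
```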

Comment 29 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

