Bug 2048575 - IP reconciler cron job failing on single node
Summary: IP reconciler cron job failing on single node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Nikhil Simha
QA Contact: Weibin Liang
URL:
Whiteboard: SingleNode
Depends On:
Blocks: 2051639 2054791
 
Reported: 2022-01-31 14:31 UTC by Paul Maidment
Modified: 2022-11-10 14:08 UTC
CC List: 29 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The IP reconciliation cron job for the Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts.
Consequence: The cron job could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs.
Fix: The job was pointed at the api-int (internal load balancer) server address, and the API timeouts were extended.
Result: It was further determined that this cron job should only be launched on clusters that actively use the Whereabouts IPAM CNI.
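A quick way to confirm a given cluster carries the fix is to inspect the rendered CronJob. This is only a sketch: it assumes the api-int address is visible somewhere in the manifest, which may not hold for every release.

```
# Sketch: check whether the rendered ip-reconciler CronJob references the
# internal API load balancer (api-int), per the linked CNO pull 1302.
# Assumes the address appears in the manifest; adjust the grep if the
# endpoint is injected some other way (e.g. via a mounted kubeconfig).
oc get cronjob ip-reconciler -n openshift-multus -o yaml | grep -i api-int
```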
Clone Of:
Clones: 2051639
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:
Embargoed:


Links:
  Github openshift/cluster-network-operator pull 1302 (Merged): "Bug 2048575: The Whereabouts ip-reconciler should use api-int load balancer" - last updated 2022-02-15 13:41:45 UTC
  Red Hat Knowledge Base (Solution) 6869281 - last updated 2022-03-31 14:02:37 UTC
  Red Hat Product Errata RHSA-2022:5069 - last updated 2022-08-10 10:46:22 UTC

Description Paul Maidment 2022-01-31 14:31:37 UTC
Description of problem:

This occurs on single-node OpenShift (SNO) on AWS.
The Prow job fails because an alert fires for the failing ip-reconciler Job (as in a Kubernetes Job).

Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

Please see the following Prow job for more information:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

The failure occurred before any of our test code ran, indicating that the issue was during installation. 


How reproducible:

This is intermittent but frequent; the majority of single-node Prow jobs fail.

Steps to Reproduce:
1. Run the e2e job for single node.
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.

Here is an example:

```
oc logs -n openshift-multus ip-reconciler-27389955-cvl7z 
/home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880       1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Actual results:
The IP reconciliation fails with the error:

"failed to retrieve all IP pools: context deadline exceeded"


Expected results:
The job should not fail.

Additional info:

Comment 1 Douglas Smith 2022-01-31 22:11:50 UTC
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.
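Rather than waiting out the schedule manually, a watch like the following also works (a minimal sketch; the cronjob fires every 15 minutes per its `*/15 * * * *` schedule):

```
# Sketch: watch Jobs in openshift-multus until the next scheduled
# ip-reconciler run appears (every 15 minutes), then pull its logs.
oc get jobs -n openshift-multus -w
```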

The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                                  READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc           0/1     Completed   0          102s
```

This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).
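A rough way to test that theory (a sketch, not something from the original report) is to time a full API discovery sweep, which is the same per-group call pattern the reconciler performs before listing IP pools:

```
# Sketch: a full discovery pass touches every registered API group, just
# like the reconciler's client does. If this drags on for tens of seconds
# early in the cluster's life, the reconciler's deadline will be exceeded.
time oc api-resources > /dev/null
```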

Comment 4 Ramon Gordillo 2022-02-10 10:26:48 UTC
Same here in a 3-node cluster on 4.9.18 after an upgrade:

```
$ oc logs ip-reconciler-27408120--1-69r4g -n openshift-multus
I0210 10:05:55.088061       1 request.go:655] Throttling request took 1.1858412s, request: GET:https://172.30.0.1:443/apis/submariner.io/v1?timeout=32s
I0210 10:06:05.283595       1 request.go:655] Throttling request took 11.381687945s, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0210 10:06:15.480855       1 request.go:655] Throttling request took 7.59640097s, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1alpha1?timeout=32s
I0210 10:06:25.679363       1 request.go:655] Throttling request took 17.79476399s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
I0210 10:06:35.868398       1 request.go:655] Throttling request took 27.983689144s, request: GET:https://172.30.0.1:443/apis/v2v.kubevirt.io/v1beta1?timeout=32s
2022-02-10T10:06:41Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-10T10:06:41Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
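The throttled GETs above are discovery calls, one per API group, so clusters carrying many aggregated APIs (Submariner, KubeVirt, ACM, and so on) take proportionally longer to enumerate. A quick sketch to gauge that surface:

```
# Sketch: count registered APIServices; each one adds a discovery
# round-trip that client-side throttling spaces out, eating into the
# reconciler's context deadline.
oc get apiservices --no-headers | wc -l
```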

Comment 6 Weibin Liang 2022-02-11 16:47:26 UTC
Tested and verified on an AWS SNO cluster:

[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-02-11-140421   True        False         28m     Cluster version is 4.11.0-0.nightly-arm64-2022-02-11-140421
[weliang@weliang openshift-tests-private]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS              RESTARTS   AGE
ip-reconciler-27409965-m54mw          0/1     ContainerCreating   0          0s
multus-additional-cni-plugins-8p8vp   1/1     Running             0          45m
multus-admission-controller-c9xxc     2/2     Running             0          44m
multus-l9ht5                          1/1     Running             0          45m
network-metrics-daemon-9l4td          2/2     Running             0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          46m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          46m
network-metrics-daemon-9l4td          2/2     Running   0          46m

Comment 8 Ray Giguere 2022-03-15 15:19:13 UTC
3-node bare-metal private cluster:

Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf


```
I0314 23:45:12.983946 1 request.go:665] Waited for 11.360768514s due to client-side throttling, not priority and fairness, request: GET:https://api-int.ocp4.labnet.terra-trekker.com:6443/apis/controlplane.operator.openshift.io/v1alpha1?timeout=32s
2022-03-14T23:45:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-03-14T23:45:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
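For clusters that hit this but do not actively use Whereabouts, a stopgap along these lines has circulated (a sketch only, not an official workaround from this bug; note that the cluster-network-operator manages this object and may revert the change on its next sync):

```
# Sketch: suspend the ip-reconciler CronJob so failed runs stop firing
# alerts. CNO may undo this when it reconciles the object.
oc patch cronjob ip-reconciler -n openshift-multus \
  --type merge -p '{"spec":{"suspend":true}}'
```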

Comment 29 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

