Bug 2051639

Summary: IP reconciler cron job failing on single node
Product: OpenShift Container Platform
Reporter: Ken Zhang <kenzhang>
Component: Networking
Assignee: Douglas Smith <dosmith>
Networking sub component: multus
QA Contact: Weibin Liang <weliang>
Status: CLOSED DEFERRED
Severity: high
Priority: high
CC: acapriot, anbhat, bzhai, deads, dgautam, dosmith, mduarted, mmarkand, musman, pmaidmen, satripat, swasthan, weliang
Version: 4.10
Target Milestone: ---
Target Release: 4.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: SingleNode
Clone Of: 2048575
Clones: 2054791
Last Closed: 2023-03-09 01:12:33 UTC
Bug Depends On: 2048575    
Bug Blocks: 2054791    

Description Ken Zhang 2022-02-07 17:06:18 UTC
+++ This bug was initially created as a clone of Bug #2048575 +++

Description of problem:

This occurs on single node on AWS.
The Prow job fails because an alert fires for the failing ip-reconciler Job (a Kubernetes Job).

Version-Release number of selected component (if applicable): registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

Please see the following Prow job for more information:

The failure occurred before any of our test code ran, indicating that the issue arose during installation.

How reproducible:

This is intermittent but frequent; the majority of single-node Prow jobs fail.

Steps to Reproduce:
1. Run the e2e job for single node. 
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.

Here is an example:

oc logs -n openshift-multus ip-reconciler-27389955-cvl7z 
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880       1 request.go:655] Throttling request took 1.183501852s, request: GET:
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

Actual results:
The IP reconciliation fails with the error 

"failed to retrieve all IP pools: context deadline exceeded"

Expected results:
The job should not fail.

Additional info:
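
For context, here is a minimal sketch (illustrative only, not the actual whereabouts code; fetchIPPools is a hypothetical stand-in for the pool listing) of how a bounded context produces this error when the apiserver is slow or unreachable:

package main

import (
	"context"
	"fmt"
	"time"
)

// fetchIPPools is a hypothetical stand-in for the reconciler's IP-pool
// listing; it simulates an apiserver that takes longer than the caller's
// deadline to respond.
func fetchIPPools(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // simulated slow/unreachable API
		return nil
	case <-ctx.Done():
		return fmt.Errorf("failed to retrieve all IP pools: %w", ctx.Err())
	}
}

func main() {
	// The reconciler bounds its API calls with a deadline; when the API
	// is unreachable for longer than that, the call fails with
	// "context deadline exceeded", as in the logs above.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := fetchIPPools(ctx); err != nil {
		fmt.Printf("[error] %v\n", err)
	}
}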

--- Additional comment from Douglas Smith on 2022-01-31 22:11:50 UTC ---

I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once.

The logs I got were:

$oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

However -- if I create a job from the cronjob manually, the job completes successfully, e.g.:

$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler

I see the job complete like so:

$ oc get pods | grep -iP "name|testrun"
NAME                                  READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc           0/1     Completed   0          102s

This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).

Comment 1 Ken Zhang 2022-02-07 17:10:47 UTC
On 4.10 CI payload runs, this issue has accounted for 50% of the rejected payloads (3 out of 6) in the past 24 hours. 

Across all the upgrade jobs, this has caused failures in 12% of runs over the past 2 days: https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired+for+.*ip-reconciler&maxAge=48h&context=1&type=junit&name=4.10.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 2 David Eads 2022-02-07 18:06:56 UTC
In addition to failing promotion jobs, this pod is firing an alert in 12% of our upgrade jobs for 4.10.0. In the PR that switches to the internal load balancer to avoid SDN setup latency, we see failures indicating that client-go needs to be updated to a level that performs retries; 1.23 is suggested.

In the PR that switches to the internal load balancer, we should also update the pod to use host networking if possible, to avoid extra dependencies for this early-lifecycle component.
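
For illustration, a minimal sketch of the kind of client-side retry being suggested, assuming a standard client-go setup (the poll interval, overall timeout, and the choice of a Pods list are illustrative, not the actual fix):

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as the reconciler pod would use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Poll instead of failing on the first transient connectivity error:
	// each attempt gets its own short deadline, and errors are treated as
	// retryable until the overall 30s budget is exhausted.
	var pods *corev1.PodList
	err = wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		p, listErr := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if listErr != nil {
			return false, nil // retry on transient API errors
		}
		pods = p
		return true, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods\n", len(pods.Items))
}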

Comment 3 Douglas Smith 2022-02-07 20:23:48 UTC
This will require changes on two components:

* CNO: Changes to use internal load balancer
* Whereabouts: Update to client-go

I'll start by using this BZ for the CNO side.

Comment 7 Weibin Liang 2022-02-14 15:40:12 UTC
The bugfix is included in the accepted release 4.10.0-0.nightly-2022-02-11-123954, but testing failed in 4.10.0-0.nightly-arm64-2022-02-14-115915:

[weliang@weliang OCP-45842]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-arm64-2022-02-14-115915   True        False         26m     Cluster version is 4.10.0-0.nightly-arm64-2022-02-14-115915
[weliang@weliang OCP-45842]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5m57s           42m
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          35m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          43m
multus-admission-controller-qgtbf     2/2     Running   0          41m
multus-jl97q                          1/1     Running   0          43m
network-metrics-daemon-ls45k          2/2     Running   0          43m
[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp
Error from server (NotFound): pods "ip-reconciler-27414180-l86vp" not found
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          36m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          44m
multus-admission-controller-qgtbf     2/2     Running   0          42m
multus-jl97q                          1/1     Running   0          44m
network-metrics-daemon-ls45k          2/2     Running   0          44m

[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp -n openshift-multus
I0214 15:00:46.150657       1 request.go:665] Waited for 1.170199485s due to client-side throttling, not priority and fairness, request: GET:https://api-int.weliang-2142.qe.devcluster.openshift.com:6443/apis/operators.coreos.com/v1?timeout=32s
2022-02-14T15:00:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-14T15:00:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
[weliang@weliang OCP-45842]$

Comment 8 Douglas Smith 2022-02-14 19:23:13 UTC
We have a WIP PR against master for the reconciler to exit 0 when it encounters known errors: https://github.com/openshift/whereabouts-cni/pull/84/files
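
For reference, a minimal sketch of that approach (illustrative, not the PR's code; reconcile is a hypothetical stand-in for the reconcile looper): known transient errors are logged but treated as non-fatal, so the Job pod does not end up in Error and fire KubeJobFailed:

package main

import (
	"context"
	"errors"
	"fmt"
	"os"
)

// reconcile is a hypothetical stand-in for the whereabouts reconcile
// looper; here it always fails with a wrapped deadline error.
func reconcile() error {
	return fmt.Errorf("failed to retrieve all IP pools: %w", context.DeadlineExceeded)
}

func main() {
	if err := reconcile(); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Known transient error: log it but exit 0 so the cronjob
			// pod is marked Completed rather than Failed.
			fmt.Fprintf(os.Stderr, "[error] %v (treating as non-fatal)\n", err)
			os.Exit(0)
		}
		fmt.Fprintf(os.Stderr, "[error] %v\n", err)
		os.Exit(1)
	}
}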

Comment 10 bzhai 2022-02-15 17:26:23 UTC
This issue happened on my 4.9 SNO cluster as well; I am cloning this to 4.9.z.

[bzhai@fedora dual-statck-ai]$ oc get pods|grep -vE "Running|Completed"
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27415725--1-4j87z       0/1     Error     0          9m14s
ip-reconciler-27415725--1-9flcp       0/1     Error     0          11m
ip-reconciler-27415725--1-hr2s4       0/1     Error     0          13m
ip-reconciler-27415725--1-kfcbm       0/1     Error     0          13m
ip-reconciler-27415725--1-stpq6       0/1     Error     0          16m
ip-reconciler-27415725--1-tp72p       0/1     Error     0          14m
ip-reconciler-27415725--1-w7flf       0/1     Error     0          15m
ip-reconciler-27415740--1-jnvzl       0/1     Error     0          36s
ip-reconciler-27415740--1-klxjr       0/1     Error     0          67s
testrun-ip-reconciler--1-6fwjf        0/1     Error     0          72s
testrun-ip-reconciler--1-ptx44        0/1     Error     0          40s

Comment 17 Jose Gato 2022-04-25 13:02:50 UTC
This issue also happened with one partner using 4.10.9 on bare metal. They tell us it was not happening with 4.10.5, but that is something we cannot verify. I am not sure whether there is a version in which the bug is fixed.

Comment 21 Shiftzilla 2023-03-09 01:12:33 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.


Comment 22 Red Hat Bugzilla 2023-09-18 04:31:41 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days