+++ This bug was initially created as a clone of Bug #2048575 +++

Description of problem:

This occurs on single node on AWS. The Prow job fails because an alert fires due to the ip-reconciler Job (as in a Kubernetes Job) failing.

Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

Please see the following Prow job for more information:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

The failure occurred before any of our test code ran, indicating that the issue arose during installation.

How reproducible:

This is intermittent but frequent; the majority of single-node Prow jobs fail.

Steps to Reproduce:
1. Run the e2e job for single node.
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.

Here is an example (the same log is available in the must-gather at /home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log):

```
$ oc logs -n openshift-multus ip-reconciler-27389955-cvl7z
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880       1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Actual results:

The IP reconciliation fails with the error "failed to retrieve all IP pools: context deadline exceeded".

Expected results:

The job should not fail.

Additional info:

--- Additional comment from Douglas Smith on 2022-01-31 22:11:50 UTC ---

I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once. The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However -- if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                          READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
```

This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).
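For context, the "Throttling request took ..." lines in the logs above come from client-go's client-side rate limiter, which kicks in when a binary enumerates many API discovery groups at startup. A minimal sketch of how a client could relax those limits; the helper name and values here are illustrative assumptions, not whereabouts code:

```go
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientWithHigherLimits is a hypothetical helper: it builds a clientset
// whose client-side rate limiter is less aggressive than the client-go
// defaults (QPS 5, Burst 10), which are what produce the
// "Throttling request took ..." messages during API discovery.
func newClientWithHigherLimits() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50 // illustrative values only
	cfg.Burst = 100
	return kubernetes.NewForConfig(cfg)
}
```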
On 4.10 CI payload runs, this issue has accounted for 50% of the rejected payloads (3 out of 6) in the past 24 hours. Across all upgrade jobs over the past two days, this alert has fired in 12% of runs: https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired+for+.*ip-reconciler&maxAge=48h&context=1&type=junit&name=4.10.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
In addition to failing promotion jobs, this pod is firing an alert in 12% of our upgrade jobs for 4.10.0. In the PR that switches to the internal load balancer to avoid SDN setup latency, we still see failures, which indicates that client-go needs to be updated to a level that retries transient requests; 1.23 is suggested. That same PR should also move this early component to the host network if possible, to reduce its dependencies.
This will require changes in two components:
* CNO: changes to use the internal load balancer
* Whereabouts: update to client-go (see the retry sketch below)

I'll start by using this BZ for the CNO side.
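Since whereabouts lists its IP pools through client-go, the update is about getting retries on transient request failures instead of giving up inside a single short deadline. A rough sketch of that pattern using client-go's generic retry helper; the retry predicate and the Pods list call stand in for the actual IP-pool listing, which is an assumption on my part:

```go
package reconciler

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// listWithRetry wraps a List call in exponential backoff so that a slow or
// briefly unreachable apiserver (common right after a single-node install)
// does not immediately fail the whole reconcile run.
func listWithRetry(ctx context.Context, cs kubernetes.Interface, ns string) error {
	backoff := wait.Backoff{
		Steps:    5,
		Duration: 500 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}
	return retry.OnError(backoff, func(err error) bool {
		// Retry only on errors that look transient; anything else is fatal.
		return apierrors.IsTimeout(err) ||
			apierrors.IsServerTimeout(err) ||
			apierrors.IsTooManyRequests(err)
	}, func() error {
		// Stand-in for the whereabouts IP-pool listing.
		_, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		return err
	})
}
```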
Bugfix included in accepted release 4.10.0-0.nightly-2022-02-11-123954, but testing failed in 4.10.0-0.nightly-arm64-2022-02-14-115915:

```
[weliang@weliang OCP-45842]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-arm64-2022-02-14-115915   True        False         26m     Cluster version is 4.10.0-0.nightly-arm64-2022-02-14-115915
[weliang@weliang OCP-45842]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5m57s           42m
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          35m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          43m
multus-admission-controller-qgtbf     2/2     Running   0          41m
multus-jl97q                          1/1     Running   0          43m
network-metrics-daemon-ls45k          2/2     Running   0          43m
[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp
Error from server (NotFound): pods "ip-reconciler-27414180-l86vp" not found
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          36m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          44m
multus-admission-controller-qgtbf     2/2     Running   0          42m
multus-jl97q                          1/1     Running   0          44m
network-metrics-daemon-ls45k          2/2     Running   0          44m
[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp -n openshift-multus
I0214 15:00:46.150657       1 request.go:665] Waited for 1.170199485s due to client-side throttling, not priority and fairness, request: GET:https://api-int.weliang-2142.qe.devcluster.openshift.com:6443/apis/operators.coreos.com/v1?timeout=32s
2022-02-14T15:00:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-14T15:00:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
We have a WIP PR against master for the reconciler to exit 0 when it encounters known errors: https://github.com/openshift/whereabouts-cni/pull/84/files
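The idea in that PR, as I understand it, is that a transient API error should not mark the Job as Failed (which is what fires the KubeJobFailed alert), since the cronjob retries on its next schedule anyway. A hedged sketch of the pattern; the error substrings and function names are illustrative, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// knownTransientErrors holds substrings of errors that indicate a transient
// API-connectivity problem rather than a real reconciliation bug. The list
// here is an illustrative guess, not the one in the actual PR.
var knownTransientErrors = []string{
	"context deadline exceeded",
	"connection refused",
}

func isKnownTransient(err error) bool {
	for _, substr := range knownTransientErrors {
		if strings.Contains(err.Error(), substr) {
			return true
		}
	}
	return false
}

func main() {
	err := runReconcile()
	if err == nil {
		return
	}
	if isKnownTransient(err) {
		// Exit 0: the Job completes instead of failing, so no alert fires,
		// and the next scheduled cronjob run will try again.
		fmt.Fprintf(os.Stderr, "transient error, deferring to next run: %v\n", err)
		os.Exit(0)
	}
	fmt.Fprintf(os.Stderr, "reconcile failed: %v\n", err)
	os.Exit(1)
}

// runReconcile is a placeholder for the actual whereabouts reconciliation.
func runReconcile() error {
	return errors.New("failed to retrieve all IP pools: context deadline exceeded")
}
```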
This issue happened on my 4.9 SNO cluster as well; I am cloning this to 4.9.z.

```
[bzhai@fedora dual-statck-ai]$ oc get pods|grep -vE "Running|Completed"
NAME                              READY   STATUS   RESTARTS   AGE
ip-reconciler-27415725--1-4j87z   0/1     Error    0          9m14s
ip-reconciler-27415725--1-9flcp   0/1     Error    0          11m
ip-reconciler-27415725--1-hr2s4   0/1     Error    0          13m
ip-reconciler-27415725--1-kfcbm   0/1     Error    0          13m
ip-reconciler-27415725--1-stpq6   0/1     Error    0          16m
ip-reconciler-27415725--1-tp72p   0/1     Error    0          14m
ip-reconciler-27415725--1-w7flf   0/1     Error    0          15m
ip-reconciler-27415740--1-jnvzl   0/1     Error    0          36s
ip-reconciler-27415740--1-klxjr   0/1     Error    0          67s
testrun-ip-reconciler--1-6fwjf    0/1     Error    0          72s
testrun-ip-reconciler--1-ptx44    0/1     Error    0          40s
```
This issue also happened with one partner using 4.10.9 on bare metal. They tell us it was not happening with 4.10.5, but that is something we cannot verify. I am not sure whether there is a version where the bug is fixed.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9110
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days