Description of problem:

This occurs on single-node OpenShift on AWS. The Prow job fails because an alert fires when the ip-reconciler Job (as in a Kubernetes Job) fails.

Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-31-102954

Please see the following Prow job for more information:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25752/rehearse-25752-periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-with-workers/1487136907035938816

The failure occurred before any of our test code ran, indicating that the issue arose during installation.

How reproducible:
Intermittent but frequent; the majority of single-node Prow jobs fail.

Steps to Reproduce:
1. Run the e2e job for single node.
2. Note that the job fails due to issues in the ip-reconciler.
3. Grab the logs for the failed pod.

Here is an example:

```
oc logs -n openshift-multus ip-reconciler-27389955-cvl7z
/home/paulmaidment/scratch/1487136907035938816/artifacts/e2e-aws-single-node-with-workers/gather-must-gather/artifacts/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8bb2c196acb798b9b181a891e8e940d1ea2049f23b08ac003aba71bafc39f880/namespaces/openshift-multus/pods/ip-reconciler-27389955-cvl7z/whereabouts/whereabouts/logs/current.log
2022-01-28T19:17:11.687024928Z I0128 19:17:11.686880       1 request.go:655] Throttling request took 1.183501852s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-28T19:17:22.340310363Z 2022-01-28T19:17:22Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Actual results:
The IP reconciliation fails with the error "failed to retrieve all IP pools: context deadline exceeded".

Expected results:
The job should not fail.

Additional info:
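When triaging many runs, the same failure signature can be scanned for in an extracted must-gather tree rather than pulling pod logs one by one. A minimal sketch, assuming the archive has been unpacked locally (the `./must-gather` path and the `MUST_GATHER` variable are hypothetical):

```shell
# Scan an extracted must-gather tree for the ip-reconciler failure signature.
# MUST_GATHER is a hypothetical path to the unpacked archive.
MUST_GATHER=${MUST_GATHER:-./must-gather}
grep -r "failed to retrieve all IP pools: context deadline exceeded" \
    "$MUST_GATHER" && echo "signature found" || echo "signature not found"
```

Any hit points at the whereabouts container log for the failed ip-reconciler pod, as in the example above.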
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once. The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                          READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
```

This looks like an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).
Same here on a 3-node cluster on 4.9.18 after an upgrade:

```
$ oc logs ip-reconciler-27408120--1-69r4g -n openshift-multus
I0210 10:05:55.088061       1 request.go:655] Throttling request took 1.1858412s, request: GET:https://172.30.0.1:443/apis/submariner.io/v1?timeout=32s
I0210 10:06:05.283595       1 request.go:655] Throttling request took 11.381687945s, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0210 10:06:15.480855       1 request.go:655] Throttling request took 7.59640097s, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1alpha1?timeout=32s
I0210 10:06:25.679363       1 request.go:655] Throttling request took 17.79476399s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
I0210 10:06:35.868398       1 request.go:655] Throttling request took 27.983689144s, request: GET:https://172.30.0.1:443/apis/v2v.kubevirt.io/v1beta1?timeout=32s
2022-02-10T10:06:41Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-10T10:06:41Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
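The throttled requests above are all `GET /apis/<group>/<version>?timeout=32s` calls, i.e. client-side API discovery, so clusters with many installed API groups (submariner, kubevirt, ACM, etc.) appear to accumulate enough rate-limit delay to exceed the reconciler's context deadline. A rough sketch to total the reported delays from a saved pod log (the `ip-reconciler.log` filename is hypothetical, e.g. from `oc logs -n openshift-multus <ip-reconciler-pod> > ip-reconciler.log`):

```shell
# Total the client-side throttling delays reported in a saved reconciler log.
# ip-reconciler.log is a hypothetical file holding the pod's log output.
grep -o 'took [0-9.]*s' ip-reconciler.log |
  awk '{ gsub(/s$/, "", $2); total += $2 }
       END { printf "total throttled wait: %.1fs\n", total }'
```

Against the log in this comment that totals roughly 66 seconds of waiting on throttling alone, which is consistent with the context deadline being exceeded while discovery is still in progress.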
Tested and verified in an AWS SNO cluster:

```
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-02-11-140421   True        False         28m     Cluster version is 4.11.0-0.nightly-arm64-2022-02-11-140421

[weliang@weliang openshift-tests-private]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             45m

[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS              RESTARTS   AGE
ip-reconciler-27409965-m54mw          0/1     ContainerCreating   0          0s
multus-additional-cni-plugins-8p8vp   1/1     Running             0          45m
multus-admission-controller-c9xxc     2/2     Running             0          44m
multus-l9ht5                          1/1     Running             0          45m
network-metrics-daemon-9l4td          2/2     Running             0          45m

[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m

[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m

[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          46m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          46m
network-metrics-daemon-9l4td          2/2     Running   0          46m
```
3-node bare-metal private cluster, Server Version: 4.10.3, Kubernetes Version: v1.23.3+e419edf:

```
I0314 23:45:12.983946       1 request.go:665] Waited for 11.360768514s due to client-side throttling, not priority and fairness, request: GET:https://api-int.ocp4.labnet.terra-trekker.com:6443/apis/controlplane.operator.openshift.io/v1alpha1?timeout=32s
2022-03-14T23:45:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-03-14T23:45:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069