Bug 2048575
Summary: IP reconciler cron job failing on single node
Product: OpenShift Container Platform
Component: Networking
Networking sub component: multus
Reporter: Paul Maidment <pmaidmen>
Assignee: Nikhil Simha <nsimha>
QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: acapriot, adjire, anbhat, bfurtado, calfonso, dgautam, dosmith, etroitskiy, evadla, jgato, jocolema, leandro.rebosio, mduarted, mrobson, mtarsel, musman, ngirard, nsimha, ocasalsa, pmaidmen, pupadhya, ramon.gordillo, rgiguere, rhowe, rpalstra, satripat, sbai, sninganu, swasthan
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SingleNode
Doc Type: Bug Fix
Doc Text: |
Cause: The IP reconciliation cron job for the Whereabouts IPAM CNI could fail due to API connectivity issues or API server timeouts.
Consequence: The cron job could fail intermittently. In most cases this failure has no impact on customer clusters, and the job succeeds on subsequent runs.
Fix: The job was changed to use the internal API server address, and the API timeouts were extended.
Result: It was also determined that this cron job should only be launched on clusters that actively use the Whereabouts IPAM CNI.
|
Cloned As: 2051639 (view as bug list)
Last Closed: 2022-08-10 10:45:53 UTC
Type: Bug
Bug Blocks: 2051639, 2054791
Description
Paul Maidment
2022-01-31 14:31:37 UTC
I was able to reproduce this on cluster-bot with `launch ci single-node` and letting the cronjob run (at least) once. The logs I got were:

```
$ oc logs ip-reconciler-27394425-wqwdj
I0131 21:47:10.708763       1 request.go:655] Throttling request took 1.181969455s, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
2022-01-31T21:47:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-01-31T21:47:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

However, if I create a job from the cronjob manually, the job completes successfully, e.g.:

```
$ oc create job --from=cronjob/ip-reconciler -n openshift-multus testrun-ip-reconciler
```

I see the job complete like so:

```
$ oc get pods | grep -iP "name|testrun"
NAME                          READY   STATUS      RESTARTS   AGE
testrun-ip-reconciler-pwrmc   0/1     Completed   0          102s
```

This appears to be an API connectivity issue at some point in the cluster lifecycle (notably, it seems, during installation).

Same here in a 3-node cluster on 4.9.18 after an upgrade:

```
> oc logs ip-reconciler-27408120--1-69r4g -n openshift-multus
I0210 10:05:55.088061       1 request.go:655] Throttling request took 1.1858412s, request: GET:https://172.30.0.1:443/apis/submariner.io/v1?timeout=32s
I0210 10:06:05.283595       1 request.go:655] Throttling request took 11.381687945s, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0210 10:06:15.480855       1 request.go:655] Throttling request took 7.59640097s, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1alpha1?timeout=32s
I0210 10:06:25.679363       1 request.go:655] Throttling request took 17.79476399s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
I0210 10:06:35.868398       1 request.go:655] Throttling request took 27.983689144s, request: GET:https://172.30.0.1:443/apis/v2v.kubevirt.io/v1beta1?timeout=32s
2022-02-10T10:06:41Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-10T10:06:41Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Tested and verified in an AWS SNO cluster:

```
[weliang@weliang openshift-tests-private]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-arm64-2022-02-11-140421   True        False         28m     Cluster version is 4.11.0-0.nightly-arm64-2022-02-11-140421
[weliang@weliang openshift-tests-private]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS              RESTARTS   AGE
ip-reconciler-27409965-m54mw          0/1     ContainerCreating   0          0s
multus-additional-cni-plugins-8p8vp   1/1     Running             0          45m
multus-admission-controller-c9xxc     2/2     Running             0          44m
multus-l9ht5                          1/1     Running             0          45m
network-metrics-daemon-9l4td          2/2     Running             0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          45m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          45m
network-metrics-daemon-9l4td          2/2     Running   0          45m
[weliang@weliang openshift-tests-private]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
multus-additional-cni-plugins-8p8vp   1/1     Running   0          46m
multus-admission-controller-c9xxc     2/2     Running   0          44m
multus-l9ht5                          1/1     Running   0          46m
network-metrics-daemon-9l4td          2/2     Running   0          46m
```

3-node bare metal private cluster:

```
Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf
I0314 23:45:12.983946       1 request.go:665] Waited for 11.360768514s due to client-side throttling, not priority and fairness, request: GET:https://api-int.ocp4.labnet.terra-trekker.com:6443/apis/controlplane.operator.openshift.io/v1alpha1?timeout=32s
2022-03-14T23:45:20Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-03-14T23:45:20Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
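The Doc Text above states that the fix pointed the job at the internal API server address and extended the API timeouts. A minimal sketch of what such a change can look like on a CronJob follows; this is not the actual shipped manifest, and the image, entrypoint, and host names are hypothetical placeholders. It relies on the general client-go behavior that `KUBERNETES_SERVICE_HOST`/`KUBERNETES_SERVICE_PORT` control which API endpoint in-cluster clients dial:

```yaml
# Illustrative sketch only -- image, command, and domain are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ip-reconciler
  namespace: openshift-multus
spec:
  schedule: "*/15 * * * *"        # matches the schedule seen in the QA output above
  concurrencyPolicy: Forbid       # avoid overlapping reconcile runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: whereabouts
            image: example.invalid/whereabouts:latest   # placeholder image
            command: ["/ip-reconciler"]                 # hypothetical entrypoint
            env:
            # Point client-go at the internal API endpoint (api-int) instead of
            # the in-cluster service VIP, so the job is less exposed to external
            # load-balancer or DNS hiccups during the cluster lifecycle.
            - name: KUBERNETES_SERVICE_HOST
              value: api-int.example.cluster.local      # hypothetical domain
            - name: KUBERNETES_SERVICE_PORT
              value: "6443"
```

How the timeout extension is wired up (flag or environment variable) depends on the reconciler binary itself; the point of the fix is to avoid the combination of client-side discovery throttling and the 32s request timeout visible in the logs above.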