Bug 2051639
Summary: | IP reconciler cron job failing on single node | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ken Zhang <kenzhang> | |
Component: | Networking | Assignee: | Douglas Smith <dosmith> | |
Networking sub component: | multus | QA Contact: | Weibin Liang <weliang> | |
Status: | CLOSED DEFERRED | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | acapriot, anbhat, bzhai, deads, dgautam, dosmith, mduarted, mmarkand, musman, pmaidmen, satripat, swasthan, weliang | |
Version: | 4.10 | |||
Target Milestone: | --- | |||
Target Release: | 4.10.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | SingleNode | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | 2048575 | |||
: | 2054791 | Environment: | ||
Last Closed: | 2023-03-09 01:12:33 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 2048575 | |||
Bug Blocks: | 2054791 |
Description
Ken Zhang
2022-02-07 17:06:18 UTC
On 4.10 CI payload runs, this issue has accounted for 50% of the rejected payloads (3 out of 6) in the past 24 hours. Across all upgrade jobs, it has failed in 12% of runs over the past 2 days:

https://search.ci.openshift.org/?search=alert+KubeJobFailed+fired+for+.*ip-reconciler&maxAge=48h&context=1&type=junit&name=4.10.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

In addition to failing promotion jobs, this pod is firing an alert in 12% of our upgrade jobs for 4.10.0.

In the PR that switches to the internal load balancer to avoid SDN setup latency, we see failures that indicate client-go needs to be updated to a level that does retries. 1.23 is suggested. In that same PR, we should also switch to the hostnetwork if possible, to avoid dependencies for this early component too.

This will require changes on two components:
* CNO: Changes to use internal load balancer
* Whereabouts: Update client-go

I'll start by using this BZ for the CNO side.
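To illustrate what "a level that does retries" buys the reconciler, here is a minimal Python sketch of retrying a call with exponential backoff. It is not client-go's actual implementation: `TransientError`, `call_with_retries`, and the default attempt count and delays are all hypothetical names and values chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (for example, a timeout)."""


def call_with_retries(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TransientError wait and retry, doubling the delay each time.

    The attempt count and delays are assumptions for this sketch, not values
    taken from client-go or Whereabouts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= 2
```

With a policy like this, a request that fails while the API endpoint is briefly unreachable gets another chance instead of failing the whole job on the first error.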
Bugfix included in accepted release 4.10.0-0.nightly-2022-02-11-123954, but testing failed in 4.10.0-0.nightly-arm64-2022-02-14-115915

[weliang@weliang OCP-45842]$ oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-arm64-2022-02-14-115915   True        False         26m     Cluster version is 4.10.0-0.nightly-arm64-2022-02-14-115915
[weliang@weliang OCP-45842]$ oc get cronjobs -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5m57s           42m
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          35m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          43m
multus-admission-controller-qgtbf     2/2     Running   0          41m
multus-jl97q                          1/1     Running   0          43m
network-metrics-daemon-ls45k          2/2     Running   0          43m
[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp
Error from server (NotFound): pods "ip-reconciler-27414180-l86vp" not found
[weliang@weliang OCP-45842]$ oc get pods -n openshift-multus
NAME                                  READY   STATUS    RESTARTS   AGE
ip-reconciler-27414180-l86vp          0/1     Error     0          36m
multus-additional-cni-plugins-nhv7f   1/1     Running   0          44m
multus-admission-controller-qgtbf     2/2     Running   0          42m
multus-jl97q                          1/1     Running   0          44m
network-metrics-daemon-ls45k          2/2     Running   0          44m
[weliang@weliang OCP-45842]$ oc logs ip-reconciler-27414180-l86vp -n openshift-multus
I0214 15:00:46.150657 1 request.go:665] Waited for 1.170199485s due to client-side throttling, not priority and fairness, request: GET:https://api-int.weliang-2142.qe.devcluster.openshift.com:6443/apis/operators.coreos.com/v1?timeout=32s
2022-02-14T15:00:55Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-14T15:00:55Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded
[weliang@weliang OCP-45842]$

We have a WIP PR against master for the reconciler to exit 0 when it encounters known errors:
https://github.com/openshift/whereabouts-cni/pull/84/files

This issue happened on my 4.9 SNO cluster as well, so I am cloning this to 4.9.z.

[bzhai@fedora dual-statck-ai]$ oc get pods|grep -vE "Running|Completed"
NAME                              READY   STATUS   RESTARTS   AGE
ip-reconciler-27415725--1-4j87z   0/1     Error    0          9m14s
ip-reconciler-27415725--1-9flcp   0/1     Error    0          11m
ip-reconciler-27415725--1-hr2s4   0/1     Error    0          13m
ip-reconciler-27415725--1-kfcbm   0/1     Error    0          13m
ip-reconciler-27415725--1-stpq6   0/1     Error    0          16m
ip-reconciler-27415725--1-tp72p   0/1     Error    0          14m
ip-reconciler-27415725--1-w7flf   0/1     Error    0          15m
ip-reconciler-27415740--1-jnvzl   0/1     Error    0          36s
ip-reconciler-27415740--1-klxjr   0/1     Error    0          67s
testrun-ip-reconciler--1-6fwjf    0/1     Error    0          72s
testrun-ip-reconciler--1-ptx44    0/1     Error    0          40s

This issue also happened with one partner using 4.10.9 on bare metal. They tell us it was not happening with 4.10.5, but that is something we cannot check. I am not sure if we have a version where the bug is fixed.

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9110

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
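For readers following up on the WIP PR mentioned earlier in this thread (exit 0 on known errors), the idea can be sketched in Python as below. This is only an illustration and not the code from that PR: `KNOWN_ERRORS` and `exit_code_for` are hypothetical names, and treating "context deadline exceeded" as a known error is an assumption based on the log output quoted in this report.

```python
# Assumption: the only "known" error here is the timeout seen in the logs
# quoted in this report; the real PR may recognize a different set.
KNOWN_ERRORS = ("context deadline exceeded",)


def exit_code_for(error_message):
    """Map a reconciler result to a process exit code.

    None means the run succeeded. A known error also maps to 0 so the job
    is not marked as failed; anything else maps to 1.
    """
    if error_message is None:
        return 0
    if any(known in error_message for known in KNOWN_ERRORS):
        return 0
    return 1
```

The trade-off is that a run hitting a known error is reported as successful, so the job no longer fails on it, while unexpected errors still fail the job.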