Bug 2073452
| Field | Value |
|---|---|
| Summary | [sig-network] pods should successfully create sandboxes by other - failed (add) |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | multus |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.11 |
| Keywords | Reopened |
| Target Milestone | --- |
| Target Release | 4.11.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Devan Goodwin <dgoodwin> |
| Assignee | Douglas Smith <dosmith> |
| QA Contact | Weibin Liang <weliang> |
| CC | bbennett, deads, dperique, ffernand, jluhrsen, kenzhang, nmanos, resoni, sippy, stbenjam, tjungblu, wking |
| Type | Bug |
| Last Closed | 2022-08-10 11:05:44 UTC |
Description
Devan Goodwin
2022-04-08 14:10:39 UTC
The failure seems to have resolved itself, but nonetheless we would really love some idea of what happened here.

> CNI network "" not found

This comes from ocicni, and I believe I've seen this before when there have been problems with the libcni cache. Any chance we can get a look at it from the node side? This happens before Multus is invoked, in ocicni: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L502

This is happening again: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514514596856074240 TRT thinks we should reconsider the priority.

Changed severity to high since this is failing the nightly payload and affecting the org's delivery.

We separated out this specific test case. Here is the distribution of the error: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests-upgrade[…]y%20create%20sandboxes%20by%20adding%20pod%20to%20network. Interestingly, it is affecting AWS more than other platforms. Overall it is found in 1.48% of runs (7.59% of failures) across 24066 total runs and 2518 jobs (19.50% failed), based on this search result: https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+adding+pod+to+network&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I have looked at the new class of issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=2073452#c6 and I've determined it's not a node issue. They look like they can be broken up into two buckets:

- some seem to be caused by Multus getting an unauthorized error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1516407486423240704
- others seem to be hitting an EOF response from Multus: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1040/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway/1516411295291674624

Reassigning to multus.

Updated link: https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=[sig-network]%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network The test passed 80.4% of the time in the last week (82% the prior week). It still looks strongly correlated with AWS OVN upgrades.

Some quick notes from my 15-minute investigation: what stands out to me on an initial look is that the first five jobs [0-4] I looked at for 4.11->4.11 that had this failure were all from "guard" pods (etcd and kube-scheduler-controller guard pods), and the times since the "last deletion" of the pods were 3k+ seconds, roughly an hour. FYI: I had a bug like this before where the guard pods were not adhering to a node cordon and were being restarted on the node even though it was drained. Then, during an upgrade, when the node was rebooted, that pod would be started right away, before the network was ready, and we'd see a failure like this. But those times were closer to 5 minutes; with a one-hour time frame here, it is probably something different. Also, just to double-check, I did look at the 4.10->4.11 upgrade job [5], and it's clear that this is failing quite a bit more in 4.11->4.11, so I agree with the regression idea for 4.11. Also, I never saw a "guard" pod with this failure in the few runs I checked for 4.10->4.11.

[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518765868308238336
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518476445913976832
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518262542986645504
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517867735122448384
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517749174576091136
[5] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade&show-stale-tests=

*** Bug 2085095 has been marked as a duplicate of this bug. ***

This appears to be the number-one failing test on GCP upgrade jobs: https://datastudio.google.com/s/vNul4cUKGEA Let's get an update. You can choose a failing run from https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade&show-stale-tests=

*** Bug 2090389 has been marked as a duplicate of this bug. ***

The originally proposed fix for this was https://github.com/openshift/cluster-network-operator/pull/1462 (which has since been reverted). That has been replaced by https://github.com/openshift/cluster-network-operator/pull/1472. The root cause of this BZ was that the binary was not being moved into place atomically; it was being copied directly into place, which could cause an invocation of a partially written binary (hence the empty error message or garbage output). The replacement PR exists because using the same name for the temporary directory could cause timing issues.
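To make that failure mode concrete, here is a minimal Go sketch of the atomic-install pattern described above: write the binary to a uniquely named temporary file on the same filesystem as the destination, then rename it into place. The paths and the helper name are illustrative assumptions, not the actual cluster-network-operator code from the PRs above.

```go
// install_cni.go: illustrative sketch only; NOT the cluster-network-operator code.
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// installBinaryAtomically copies src into destDir/name without ever exposing
// a partially written file at the final path.
func installBinaryAtomically(src, destDir, name string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	// Write to a uniquely named temp file inside destDir itself so the final
	// rename stays on one filesystem and therefore remains atomic.
	tmp, err := os.CreateTemp(destDir, name+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after a successful rename

	if _, err := io.Copy(tmp, in); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// The atomic step: readers see either the old binary or the complete new
	// one, never a partial copy that would fail with empty or garbage output.
	return os.Rename(tmp.Name(), filepath.Join(destDir, name))
}

func main() {
	// Hypothetical source and destination; /opt/cni/bin is where CNI binaries live on the node.
	if err := installBinaryAtomically("/tmp/multus", "/opt/cni/bin", "multus"); err != nil {
		fmt.Fprintln(os.Stderr, "install failed:", err)
		os.Exit(1)
	}
}
```

Because the rename replaces the destination atomically, CRI-O (or anything else invoking the CNI plugin) sees either the old binary or the complete new one, and because each invocation uses a unique temporary name, concurrent installs cannot collide the way a fixed-name temporary directory could.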
Checking several failed cases in https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+adding+pod+to+network&maxAge=12h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job, I cannot find any failures with no message after "failed (add)".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

*** Bug 2090547 has been marked as a duplicate of this bug. ***

@nmanos We are hitting the same issue with OVN 4.11 clusters; the alertmanager pod is stuck in the ContainerCreating state:

- https://bugzilla.redhat.com/show_bug.cgi?id=2142513
- https://bugzilla.redhat.com/show_bug.cgi?id=2142461
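For node-side triage of follow-up reports like these, here is a minimal, hypothetical sketch of "getting a look at it from the node side" as requested earlier in this bug: dump the libcni result cache that ocicni consults. The /var/lib/cni/cache/results location and the JSON keys (networkName, containerId) are assumptions based on libcni defaults, not details captured in this bug.

```go
// dump_cni_cache.go: hypothetical node-side triage helper; the cache path and
// field names below are assumptions based on libcni defaults.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	cacheDir := "/var/lib/cni/cache/results" // assumed libcni default cache location
	entries, err := os.ReadDir(cacheDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read cache dir:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(cacheDir, e.Name()))
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", e.Name(), err)
			continue
		}
		// Decode into a generic map so the sketch still works if the exact
		// cache schema differs from what is assumed here.
		var cached map[string]interface{}
		if err := json.Unmarshal(data, &cached); err != nil {
			fmt.Printf("%s: not JSON (%v)\n", e.Name(), err)
			continue
		}
		// An empty or missing networkName would line up with the
		// `CNI network "" not found` error quoted near the top of this bug.
		fmt.Printf("%s: networkName=%v containerId=%v\n",
			e.Name(), cached["networkName"], cached["containerId"])
	}
}
```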