Created attachment 1871461 [details]
sippy snapshot

[sig-network] pods should successfully create sandboxes by other

This test has shown an alarming pattern of failure in a batch of 10 AWS jobs. See https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20other for the latest data; the attached screenshot shows the current state. In the screenshot you will see a dip on April 4 with an immediate recovery, which can be ignored (that was the RHCOS bump revert). You'll also see a noticeable dip today.

This aggregated job shows hits for this failure in 9 of the 10 jobs in the batch: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1512265204111511552

To pinpoint a specific instance, take the first one, which shows the following errors:

: [sig-network] pods should successfully create sandboxes by other    0s
{ 7 failures to create the sandbox

ns/openshift-etcd pod/revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal node/ip-10-0-151-27.us-west-2.compute.internal - 251.54 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal_openshift-etcd_baf9884e-7dad-46dd-870c-192a21bac2f2_0(d1cdcc7ae0f07ad1c8c380e09f06e54f3f4432adf0a7b25b4a4c3fa455ec174f): error adding pod openshift-etcd_revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add):

ns/openshift-etcd pod/revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal node/ip-10-0-159-138.us-west-2.compute.internal - 260.12 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal_openshift-etcd_62670171-bf24-4e59-a86d-1b8034847bd8_0(8bd07be233f5baa22354e8ced8bbc62a5be1e7362c080245a4dc69baefa16909): error adding pod openshift-etcd_revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add):

ns/openshift-multus pod/multus-admission-controller-87sr2 node/ip-10-0-151-27.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-87sr2_openshift-multus_abebeb53-c696-49e5-ac18-16f2ead4a950_0(336bdce2e06edf17425ec3061cda71d41b0aa0d3b8b6851c9a78a6b33e97c437): error adding pod openshift-multus_multus-admission-controller-87sr2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add):

ns/openshift-controller-manager pod/controller-manager-t87w2 node/ip-10-0-151-27.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-t87w2_openshift-controller-manager_d275d9d2-e297-4cb2-a470-079291b61f4d_0(6df7ba35a9ac8ecbe75855f4e3e7077ac707f7fee226ed722f81b218d61ffd20): error adding pod openshift-controller-manager_controller-manager-t87w2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add):

ns/openshift-apiserver pod/apiserver-8664865b6d-fg9c7 node/ip-10-0-222-126.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-8664865b6d-fg9c7_openshift-apiserver_96bbf479-06de-40d7-9b00-e4b91c0038b3_0(779f7fd93897f608e9f5c2f0624b8ec3c6187ba9d29895459b80794c3587a176): No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

ns/openshift-network-diagnostics pod/network-check-target-9t8gg node/ip-10-0-159-138.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_network-check-target-9t8gg_openshift-network-diagnostics_4a0bd6a2-4bba-4235-a787-de0be6ae1a1d_0(965de1ebf8dd5743f974f5dda6f643a7e53b539bea0cd8ed7b37e48c0e6a61c9): CNI network "" not found

ns/openshift-e2e-loki pod/loki-promtail-g4b2q node/ip-10-0-159-138.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_loki-promtail-g4b2q_openshift-e2e-loki_10b2061d-5254-4aec-8b34-da9c34cb5228_0(a152194b1fc5cb21efb4d06e5195c2c05260431f4641b3482434d47331b577dd): error adding pod openshift-e2e-loki_loki-promtail-g4b2q to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add):

}

This "failed (add)" appears in virtually every AWS job in that batch of 10. TRT believes this is very unusual for a test normally at a 99% pass rate on AWS. We do not see any relevant changes in the payload; it is possible something slipped in recently that triggered this and got past aggregation.
The failure seems to have resolved itself, but nonetheless we would really love some idea of what happened here.
> CNI network "" not found

This comes from ocicni, and I believe I've seen this before when there have been problems with the libcni cache. Any chance we can get a look at it from the node side? This happens before Multus is invoked. In ocicni: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L502
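For reference, here is a paraphrased, self-contained sketch of the ocicni lookup that produces this message (not the verbatim upstream code; type names are simplified). If the cached sandbox metadata hands back an empty network name, the map lookup is for "" and fails exactly as seen in the test output:

package main

import (
	"fmt"
	"sync"
)

type cniNetwork struct{ name string }

type cniNetworkPlugin struct {
	sync.RWMutex
	// networks is populated from the CNI conf dir (e.g. /etc/kubernetes/cni/net.d).
	networks map[string]*cniNetwork
}

func (p *cniNetworkPlugin) getNetwork(name string) (*cniNetwork, error) {
	p.RLock()
	defer p.RUnlock()
	net, ok := p.networks[name]
	if !ok {
		// A stale or corrupted libcni cache entry can yield an empty name here,
		// which is what makes the message read: CNI network "" not found
		return nil, fmt.Errorf("CNI network %q not found", name)
	}
	return net, nil
}

func main() {
	p := &cniNetworkPlugin{networks: map[string]*cniNetwork{
		"multus-cni-network": {name: "multus-cni-network"},
	}}
	if _, err := p.getNetwork(""); err != nil {
		fmt.Println(err) // CNI network "" not found
	}
}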
This is happening again: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514514596856074240

TRT thinks we should reconsider the priority.
Changed severity to high since this is failing the nightly payload and affecting delivery across the org.
We separated out this specific test case. Here is the distribution of the error: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests-upgrade[…]y%20create%20sandboxes%20by%20adding%20pod%20to%20network. Interestingly, it is affecting AWS more than other platforms.

Overall it is found in 1.48% of runs (7.59% of failures) across 24066 total runs and 2518 jobs (19.50% failed), based on this search result: https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+adding+pod+to+network&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
I have looked at the new class of issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=2073452#c6 and I've determined it's not a node issue. They look like they can be broken up into two buckets:

- Some seem to be caused by Multus getting an unauthorized error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1516407486423240704
- Others seem to be hit by an EOF response from Multus: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1040/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway/1516411295291674624

Reassigning to Multus.
Updated link: https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=[sig-network]%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network

The test passed 80.4% of the time in the last week (82% the prior week). It still looks strongly correlated with AWS OVN upgrades.
Some quick notes from my 15-minute investigation...

What stands out to me on an initial look is that the first five jobs [0-4] I checked for 4.11->4.11 that had this failure all involved "guard" pods (etcd and kube-scheduler-controller guard pods), and the times seen since the "last deletion" of the pods were 3k+ seconds, i.e. roughly an hour.

FYI: I had a bug like this before where the guard pods were not adhering to a node cordon and were being restarted on the node even when it was drained. Then, during an upgrade, when the node was rebooted that pod would be started right away before the network was ready and we'd see a failure like this. But those times were closer to 5 minutes; with a one-hour time frame here, it is probably something different.

Also, just to double check, I did look at the 4.10->4.11 upgrade job [5], and it's clear that this is failing quite a bit more in 4.11->4.11, so I agree with the regression idea for 4.11. I also never saw a "guard" pod with this failure in the few 4.10->4.11 runs I checked.

[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518765868308238336
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518476445913976832
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518262542986645504
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517867735122448384
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517749174576091136
[5] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade&show-stale-tests=
*** Bug 2085095 has been marked as a duplicate of this bug. ***
This appears to be the number one failing test on GCP upgrade jobs: https://datastudio.google.com/s/vNul4cUKGEA

Let's get an update. You can choose a failing run from https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade&show-stale-tests=
*** Bug 2090389 has been marked as a duplicate of this bug. ***
The originally proposed fix for this was https://github.com/openshift/cluster-network-operator/pull/1462 (which has since been reverted). It has been replaced by https://github.com/openshift/cluster-network-operator/pull/1472.

The root cause of the BZ was that the file was not being moved into place atomically; it was being copied directly into place, which could lead to an invocation of a partially written binary (hence the empty error message or garbage output). The replacement PR exists because reusing the same name for the temporary directory could cause timing issues.
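For illustration, here is a minimal sketch of the two approaches described above (not the actual cluster-network-operator code; the paths in the trailing comment are hypothetical). Copying directly into place lets a consumer exec a half-written plugin binary, while writing to a uniquely named temp file in the destination directory and renaming it over the target is atomic on the same filesystem and avoids two writers racing on a shared temporary name:

package main

import (
	"io"
	"os"
	"path/filepath"
)

// copyInPlace writes directly to dst. A reader (e.g. the runtime exec'ing a CNI
// plugin) can observe a partially written binary while the copy is in flight.
func copyInPlace(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o755)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}

// atomicInstall writes to a uniquely named temp file in dst's directory, then
// renames it over dst. rename(2) is atomic on the same filesystem, so readers
// only ever see the old file or the complete new one; the unique temp name also
// keeps two concurrent writers from clobbering each other's temporary file.
func atomicInstall(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	tmp, err := os.CreateTemp(filepath.Dir(dst), filepath.Base(dst)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename
	if _, err := io.Copy(tmp, in); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dst)
}

func main() {
	_ = copyInPlace   // shown only for contrast with the atomic variant
	_ = atomicInstall // e.g. atomicInstall("/usr/src/plugin/bin/multus", "/host/opt/cni/bin/multus")
}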
Checking several failed cases in https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+adding+pod+to+network&maxAge=12h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job, I cannot find any failures with an empty message after "failed (add):".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
*** Bug 2090547 has been marked as a duplicate of this bug. ***
@nmanos we are hitting the same issue with OVN 4.11 clusters; the alertmanager pod is stuck in ContainerCreating state:
- https://bugzilla.redhat.com/show_bug.cgi?id=2142513
- https://bugzilla.redhat.com/show_bug.cgi?id=2142461