Bug 2073452 - [sig-network] pods should successfully create sandboxes by other - failed (add)
Summary: [sig-network] pods should successfully create sandboxes by other - failed (add)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Duplicates: 2085095 2090389 2090547
Depends On:
Blocks:
 
Reported: 2022-04-08 14:10 UTC by Devan Goodwin
Modified: 2022-11-15 17:48 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:44 UTC
Target Upstream Version:
Embargoed:


Attachments
sippy snapshot (583.35 KB, image/png)
2022-04-08 14:10 UTC, Devan Goodwin


Links
Github openshift cluster-network-operator pull 1462 (open): Bug 2073452: Copying CNI binaries should be an atomic operation. Last updated 2022-05-26 21:25:48 UTC
Github openshift cluster-network-operator pull 1472 (open): Bug 2073452: Copying CNI binaries should be an atomic operation. Last updated 2022-06-01 19:17:50 UTC
Red Hat Product Errata RHSA-2022:5069. Last updated 2022-08-10 11:05:58 UTC

Description Devan Goodwin 2022-04-08 14:10:39 UTC
Created attachment 1871461 [details]
sippy snapshot

[sig-network] pods should successfully create sandboxes by other

This test has shown an alarming pattern of failure in a batch of 10 AWS jobs; see
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20other for the latest data. A screenshot of the current state is attached.

In the screenshot you will see a dip on April 4 with an immediate recovery, which can be ignored; that was the RHCOS bump revert.

You'll see a noticeable dip today.

This aggregated job shows hits for this failure in 9 of the 10 jobs in the batch:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1512265204111511552

To pinpoint a specific instance, we'll choose the first one, which shows the following errors:

: [sig-network] pods should successfully create sandboxes by other expand_less 	0s
{  7 failures to create the sandbox

ns/openshift-etcd pod/revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal node/ip-10-0-151-27.us-west-2.compute.internal - 251.54 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal_openshift-etcd_baf9884e-7dad-46dd-870c-192a21bac2f2_0(d1cdcc7ae0f07ad1c8c380e09f06e54f3f4432adf0a7b25b4a4c3fa455ec174f): error adding pod openshift-etcd_revision-pruner-9-ip-10-0-151-27.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-etcd pod/revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal node/ip-10-0-159-138.us-west-2.compute.internal - 260.12 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal_openshift-etcd_62670171-bf24-4e59-a86d-1b8034847bd8_0(8bd07be233f5baa22354e8ced8bbc62a5be1e7362c080245a4dc69baefa16909): error adding pod openshift-etcd_revision-pruner-9-ip-10-0-159-138.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-multus pod/multus-admission-controller-87sr2 node/ip-10-0-151-27.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-87sr2_openshift-multus_abebeb53-c696-49e5-ac18-16f2ead4a950_0(336bdce2e06edf17425ec3061cda71d41b0aa0d3b8b6851c9a78a6b33e97c437): error adding pod openshift-multus_multus-admission-controller-87sr2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-controller-manager pod/controller-manager-t87w2 node/ip-10-0-151-27.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-t87w2_openshift-controller-manager_d275d9d2-e297-4cb2-a470-079291b61f4d_0(6df7ba35a9ac8ecbe75855f4e3e7077ac707f7fee226ed722f81b218d61ffd20): error adding pod openshift-controller-manager_controller-manager-t87w2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-apiserver pod/apiserver-8664865b6d-fg9c7 node/ip-10-0-222-126.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-8664865b6d-fg9c7_openshift-apiserver_96bbf479-06de-40d7-9b00-e4b91c0038b3_0(779f7fd93897f608e9f5c2f0624b8ec3c6187ba9d29895459b80794c3587a176): No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
ns/openshift-network-diagnostics pod/network-check-target-9t8gg node/ip-10-0-159-138.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_network-check-target-9t8gg_openshift-network-diagnostics_4a0bd6a2-4bba-4235-a787-de0be6ae1a1d_0(965de1ebf8dd5743f974f5dda6f643a7e53b539bea0cd8ed7b37e48c0e6a61c9): CNI network "" not found
ns/openshift-e2e-loki pod/loki-promtail-g4b2q node/ip-10-0-159-138.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_loki-promtail-g4b2q_openshift-e2e-loki_10b2061d-5254-4aec-8b34-da9c34cb5228_0(a152194b1fc5cb21efb4d06e5195c2c05260431f4641b3482434d47331b577dd): error adding pod openshift-e2e-loki_loki-promtail-g4b2q to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): }


This "failed (add)" appears in virtually every AWS job in that batch of 10. TRT believes this is very unusual for a test normally at 99% pass rate on AWS.

We do not see any relevant changes in the payload; it is possible something slipped in recently that triggered this and got past aggregation.

Comment 1 Devan Goodwin 2022-04-11 12:11:52 UTC
The failure seems to have resolved itself, but nonetheless we would really love some idea of what happened here.

Comment 3 Douglas Smith 2022-04-13 13:08:41 UTC
> CNI network "" not found

This comes from ocicni, and I believe I've seen this before when there have been problems with the libcni cache. Any chance we can get a look at it from the node side? This happens before Multus is invoked.

in ocicni: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L502
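
For illustration only (this is not the ocicni source; the map and names below are hypothetical), a minimal Go sketch of how a lookup keyed by network name produces the exact message seen in the logs, CNI network "" not found, when the cached attachment record comes back with an empty network name:

package main

import "fmt"

// networks stands in for the runtime's view of configured CNI networks,
// keyed by name (hypothetical content).
var networks = map[string]string{
	"multus-cni-network": "/etc/kubernetes/cni/net.d/00-multus.conf",
}

// getNetwork mirrors the failing lookup: if the cache hands back an empty
// name for a pod's attachment, the map lookup misses and the error message
// embeds the empty string.
func getNetwork(name string) (string, error) {
	conf, ok := networks[name]
	if !ok {
		return "", fmt.Errorf("CNI network %q not found", name)
	}
	return conf, nil
}

func main() {
	// Simulate a cache entry whose network name was lost or never written.
	if _, err := getNetwork(""); err != nil {
		fmt.Println(err) // prints: CNI network "" not found
	}
}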

Comment 5 Ken Zhang 2022-04-14 13:24:01 UTC
Changed severity to high since this is failing the nightly payload and affecting the org delivery.

Comment 6 Ken Zhang 2022-04-18 15:51:16 UTC
We separated out this specific test case. Here is the distribution of the error: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests-upgrade[…]y%20create%20sandboxes%20by%20adding%20pod%20to%20network. Interestingly, it is affecting AWS more than other platforms.

Overall it is found in 1.48% of runs (7.59% of failures) across 24066 total runs and 2518 jobs (19.50% failed), based on this search result: https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+adding+pod+to+network&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 7 Peter Hunt 2022-04-19 17:13:41 UTC
I have looked at the new class of issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=2073452#c6 and I've determined it's not a node issue. They can be broken into two buckets:
- some seem to be caused by Multus getting an unauthorized error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1516407486423240704
- others seem to be hit by an EOF response from Multus: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1040/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway/1516411295291674624

reassigning to multus

Comment 9 Devan Goodwin 2022-04-26 16:55:46 UTC
Updated link: https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=[sig-network]%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network

Test passes 80.4% of the time in last week. (82% prior week)

Still looks strongly correlated to aws ovn upgrades.

Comment 10 jamo luhrsen 2022-04-26 18:01:14 UTC
some quick notes from my 15m investigation...

What stands out to me on an initial look is that the first five jobs [0-4] I looked at for 4.11->4.11 that had this failure
all involved "guard" pods (etcd and kube-scheduler-controller guard pods), and the times seen from the "last deletion" of
the pods were 3000+ seconds, roughly an hour.

FYI:
I had a bug like this before where the guard pods were not adhering to a node cordon and were being restarted on the node
even when it was drained. Then during an upgrade, when the node was rebooted, that pod would be started right away before the
network was ready and we'd see a failure like this, but those times were closer to 5 minutes.

But with a one-hour time frame here, it is probably something different.

Also, just to double-check, I did look at the 4.10->4.11 upgrade job [5], and it's clear that this is failing quite a bit more
in 4.11->4.11, so I agree with the regression idea for 4.11. I also never saw a "guard" pod with this failure in the
few 4.10->4.11 runs I checked.

[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518765868308238336
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518476445913976832
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1518262542986645504
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517867735122448384
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1517749174576091136
[5] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade&show-stale-tests=

Comment 12 Douglas Smith 2022-05-12 19:25:51 UTC
*** Bug 2085095 has been marked as a duplicate of this bug. ***

Comment 13 David Eads 2022-05-19 13:22:07 UTC
This appears to be the number one failing test on GCP upgrade jobs: https://datastudio.google.com/s/vNul4cUKGEA  Let's get an update.

You can choose a failing run from https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade&show-stale-tests=

Comment 15 Douglas Smith 2022-05-31 13:58:37 UTC
*** Bug 2090389 has been marked as a duplicate of this bug. ***

Comment 16 Douglas Smith 2022-06-01 19:17:37 UTC
The originally proposed fix for this was https://github.com/openshift/cluster-network-operator/pull/1462 (which has since been reverted).

That has been replaced by: https://github.com/openshift/cluster-network-operator/pull/1472 

The root cause of this BZ was that the file was not atomically moved into place; it was copied directly into place, which could cause an invocation of a partially written binary (hence the empty error message or garbage output).

The replacement PR was needed because reusing the same name for the temporary directory could cause timing issues.
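
For readers unfamiliar with the pattern, here is a minimal Go sketch of the rename-into-place idea described above (paths and names are hypothetical, and this is not the actual cluster-network-operator code, which stages the copy in a temporary directory): write the data to a uniquely named temporary file, then rename it into place, so a caller can never execute a partially written binary.

package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// installCNIBinary copies src into destDir/name atomically: the data is
// written to a uniquely named temp file first, then renamed into place.
func installCNIBinary(src, destDir, name string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	// CreateTemp picks a unique temporary name, avoiding the timing issues
	// a fixed name could cause (the reason the original PR was replaced).
	tmp, err := os.CreateTemp(destDir, name+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before the rename

	if _, err := io.Copy(tmp, in); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// rename(2) within one filesystem is atomic: anyone invoking the binary
	// sees either the old file or the complete new one, never a partial copy.
	return os.Rename(tmp.Name(), filepath.Join(destDir, name))
}

func main() {
	// Hypothetical usage; the source and destination paths are placeholders.
	if err := installCNIBinary("/usr/src/plugins/bin/multus", "/opt/cni/bin", "multus"); err != nil {
		log.Fatal(err)
	}
}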

Comment 22 errata-xmlrpc 2022-08-10 11:05:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 25 Douglas Smith 2022-11-09 14:23:21 UTC
*** Bug 2090547 has been marked as a duplicate of this bug. ***

Comment 26 Rewant 2022-11-15 07:36:15 UTC
@nmanos we are hitting the same issue with OVN 4.11 clusters; the alertmanager pod is stuck in ContainerCreating state:
- https://bugzilla.redhat.com/show_bug.cgi?id=2142513
- https://bugzilla.redhat.com/show_bug.cgi?id=2142461

