Bug 1927264

Summary: FailedCreatePodSandBox due to multus inability to reach apiserver
Product: OpenShift Container Platform Reporter: Standa Laznicka <slaznick>
Component: Networking Assignee: Douglas Smith <dosmith>
Networking sub component: multus QA Contact: Ying Wang <yingwang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: anbhat, ccoleman, dcbw, jluhrsen, pmuller, surya, wking
Version: 4.7 Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: tag-ci
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
[sig-network] pods should successfully create sandboxes by other
[sig-network] pods should successfully create sandboxes by getting pod
Last Closed: 2021-07-27 22:43:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1943566    

Description Standa Laznicka 2021-02-10 12:15:30 UTC
Description of problem:

The test named "[sig-network] pods should successfully create sandboxes by .*" fails very commonly throughout the whole CI. While I understand that this could be happening for a number of reasons, a single person assigned to watch the whole CI cannot triage every one of these failed tests individually.

Refer to the following search in order to see all the related test failures:

https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+by+&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):


How reproducible:
It is hit very often in CI.


Steps to Reproduce:
1. Install a cluster in a CI-observed environment

Actual results:
CI is red

Expected results:
CI is green

Comment 3 Alexander Constantinescu 2021-04-12 15:19:36 UTC
*** Bug 1948066 has been marked as a duplicate of this bug. ***

Comment 4 W. Trevor King 2021-04-12 20:43:10 UTC
Definitely still showing up in CI.  Bug 1948066 shows this as common in periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade [1].  And searching more broadly over CI runs from the past 24h:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other' | grep 'failures match' | sort 
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 11 runs, 100% failed, 36% of failures match = 36% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 12 runs, 58% failed, 86% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 12 runs, 67% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 77% of failures match = 77% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 12 runs, 100% failed, 42% of failures match = 42% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 12 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-aws-workers-rhel7 (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-ovirt (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-serial (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-upi (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-vsphere-upi-serial (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere (all) - 10 runs, 70% failed, 100% of failures match = 70% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.6-e2e-aws (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.7-e2e-aws (all) - 4 runs, 25% failed, 400% of failures match = 100% impact
promote-release-openshift-okd-machine-os-content-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
...
pull-ci-openshift-ovn-kubernetes-release-4.6-e2e-vsphere-ovn (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
rehearse-15394-pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
...
rehearse-17604-pull-ci-operator-framework-operator-marketplace-release-4.9-e2e-aws-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
release-openshift-ocp-installer-e2e-openstack-4.6 (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.6 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.6 (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 10 runs, 60% failed, 117% of failures match = 70% impact
release-openshift-origin-installer-e2e-aws-calico-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-azure-4.6 (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade

Comment 5 W. Trevor King 2021-04-12 20:45:22 UTC
Adding the Upgrades keyword, because this is the main contributor to 4.7 -> 4.8 CI failures.

Comment 7 Dan Williams 2021-04-16 17:06:38 UTC
If we are locking the apiserver to its local etcd, shouldn't it exit fairly quickly if it cannot talk to its local etcd? Then at least the apiserver is NotReady and the cloud LB knows, and won't load balance clients to it anymore during an upgrade.

Comment 8 Dan Williams 2021-04-16 17:20:15 UTC
To clarify more...

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1381803484973633536

1. Multus was asked to start pods

2. Multus is given an apiserver address that corresponds to the AWS ELB that fronts the apiservers (10.0.237.67:6443)

3. Multus was being ELB-ed to an apiserver whose local etcd was down for an extended period of time, perhaps because it was being upgraded, since this was an "aws-upgrade" test


4. Since the apiserver is locked to a local etcd, it was still responding to the ELB, but wasn't able to talk to etcd for 2+ minutes before terminating
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1381803484973633536/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-165-43.ec2.internal_kube-apiserver.log

W0413 04:19:33.601568      16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.165.43:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.165.43:2379: connect: connection refused". Reconnecting...
...
W0413 04:21:42.418346      16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.165.43:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.165.43:2379: connect: connection refused". Reconnecting...
I0413 04:21:43.820619      16 genericapiserver.go:700] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-165-43.ec2.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed
I0413 04:21:43.847255       1 main.go:198] Termination finished with exit code 0
I0413 04:21:43.847347       1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"

5. etcd was told to terminate at 04:19:33
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1381803484973633536/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ip-10-0-165-43.ec2.internal_etcd.log
{"level":"info","ts":"2021-04-13T04:19:33.598Z","caller":"osutil/interrupt_unix.go:63","msg":"received signal; shutting down","signal":"terminated"}

6. etcd comes back 3.5 minutes later:
{"level":"info","ts":"2021-04-13T04:23:14.379Z","caller":"etcdmain/etcd.go:134","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}

7. The apiserver error was reflected back to Multus, which caused pod start errors (see the sketch after this list):
ns/openshift-marketplace pod/redhat-marketplace-wzw4r node/ip-10-0-145-105.ec2.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_redhat-marketplace-wzw4r_openshift-marketplace_cd4d2af3-099f-4a1c-afc9-eb3c162c9b1c_0(4484bc4201c2dbfd84d6e1f6afd5d079e7efe007237ad09096016f2f7f017c1c): Multus: [openshift-marketplace/redhat-marketplace-wzw4r]: error getting pod: Get "https://[api-int.ci-op-zpti5x55-1158e.origin-ci-int-aws.dev.rhcloud.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-marketplace-wzw4r?timeout=1m0s": dial tcp 10.0.237.67:6443: connect: connection refused

8. This trips the synthetic upgrade test asserting that no pod should ever fail to start during an upgrade
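
For illustration, here is a minimal sketch of the kind of per-sandbox pod lookup described in step 7, written against client-go. It is not Multus's actual code; the kubeconfig path is a placeholder, and the namespace/pod name are taken from the event above. A single "connection refused" from whichever apiserver the load balancer picks surfaces directly as the FailedCreatePodSandBox error.

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// getPod performs the per-sandbox pod lookup: one GET against whatever
// apiserver the load balancer routes to, with a 1-minute request timeout
// (matching the ?timeout=1m0s in the error above).
func getPod(kubeconfig, namespace, name string) error {
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        return err
    }
    cfg.Timeout = time.Minute

    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }

    // If the LB routes this to an apiserver that is terminating, the dial
    // fails with "connection refused" and the CNI ADD (sandbox creation) fails.
    _, err = client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
    return err
}

func main() {
    // Placeholder kubeconfig path; namespace/pod are from the event above.
    if err := getPod("/etc/kubernetes/kubeconfig", "openshift-marketplace", "redhat-marketplace-wzw4r"); err != nil {
        fmt.Println("error getting pod:", err)
    }
}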

Comment 9 Clayton Coleman 2021-04-30 02:26:07 UTC
Why is the apiserver locked to a local etcd?  That's not the expected behavior; it should have balanced over. That would be a fairly serious error in that case.

Note that "dial tcp 10.0.237.67:6443: connect: connection refused" is also AWS ELB issues (they don't work right now the way we'd expect https://bugzilla.redhat.com/show_bug.cgi?id=1943804)

Multus should retry a reasonable number of times when it is unable to connect to the apiserver (connection refused, most likely), which Doug and I talked about.  I don't know that we have a bug for that right now.  Anything in the critical path of starting pods should absorb some minor disruption to apiserver connectivity (specifically the inability to connect to an apiserver) for some minimal amount of time (I don't know what is too long, but 10-20s may be the limit).  We can route any fixes there via another bug ("multus should tolerate minimal disruption of apiserver connectivity in order to start pods more smoothly").

In the meantime I'm specifically flagging this condition as a known issue rather than causing the test to fail (it is a big source of upgrade failures).
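
For reference, a rough sketch of the bounded retry described above, assuming a ~20-second budget and treating only "connection refused" and "i/o timeout" as transient. The budget, backoff values, and helper name are illustrative, not a decided design.

package main

import (
    "errors"
    "fmt"
    "strings"
    "time"
)

// retryTransient retries fn for up to budget, doubling the delay between
// attempts, but only while the error looks like transient apiserver
// unavailability (connection refused / i/o timeout).
func retryTransient(budget time.Duration, fn func() error) error {
    deadline := time.Now().Add(budget)
    delay := 500 * time.Millisecond
    for {
        err := fn()
        if err == nil {
            return nil
        }
        transient := strings.Contains(err.Error(), "connection refused") ||
            strings.Contains(err.Error(), "i/o timeout")
        if !transient || time.Now().Add(delay).After(deadline) {
            return err
        }
        time.Sleep(delay)
        if delay < 4*time.Second {
            delay *= 2
        }
    }
}

func main() {
    err := retryTransient(20*time.Second, func() error {
        // Stand-in for the pod GET; in this bug it fails through the LB.
        return errors.New("dial tcp 10.0.237.67:6443: connect: connection refused")
    })
    fmt.Println("gave up after budget:", err)
}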

Comment 10 Clayton Coleman 2021-04-30 02:32:43 UTC
Wait... does this mean that if we get an apiserver LB disruption, no pod-network pods start?  So a node partitioned from the apiserver will begin degrading immediately because it can't restart containers?

That's a pretty serious failure mode - we may need to up the level of attention we place on mitigating this (CNI as an interface... has problems).  Can we do better without completely redesigning multus to avoid needing to make a pod retrieval call on every single sandbox recreate?

Comment 11 Clayton Coleman 2021-04-30 02:33:27 UTC
Mitigating test impact in https://github.com/openshift/origin/pull/26115

Comment 12 Dan Williams 2021-04-30 17:02:34 UTC
(In reply to Clayton Coleman from comment #10)
> Wait... does this mean that if we get an apiserver LB disruption no pod
> network pods start?  So a partitioned node from apiserver will begin
> degrading immediately because it can't restart containers?
> 
> That's a pretty serious failure mode - we may need to up the level of
> attention we place on mitigating this (CNI as an interface... has problems).
> Can we do better without completely redesigning multus to avoid needing to
> make a pod retrieval call on every single sandbox recreate?

Do the kubelets go through the LB? If not, maybe multus should use whatever apiserver endpoint the kubelet does.

If kubelet does go through the LB, obviously there's still a window for failure. One option could be for Multus to detect apiserver timeout problems and just start the pod with the default network anyway, and log a big fat warning. At least you'll get pods coming up; but it would be hard to make the warning visible since Multus couldn't post the event to the Pod object.

Clearly there's more we can do here, and we certainly should have some chaos testing around killing apiservers behind our manually-managed apiserver LB.
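
For illustration only, a hypothetical sketch of the fallback described above: if the pod-annotation lookup fails with a transient apiserver error, attach only the default network and log a warning (which, as noted, cannot be posted as a pod event). The function name and the degradation logic are invented for the sketch; only the k8s.v1.cni.cncf.io/networks annotation key comes from Multus.

package main

import (
    "errors"
    "fmt"
    "strings"
)

// networksFor normally resolves the pod's additional networks from its
// annotations; on a transient apiserver failure it degrades to the default
// network only, as suggested above.
func networksFor(getAnnotations func() (map[string]string, error)) (defaultOnly bool, additional []string, err error) {
    ann, err := getAnnotations()
    if err != nil {
        if strings.Contains(err.Error(), "i/o timeout") || strings.Contains(err.Error(), "connection refused") {
            // Big fat warning: while the apiserver is unreachable it cannot be
            // posted as a pod event, so it only lands in the node logs.
            fmt.Println("WARNING: apiserver unreachable; attaching default network only")
            return true, nil, nil
        }
        return false, nil, err
    }
    if nets, ok := ann["k8s.v1.cni.cncf.io/networks"]; ok && nets != "" {
        additional = strings.Split(nets, ",")
    }
    return false, additional, nil
}

func main() {
    defaultOnly, _, _ := networksFor(func() (map[string]string, error) {
        return nil, errors.New("dial tcp 10.0.0.4:6443: i/o timeout")
    })
    fmt.Println("default network only:", defaultOnly)
}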

Comment 13 Petr Muller 2021-05-12 12:16:24 UTC
Are these failures (in non-upgrade jobs) the same thing? If so, should we extend https://github.com/openshift/origin/pull/26115 to also waive `i/o timeout` in addition to `connection refused`?

ns/e2e-test-oc-debug-ccgnp pod/local-busybox1-1-deploy node/ci-op-g52jtpn7-f23e1-zpkq7-worker-b-2q8ms - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_local-busybox1-1-deploy_e2e-test-oc-debug-ccgnp_e5bd1990-2b48-4ee1-928d-f57e50934833_0(16cc5491deae08f2096e00dcbe3ef5055fe6a487d652c04b84e9356e01a1dbd4): Multus: [e2e-test-oc-debug-ccgnp/local-busybox1-1-deploy]: error getting pod: Get "https://[api-int.ci-op-g52jtpn7-f23e1.gcp-2.ci.openshift.org]:6443/api/v1/namespaces/e2e-test-oc-debug-ccgnp/pods/local-busybox1-1-deploy?timeout=1m0s": dial tcp 10.0.0.4:6443: i/o timeout

Occurrences:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1392413115529826304
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1392314309614243840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1392099923645698048
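
A hypothetical sketch of the kind of message matching such a waiver could use, covering both failure modes; the actual openshift/origin code in those PRs may be structured differently, and the sample message below is abbreviated.

package main

import (
    "fmt"
    "regexp"
)

// knownDisruption matches FailedCreatePodSandBox messages caused by transient
// apiserver unavailability, covering both failure modes seen in this bug.
var knownDisruption = regexp.MustCompile(
    `error getting pod: .*dial tcp .*:6443: (connect: connection refused|i/o timeout)`)

func main() {
    // Abbreviated version of the event message quoted above.
    msg := `Multus: [e2e-test-oc-debug-ccgnp/local-busybox1-1-deploy]: error getting pod: ` +
        `Get "https://api-int.example:6443/api/v1/namespaces/e2e-test-oc-debug-ccgnp/pods/local-busybox1-1-deploy?timeout=1m0s": ` +
        `dial tcp 10.0.0.4:6443: i/o timeout`
    fmt.Println("waived as known disruption:", knownDisruption.MatchString(msg))
}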

Comment 16 jamo luhrsen 2021-05-18 17:25:19 UTC
We are still seeing these failures in every upgrade job here:
  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade

The PR to origin did not help, if that was what we were hoping.

Comment 20 W. Trevor King 2021-05-20 19:54:30 UTC
Dropping in some test-case names for Sippy, based on bug 1948066 being closed as a dup of this one.

Comment 21 Suresh Kolichala 2021-05-21 14:15:11 UTC
I think the etcd errors are a red herring. In etcd v3.4 (which is what OCP 4.7 uses), the etcd client is built on gRPC 1.14, which keeps active subconnections to all endpoints and applies a round-robin balancing algorithm [1]. In this case, when the local etcd is rebooting due to the upgrade, the subconnections to localhost and to that IP address are lost, and hence a bunch of error messages are generated.

But that doesn't mean the other two etcd servers are not serving requests. Since there is no evidence in the logs that etcd lost quorum at any point during the upgrade, and there is a constant stream of successful requests served by the active servers the whole time, I believe etcd is not the problem at all.

The problem with etcd client-side logging is that while it logs failures on subconnections, it doesn't log the successful connections to the other servers, which can be misleading.

If we look at the log snippet from the comments above, the first two error messages, about etcd on localhost and on 10.0.165.43 not being reachable, are benign messages about those subconnections. There is no failed transaction due to these two error messages. The next line in the log, about the apiserver terminating, comes a whole 1.4 seconds later.

2021-04-13T04:21:42.333918580Z W0413 04:21:42.333819      16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://localhost:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:2379: connect: connection refused". Reconnecting...
2021-04-13T04:21:42.418454979Z W0413 04:21:42.418346      16 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.165.43:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.165.43:2379: connect: connection refused". Reconnecting...
2021-04-13T04:21:43.820738799Z I0413 04:21:43.820619      16 genericapiserver.go:700] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-165-43.ec2.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TerminationGracefulTerminationFinished' All pending requests processed
2021-04-13T04:21:43.847329647Z I0413 04:21:43.847255       1 main.go:198] Termination finished with exit code 0


The real problem appears to be the unavailability of the apiserver ELB (10.0.237.67:6443) while attempting to create the sandbox:

Apr 13 04:21:14.315 W ns/openshift-marketplace pod/redhat-marketplace-wzw4r node/ip-10-0-145-105.ec2.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_redhat-marketplace-wzw4r_openshift-marketplace_cd4d2af3-099f-4a1c-afc9-eb3c162c9b1c_0(4484bc4201c2dbfd84d6e1f6afd5d079e7efe007237ad09096016f2f7f017c1c): Multus: [openshift-marketplace/redhat-marketplace-wzw4r]: error getting pod: Get "https://[api-int.ci-op-zpti5x55-1158e.origin-ci-int-aws.dev.rhcloud.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-marketplace-wzw4r?timeout=1m0s": dial tcp 10.0.237.67:6443: connect: connection refused

I do not know the root cause of the unavailability of the API server ELB (the apiservers were terminating?), but it is definitely not due to etcd. The etcd servers were available and serving requests well during the entire upgrade.

tl;dr: the etcd error messages are a red herring. etcd was working fine during the entire period. Transferring the BZ back to Networking.

[1] https://etcd.io/docs/v3.3/learning/client-architecture/
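
A minimal sketch illustrating the client-side balancing described above: the apiserver's etcd client holds subconnections to every member, so a restarting local member only produces transport warnings while requests keep being served by the remaining members. The sketch uses the current clientv3 import path, omits TLS configuration, and the member addresses other than 10.0.165.43 are placeholders.

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // All three members; the balancer keeps subconnections to each and
    // round-robins requests across the healthy ones. Addresses other than
    // 10.0.165.43 are placeholders, and TLS configuration is omitted here.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints: []string{
            "https://10.0.165.43:2379",
            "https://10.0.200.11:2379",
            "https://10.0.210.12:2379",
        },
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        fmt.Println("client setup failed:", err)
        return
    }
    defer cli.Close()

    // While 10.0.165.43 restarts, its subconnection logs "connection refused"
    // (the warnings quoted above), but this request can still be served by one
    // of the remaining members as long as quorum holds.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    if _, err := cli.Get(ctx, "health-check-key"); err != nil {
        fmt.Println("get failed:", err)
    }
}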

Comment 22 W. Trevor King 2021-05-25 22:51:11 UTC
This is causing difficulties in CI, like blocking pause-compute updates from 4.6 to 4.8 before they attempt the 4.7 -> 4.8 leg [1].  If we don't have any warm leads on this, can we skip the test so this bug won't block that sort of thing (and we'll also get a clearer signal for the 4.7 -> 4.8 update jobs mentioned in bug 1948066)?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/15939/rehearse-15939-periodic-ci-openshift-release-master-stable-4.8-upgrade-from-stable-4.6-e2e-aws-upgrade-paused/1395655807638441984

Comment 23 W. Trevor King 2021-05-25 23:05:19 UTC
I'm too slow.  We've had [1] in place for a few weeks now (mentioned in comment 11), and [2] is in flight to add more cases; it just hasn't landed yet.

[1]: https://github.com/openshift/origin/pull/26115
[2]: https://github.com/openshift/origin/pull/26152

Comment 24 Douglas Smith 2021-06-04 17:31:49 UTC
Antonio Ojea advised that a client-go update to Multus may help alleviate this i/o timeout condition.

Comment 26 Ying Wang 2021-06-09 09:23:51 UTC
In CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1401739903280091136, the '[sig-network] pods should successfully create sandboxes by .*' test cases passed.

Comment 30 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438