Bug 2011939

Summary:	[sig-network] pods should successfully create sandboxes by not timing out
Product:	OpenShift Container Platform	Reporter:	Devan Goodwin <dgoodwin>
Component:	Networking	Assignee:	Dan Winship <danw>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	aos-bugs, astoycos, danw, mcambria, sippy, vpickard, wking
Version:	4.10	Keywords:	DeliveryBlocker
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	job=periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade
Last Closed:	2021-11-09 18:30:26 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Devan Goodwin 2021-10-07 19:08:08 UTC

[sig-network] pods should successfully create sandboxes by not timing out

is failing frequently in CI and seems related to Azure, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20not%20timing%20out

This seems to affect about 20% of Azure runs, and we suspect it is related to some other failures we are seeing.

Specific failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1446112986753142784

Discussion here: https://coreos.slack.com/archives/C01CQA76KMX/p1633631570170400

Network team reports this is prior to networking and should be workload related, so assigning to Node component for a first pass.

In the given example, it seems to affect only one master and one kind of image pull.

Comment 1 Dan Winship 2021-10-07 21:02:20 UTC

Doh, I should have caught this before:

> pinging container registry registry.ci.openshift.org: Get "https://registry.ci.openshift.org/v2/": dial tcp 3.210.253.73:443: i/o timeout

"i/o timeout" on an https connection almost always means an MTU problem. ("i/o timeout" as opposed to "connection refused" or "no route to host" means that it successfully *connected*, but then failed to communicate after that point. This tends to happen on https connections when the MTU is wrong and PMTU discovery doesn't work, because the initial TCP handshake goes through but then the packet with the TLS certificate is too big and gets dropped and then both sides are waiting for the other to say something, until one of them times out.)

Anyway, sporadic inexplicable MTU problems out of nowhere are one of Azure's undocumented special features. I don't think we had previously seen them with *host-network* traffic, though...
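
The MTU reasoning above can be illustrated with a minimal Python sketch. It is not taken from this bug or from any OpenShift code; it only models which packets of a simplified HTTPS exchange would fit through a path whose MTU is smaller than the sender assumes while PMTU discovery is broken. The packet sizes and the 1400-byte path MTU are assumed values chosen for illustration, not measurements from the failing job.

```python
# Illustrative model only: the sizes and the path MTU below are assumed
# values, not measurements from the failing Azure job.

ASSUMED_PATH_MTU = 1400  # hypothetical bottleneck MTU on the path, in bytes

# (description, IP packet size in bytes) for a simplified HTTPS exchange
PACKETS = [
    ("TCP SYN", 60),
    ("TCP SYN-ACK", 60),
    ("TCP ACK", 52),
    ("TLS ClientHello", 300),
    ("TLS ServerHello + certificate (full-size segment)", 1500),
]


def delivered(size, path_mtu):
    """Return True if a packet of this size fits through the path.

    With PMTU discovery broken, an oversized packet is silently dropped:
    the sender never sees the ICMP "fragmentation needed" message and so
    never learns that it should send smaller packets.
    """
    return size <= path_mtu


def first_dropped(packets, path_mtu):
    """Return the description of the first packet that is dropped, or None."""
    for description, size in packets:
        if not delivered(size, path_mtu):
            return description
    return None


if __name__ == "__main__":
    for description, size in PACKETS:
        status = "delivered" if delivered(size, ASSUMED_PATH_MTU) else "DROPPED"
        print(f"{description:<50} {size:>5} B  {status}")
```

Under these assumptions the small handshake packets all fit, and the first packet to be lost is the full-size segment carrying the certificate, which is the stall described above.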

drop-icmp.log from master-0:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1446112986753142784/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-sdn_sdn-jwmhc_drop-icmp.log

There are errors in that log, but they seem to be "expected". (We should probably do ">& /dev/null" on the commands that are expected to possibly fail...)

I see that there's a '-j LOG' rule, but I can't find anything logged as a result in master-0's journal... I don't know if that means that we never got any weird ICMP messages from Azure, or just that the logging ends up somewhere else...

(Though I guess that since this is a node-to-cluster-external connection, if there was going to be a spurious ICMP message, it would come from a non-node IP, so it wouldn't hit those rules anyway? Maybe we should log from AZURE_CHECK_ICMP_SOURCE rather than from AZURE_ICMP_ACTION so it logs the messages we *don't* drop too?)

Comment 8 Devan Goodwin 2021-11-09 18:30:26 UTC

No other way to say it: this issue appears to have resolved itself. Using the link in the original report, the test is now passing 98+% of the time. If Dan's and the network team's theory was correct and this was an Azure networking MTU issue, it looks like Azure may have resolved it on their end.

At this point I think we can close, since we no longer have a reproducer. Will reopen if the problem resurfaces.

Comment 9 Devan Goodwin 2021-11-30 12:56:03 UTC

We discovered that testgrid and sippy are both misreporting this test, which runs twice during upgrade tests in two separate openshift-tests invocations (and thus two junits). Testgrid merges the two, sees the same test twice, and assumes it was a flake that passed once, when in fact the pod-upgrade invocation was a hard fail. Thus sippy thinks this test is passing 99+% of the time when in fact it is often fatally failing on Azure.
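
The misreporting described above can be sketched in a few lines of Python. This is a hypothetical model, not Testgrid's or Sippy's actual code: the function names, the data layout, and the invocation name "e2e" are assumptions made for illustration (only "pod-upgrade" comes from the report). It contrasts a merge keyed on test name alone, which turns one pass plus one fail into a "flake", with a merge that treats a failure in any invocation as a failure.

```python
from collections import defaultdict

TEST = "[sig-network] pods should successfully create sandboxes by not timing out"

# Two junit results from one upgrade job, one per openshift-tests invocation.
# "e2e" is a hypothetical name for the first invocation.
RUNS = [
    {"invocation": "e2e", "results": {TEST: True}},
    {"invocation": "pod-upgrade", "results": {TEST: False}},
]


def merge_naively(junit_runs):
    """Merge by test name only: a pass and a fail together become 'flake'."""
    outcomes = defaultdict(set)
    for run in junit_runs:
        for test, passed in run["results"].items():
            outcomes[test].add(passed)

    merged = {}
    for test, seen in outcomes.items():
        if seen == {True}:
            merged[test] = "pass"
        elif seen == {False}:
            merged[test] = "fail"
        else:
            merged[test] = "flake"  # the hard failure is hidden here
    return merged


def merge_per_invocation(junit_runs):
    """Treat each invocation separately: a failure in any of them is a failure."""
    merged = {}
    for run in junit_runs:
        for test, passed in run["results"].items():
            if not passed:
                merged[test] = "fail"
            else:
                merged.setdefault(test, "pass")
    return merged


if __name__ == "__main__":
    print("naive merge:         ", merge_naively(RUNS)[TEST])
    print("per-invocation merge:", merge_per_invocation(RUNS)[TEST])
```

Since a flake counts as an eventual pass, the first merge lets the job look healthy even though one invocation failed outright.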

Things are underway in related bug https://bugzilla.redhat.com/show_bug.cgi?id=2025967 so we will leave this one closed and move over to the newer bug.

Comment 10 Devan Goodwin 2021-11-30 12:56:32 UTC


*** This bug has been marked as a duplicate of bug 2025967 ***