Bug 2011939
| Summary: | [sig-network] pods should successfully create sandboxes by not timing out | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | aos-bugs, astoycos, danw, mcambria, sippy, vpickard, wking |
| Version: | 4.10 | Keywords: | DeliveryBlocker |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | job=periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade |
| Last Closed: | 2021-11-09 18:30:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Devan Goodwin
2021-10-07 19:08:08 UTC
Doh, I should have caught this before:

> pinging container registry registry.ci.openshift.org: Get "https://registry.ci.openshift.org/v2/": dial tcp 3.210.253.73:443: i/o timeout

"i/o timeout" on an https connection almost always means an MTU problem. ("i/o timeout" as opposed to "connection refused" or "no route to host" means that it successfully *connected*, but then failed to communicate after that point. This tends to happen on https connections when the MTU is wrong and PMTU discovery doesn't work, because the initial TCP handshake goes through but then the packet with the TLS certificate is too big and gets dropped, and then both sides are waiting for the other to say something, until one of them times out.)

Anyway, sporadic inexplicable MTU problems out of nowhere are one of Azure's undocumented special features. I don't think we had previously seen them with *host-network* traffic though...

drop-icmp.log from master-0: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1446112986753142784/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-sdn_sdn-jwmhc_drop-icmp.log

There are errors in that log, but they seem to be "expected". (We should probably do ">& /dev/null" on the commands that are expected to possibly fail...)

I see that there's a '-j LOG' rule, but I can't find anything logged as a result in master-0's journal... I don't know if that means that we never got any weird ICMP messages from Azure, or just that the logging ends up somewhere else... (Though I guess that since this is a node-to-cluster-external connection, if there was going to be a spurious ICMP message, it would come from a non-node IP, so it wouldn't hit those rules anyway? Maybe we should log from AZURE_CHECK_ICMP_SOURCE rather than from AZURE_ICMP_ACTION so it logs the messages we *don't* drop too?)
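To make the MTU reasoning above concrete, here is a rough sketch of the arithmetic. It assumes plain IPv4 and TCP headers with no options, and a path whose real MTU is smaller than the sender believes, with the ICMP "fragmentation needed" replies getting lost so PMTU discovery cannot correct it. The numbers 1500 and 1400 are illustrative only and were not measured on the failing cluster.

```python
# Illustrative header sizes; real packets may carry IP or TCP options.
IPV4_HEADER = 20  # bytes
TCP_HEADER = 20   # bytes


def packet_size(tcp_payload: int) -> int:
    """Size of an IPv4 packet carrying tcp_payload bytes of TCP data."""
    return IPV4_HEADER + TCP_HEADER + tcp_payload


def fits(tcp_payload: int, path_mtu: int) -> bool:
    """True if the packet can cross a hop with this MTU unfragmented."""
    return packet_size(tcp_payload) <= path_mtu


def handshake_report(sender_mtu: int, path_mtu: int) -> dict:
    """Compare a bare SYN with a full-size data segment on the same path."""
    # Largest TCP payload the sender will put in one packet.
    mss = sender_mtu - IPV4_HEADER - TCP_HEADER
    return {
        # Handshake packets carry no payload, so they are tiny.
        "syn_fits": fits(0, path_mtu),
        # A long TLS certificate chain fills whole segments.
        "full_segment_fits": fits(mss, path_mtu),
    }


# Hypothetical values: sender assumes 1500, the path only carries 1400.
print(handshake_report(sender_mtu=1500, path_mtu=1400))
```

With these assumed values the SYN fits and the full-size segment does not, which matches the pattern described above: the connection opens, then stalls once a large packet is sent.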
No other way to say it, this issue appears to have resolved itself. Using the link in the original report, the test is now passing 98+% of the time. If the theory from Dan and the network team was correct and this was an Azure networking MTU issue, it looks like Azure may have resolved it on their end. At this point I think we can close; we no longer have a reproducer. Will reopen if the problem resurfaces.

We discovered that testgrid and sippy are both misreporting this test, which runs twice during upgrade tests in two separate openshift-tests invocations (and thus two junits). Testgrid merges the two, sees the same test twice, and assumes it was a flake that passed once, when in fact the pod-upgrade invocation was a hard fail. Thus sippy thinks this test is passing 99+% of the time when in fact it's often fatally failing on Azure.

Things are underway in related bug https://bugzilla.redhat.com/show_bug.cgi?id=2025967 so we will leave this one closed and move over to the newer bug.

*** This bug has been marked as a duplicate of bug 2025967 ***
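The misreporting described above can be sketched in a few lines. This is not testgrid's or sippy's actual code; the function names, the `"e2e"` invocation label, and the sample results are hypothetical, chosen only to show why merging two junits by test name alone turns a hard failure into an apparent flake.

```python
from collections import defaultdict

TEST = "[sig-network] pods should successfully create sandboxes by not timing out"

# Hypothetical results as (invocation, test name, passed).
results = [
    ("e2e", TEST, True),           # passed in one openshift-tests invocation
    ("pod-upgrade", TEST, False),  # hard fail in the other invocation
]


def classify(outcomes):
    """Mixed pass/fail outcomes for one key are treated as a flake."""
    if all(outcomes):
        return "pass"
    if any(outcomes):
        return "flake"
    return "fail"


def merged_by_name(rows):
    """Group by test name only, as if both junits were one run."""
    grouped = defaultdict(list)
    for _invocation, name, passed in rows:
        grouped[name].append(passed)
    return {name: classify(o) for name, o in grouped.items()}


def per_invocation(rows):
    """Group by (invocation, test name) so each junit is judged separately."""
    grouped = defaultdict(list)
    for invocation, name, passed in rows:
        grouped[(invocation, name)].append(passed)
    return {key: classify(o) for key, o in grouped.items()}


print(merged_by_name(results))
print(per_invocation(results))
```

Grouping by name alone reports a flake, while keeping the invocation in the key surfaces the pod-upgrade failure.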