Bug 1752979
| Summary: | [GCP] e2e failure: "Failed to import expected imagestreams" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | sheng.lao <shlao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | adam.kaplan, aos-bugs, bbennett, jokerman, wzheng |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:41:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Eads
2019-09-17 17:29:11 UTC
Moving this to the Networking team. Per @Oleg, imagestream import already retries metadata import at the HTTP client level for temporary issues and network timeouts [1]. This suggests two potential issues (not mutually exclusive):

1. Egress reliability on GCP: can we reliably connect to the rest of the Internet from a GCP cluster?
2. Availability of registry.redhat.io: are we seeing global issues when these flakes happen? If not, is the CDN for registry.redhat.io optimized to favor some cloud providers over others?

[1] https://github.com/openshift/library-go/blob/a00adb84bd57c2e01485ee65a74d247ab7570043/pkg/image/registryclient/client.go#L365

(Some context from chat) The connection is from the openshift-kube-apiserver to the registry, which is external. The basic flow is:

Pod -> Host -> iptables/conntrack -> GCP NAT gateway -> Akamai -> Registry

Any of these could be failing. Given that we don't see similar issues on other platforms, it is more likely that either the NAT router is misconfigured or that we're not handling the fact that GCP clusters have a sub-1500 MTU. Anyway, Abhinav has done a lot more work on the GCP NAT gateway than I have.

We allocate 256 ports per VM. Since every connection goes to the same IP (registry.redhat.io), connections will fail if there are more than 256 concurrent requests to the registry from the same node. That could be the cause of the problem. We already have a separate NAT gateway for the control-plane subnetwork. Given that there is a single-digit number of control-plane nodes and ~64k available ports, we might as well bump the allocation from 256 to 5000.

Verified with 4.2.0-0.nightly-2019-09-20-090334. Got the following value from `https://console.cloud.google.com/net-services/nat`:

Minimum ports per VM instance: 7168

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922