Bug 1752979

Summary: [GCP] e2e failure: "Failed to import expected imagestreams"
Product: OpenShift Container Platform
Component: Installer
Sub component: openshift-installer
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Type: Bug
Reporter: David Eads <deads>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: sheng.lao <shlao>
CC: adam.kaplan, aos-bugs, bbennett, jokerman, wzheng
Last Closed: 2019-10-16 06:41:28 UTC

Comment 1 Adam Kaplan 2019-09-18 13:58:56 UTC
Moving this to the Networking team.

Per @Oleg, imagestream import already retries metadata import at the HTTP client level for temporary issues and network timeouts [1].

This suggests there are two potential issues (not mutually exclusive):
1. Egress reliability on GCP - can we reliably connect to the rest of the Internet from a GCP cluster?
2. Availability of registry.redhat.io - are we seeing global issues when these flakes happen? If not, is the CDN for registry.redhat.io optimized to favor some cloud providers over others?

[1] https://github.com/openshift/library-go/blob/a00adb84bd57c2e01485ee65a74d247ab7570043/pkg/image/registryclient/client.go#L365
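
For illustration, a minimal sketch of that retry pattern, assuming a plain net/http client; the function name, attempt count, and backoff policy here are illustrative, not the actual library-go implementation:

package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "time"
)

// getWithRetry re-issues a GET only when the failure is a network
// timeout, i.e. the kind of temporary error an HTTP-level retry is
// meant to absorb. Anything else is treated as permanent.
func getWithRetry(url string, attempts int) (*http.Response, error) {
    var lastErr error
    for i := 0; i < attempts; i++ {
        resp, err := http.Get(url)
        if err == nil {
            return resp, nil
        }
        var netErr net.Error
        if !errors.As(err, &netErr) || !netErr.Timeout() {
            return nil, err // permanent failure: do not retry
        }
        lastErr = err
        time.Sleep(time.Second << uint(i)) // simple exponential backoff
    }
    return nil, fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

A retry like this only papers over transient failures, though; if connections are being dropped at the NAT layer for an extended period, every attempt fails and the import error surfaces anyway.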

Comment 2 Casey Callendrello 2019-09-18 14:23:46 UTC
(Some context from chat)

The connection is from the openshift-kube-apiserver to the registry, which is external. The basic flow is:

Pod -> Host -> iptables/conntrack -> GCP NAT gateway -> Akamai -> Registry

Any of these could be failing. Since we don't see similar issues on other platforms, it's more likely that either the NAT router is misconfigured or we're not handling the fact that GCP clusters have a sub-1500 MTU.
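
On the MTU point: stock GCP VMs default to a 1460-byte MTU on the primary NIC. A quick sketch for checking what the node's interfaces actually report (plain Go stdlib, nothing cluster-specific):

package main

import (
    "fmt"
    "net"
)

// Print the MTU of every interface on the node; on a stock GCP VM
// the primary NIC typically reports 1460 rather than 1500.
func main() {
    ifaces, err := net.Interfaces()
    if err != nil {
        panic(err)
    }
    for _, ifc := range ifaces {
        fmt.Printf("%-12s MTU %d\n", ifc.Name, ifc.MTU)
    }
}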

Anyway, it looks like Abhinav has done a lot more work on the GCP NAT gateway than I have.

Comment 3 Casey Callendrello 2019-09-18 14:30:35 UTC
We allocate 256 NAT ports per VM. Since every connection goes to the same destination IP (that of registry.redhat.io), each one needs a distinct NAT source port; once there are more than 256 concurrent connections to the registry from the same node, the excess connections will fail. That could be the cause of the problem.

We already have a separate NAT gateway for the control-plane subnetwork. Given that there is only a single-digit number of control-plane nodes and roughly 64k ports available, we might as well bump the per-VM allocation from 256 to 5000.
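
For reference, a rough sketch of what that change amounts to against the GCP API, using the google.golang.org/api/compute/v1 client; the project, region, and router names below are placeholders, and the installer applies this through its own provisioning code rather than a script like this:

package main

import (
    "context"
    "log"

    compute "google.golang.org/api/compute/v1"
)

func main() {
    ctx := context.Background()
    svc, err := compute.NewService(ctx)
    if err != nil {
        log.Fatal(err)
    }
    // Placeholder identifiers for the control-plane NAT router.
    project, region, routerName := "my-project", "us-east1", "master-nat-router"

    router, err := svc.Routers.Get(project, region, routerName).Context(ctx).Do()
    if err != nil {
        log.Fatal(err)
    }
    for _, nat := range router.Nats {
        nat.MinPortsPerVm = 5000 // up from the 256 we were allocating
    }
    if _, err := svc.Routers.Patch(project, region, routerName, router).Context(ctx).Do(); err != nil {
        log.Fatal(err)
    }
}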

Comment 5 sheng.lao 2019-09-20 12:52:51 UTC
Verified with 4.2.0-0.nightly-2019-09-20-090334.

The NAT configuration at `https://console.cloud.google.com/net-services/nat` now reports:
 Minimum ports per VM instance: 7168

Comment 6 errata-xmlrpc 2019-10-16 06:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922