Bug 1752979 - [GCP] e2e failure: "Failed to import expected imagestreams"
Summary: [GCP] e2e failure: "Failed to import expected imagestreams"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Abhinav Dahiya
QA Contact: sheng.lao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-09-17 17:29 UTC by David Eads
Modified: 2019-10-16 06:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:41:28 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 2376 | 0 | None | closed | Bug 1752979: data/data/gcp/network: increase the NAT ports for control plane to 7168 | 2020-06-02 21:01:51 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:41:41 UTC

Comment 1 Adam Kaplan 2019-09-18 13:58:56 UTC
Moving this to the Networking team.

Per @Oleg, imagestream import already retries metadata import at the HTTP client level for temporary issues and network timeouts [1].

This suggests there are two potential issues (not mutually exclusive):
1. Egress reliability on GCP - can we reliably connect to the rest of the Internet from a GCP cluster?
2. Availability of registry.redhat.io - are we seeing global issues when these flakes happen? If not, is the CDN for registry.redhat.io optimized to favor some cloud providers over others?

[1] https://github.com/openshift/library-go/blob/a00adb84bd57c2e01485ee65a74d247ab7570043/pkg/image/registryclient/client.go#L365
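For readers unfamiliar with the pattern, client-level retrying of temporary errors looks roughly like the sketch below. This is a generic Python illustration, not the library-go code linked in [1]; the function name `call_with_retries` and its parameters are hypothetical. Retries like this only help when the failure is transient, which is why the two issues listed above look past the client for the cause.

```python
import time


def call_with_retries(fn, attempts=3, delay_seconds=0.0):
    """Call fn(), retrying on timeout or connection errors.

    Hypothetical helper for illustration only; `attempts` must be >= 1.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError) as err:
            # Treat timeouts and connection errors as temporary.
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```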

Comment 2 Casey Callendrello 2019-09-18 14:23:46 UTC
(Some context from chat)

The connection is from the openshift-kube-apiserver to the registry, which is external. The basic flow is:

Pod -> Host -> iptables/conntrack -> GCP NAT gateway -> Akamai -> Registry

Any of these could be failing. Given that we don't see similar issues on other platforms, it is more likely either a misconfiguration in the NAT router, or that we're not handling the fact that GCP clusters have a sub-1500 MTU.

Anyways, it looks like Abhinav has been doing a lot more GCP NAT gateway work than I have.

Comment 3 Casey Callendrello 2019-09-18 14:30:35 UTC
We allocate 256 ports per VM. Since every connection is to the same IP (as is the case for registry.redhat.io), more than 256 concurrent requests to the registry from the same node will cause connections to fail. That could be the cause of the problem.

We already have a separate NAT gateway for the control plane subnetwork. Given that there is a single-digit number of control plane nodes and there are ~64k available ports, we might as well bump that from 256 to 5000.
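The port arithmetic can be sketched in a few lines of Python. This is a simplified model, not installer code: it assumes one NAT IP address offers 64,512 usable source ports (65,536 minus the 1,024 well-known ports, the "~64k" above), and it ignores port reuse timing. The function names are hypothetical. The 7168 value is the one named in the linked pull request title.

```python
# Assumption: one Cloud NAT IP address offers 64,512 usable source ports.
PORTS_PER_NAT_IP = 64_512


def max_vms_per_nat_ip(min_ports_per_vm: int) -> int:
    """How many VMs one NAT IP can serve at a given per-VM port reservation."""
    return PORTS_PER_NAT_IP // min_ports_per_vm


def failed_connections(concurrent: int, min_ports_per_vm: int) -> int:
    """Concurrent connections to one destination that cannot get a source port."""
    return max(0, concurrent - min_ports_per_vm)


if __name__ == "__main__":
    for ports in (256, 5000, 7168):
        # Columns: ports per VM, VMs servable per NAT IP,
        # connections failing out of 300 concurrent requests.
        print(ports, max_vms_per_nat_ip(ports), failed_connections(300, ports))
```

Under these assumptions, 7168 ports per VM divides the pool evenly across nine VMs, which still covers a single-digit number of control plane nodes.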

Comment 5 sheng.lao 2019-09-20 12:52:51 UTC
Verified it with 4.2.0-0.nightly-2019-09-20-090334

Got the following value from `https://console.cloud.google.com/net-services/nat`:
 Minimum ports per VM instance   7168

Comment 6 errata-xmlrpc 2019-10-16 06:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

