1752979 – [GCP] e2e failure: "Failed to import expected imagestreams"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Abhinav Dahiya
QA Contact:	sheng.lao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-09-17 17:29 UTC by David Eads
Modified:	2019-10-16 06:41 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:41:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 2376	0	None	closed	Bug 1752979: data/data/gcp/network: increase the NAT ports for control plane to 7168	2020-06-02 21:01:51 UTC
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:41:41 UTC

Description David Eads 2019-09-17 17:29:11 UTC

Multiple different tests all report

```
fail [github.com/openshift/origin/test/extended/builds/new_app.go:33]: Unexpected error:
    <*errors.errorString | 0xc003657e10>: {
        s: "Failed to import expected imagestreams",
    }
    Failed to import expected imagestreams
occurred
```

There are examples in:
1. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/300
2. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/301
3. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/298
4. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/306
5. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/310

This error accounts for half of all failed GCP jobs in the past day. https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2&sort-by-failures=&sort-by-flakiness=10

Comment 1 Adam Kaplan 2019-09-18 13:58:56 UTC

Moving this to the Networking team.

Per @Oleg, imagestream import already retries metadata import at the HTTP client level for temporary issues and network timeouts [1].

This suggests there are two potential issues (not mutually exclusive):
1. Egress reliability on GCP - can we reliably connect to the rest of the Internet from a GCP cluster?
2. Availability of registry.redhat.io - are we seeing global issues when these flakes happen? If not, is the CDN for registry.redhat.io optimized to favor some cloud providers over others?

[1] https://github.com/openshift/library-go/blob/a00adb84bd57c2e01485ee65a74d247ab7570043/pkg/image/registryclient/client.go#L365

Comment 2 Casey Callendrello 2019-09-18 14:23:46 UTC

(Some context from chat)

The connection is from the openshift-kube-apiserver to the registry, which is external. The basic flow diagram is

Pod -> Host -> iptables/conntrack -> GCP NAT gateway -> Akamai -> Registry

Any of these could be failing. Given that we don't see similar issues on other platforms, it is more likely that either the NAT router is misconfigured, or we're not handling the fact that GCP clusters have a sub-1500 MTU.

Anyway, it looks like Abhinav has done a lot more GCP NAT gateway work than I have.

Comment 3 Casey Callendrello 2019-09-18 14:30:35 UTC

We allocate 256 ports per VM. Every connection to registry.redhat.io goes to the same IP, so if there are more than 256 concurrent requests to the registry from the same node, connections will fail. That could be the cause of the problem.

We already have a separate NAT gateway for the control plane subnetwork. Given that there is a single-digit number of control plane nodes, and there are ~64k available ports, we might as well bump that from 256 to 5000.
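
The port arithmetic above can be sketched as follows. This is only a rough illustration, not the installer's logic: it assumes a single NAT IP with 64,512 usable source ports (65,536 minus the 1,024 reserved ports, which the comment rounds to ~64k), and the function names are hypothetical. It compares the original 256, the proposed 5000, and the 7168 that the linked installer pull request settled on.

```python
# Assumption: one NAT IP offers 64,512 usable source ports
# (65,536 minus the 1,024 reserved ports); the comment rounds this to ~64k.
USABLE_PORTS_PER_NAT_IP = 64_512


def max_vms_per_nat_ip(min_ports_per_vm: int,
                       usable_ports: int = USABLE_PORTS_PER_NAT_IP) -> int:
    """Number of VMs a single NAT IP can serve at a given per-VM allocation."""
    if min_ports_per_vm <= 0:
        raise ValueError("min_ports_per_vm must be positive")
    return usable_ports // min_ports_per_vm


def would_exhaust(concurrent_connections: int, min_ports_per_vm: int) -> bool:
    """True when concurrent connections from one VM to a single destination
    IP and port exceed that VM's source-port allocation, since each such
    connection needs its own source port."""
    return concurrent_connections > min_ports_per_vm


if __name__ == "__main__":
    for ports in (256, 5000, 7168):
        print(f"{ports} ports/VM -> up to {max_vms_per_nat_ip(ports)} VMs per NAT IP")
```

Under these assumptions, 7168 ports per VM divides the 64,512 ports evenly among nine VMs, which still leaves room for a single-digit number of control plane nodes.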

Comment 5 sheng.lao 2019-09-20 12:52:51 UTC

Verified it with 4.2.0-0.nightly-2019-09-20-090334

Got the following value from `https://console.cloud.google.com/net-services/nat`:
 Minimum ports per VM instance   7168
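
The same check could presumably be scripted instead of read from the console. The sketch below assumes the NAT configuration has been exported as JSON and that it carries a `minPortsPerVm` field, as the Compute Engine router NAT resource does; the sample document and the function name are hypothetical, not output captured from this cluster.

```python
import json


def min_ports_at_least(nat_json: str, expected: int) -> bool:
    """Return True if the NAT config's minPortsPerVm is at least `expected`.

    A missing field is treated as 0, so the check fails instead of passing
    silently.
    """
    nat = json.loads(nat_json)
    return int(nat.get("minPortsPerVm", 0)) >= expected


# Hypothetical sample; a real export would contain many more fields.
sample = '{"name": "example-nat-master", "minPortsPerVm": 7168}'

if __name__ == "__main__":
    print(min_ports_at_least(sample, 7168))
```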

Comment 6 errata-xmlrpc 2019-10-16 06:41:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
