Bug 1844320

Summary: AWS flake: level=error msg="Error: Unable to find matching route for Route Table (...) and destination CIDR block (0.0.0.0/0).
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Installer Assignee: John Hixson <jhixson>
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: adahiya, akiselev, bleanhar, dgoodwin, jerzhang
Version: 4.5   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1853614 Environment:
Last Closed: 2020-10-27 16:05:23 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1853614    

Description W. Trevor King 2020-06-05 04:50:42 UTC
I've seen this a couple of times recently when looking at why PR presubmits failed.  Checking CI search [1] turns up alarming numbers like:

release-openshift-ocp-installer-e2e-aws-fips-4.5 - 41 runs, 27% failed, 36% of failures match

Picking a particular example to ground investigation [2]:

level=error msg="Error: Unable to find matching route for Route Table (rtb-0bd15d60486c91ec2) and destination CIDR block (0.0.0.0/0)."
level=error
level=error msg="  on ../tmp/openshift-install-106441498/vpc/vpc-private.tf line 14, in resource \"aws_route\" \"to_nat_gw\":"
level=error msg="  14: resource \"aws_route\" \"to_nat_gw\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
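
To triage build logs for this failure, the route table ID and CIDR block can be pulled out of the error line with a regular expression. This is only a sketch: the pattern is derived from the single log excerpt above, and the function and group names are made up for illustration.

```python
import re

# Pattern built from the error text quoted above; the group names are illustrative.
ROUTE_ERROR = re.compile(
    r"Unable to find matching route for Route Table \((?P<route_table>rtb-[0-9a-f]+)\) "
    r"and destination CIDR block \((?P<cidr>[^)]+)\)"
)


def find_route_errors(log_text):
    """Return a (route_table_id, cidr) pair for every matching error in the log."""
    return [
        (match.group("route_table"), match.group("cidr"))
        for match in ROUTE_ERROR.finditer(log_text)
    ]


sample = (
    'level=error msg="Error: Unable to find matching route for Route Table '
    '(rtb-0bd15d60486c91ec2) and destination CIDR block (0.0.0.0/0)."'
)
print(find_route_errors(sample))
```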

And looking at PR presubmits [3]:

  Across 3943 runs and 269 jobs (64.09% failed), matched 6.49% of failing runs and 25.65% of jobs in 112ms
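
For a rough sense of scale, the percentages reported by CI search can be turned into approximate run counts. The sketch below assumes the "failed" percentage applies to total runs and the "match" percentage applies to failed runs; that reading of the search output is an assumption, and the helper name is made up.

```python
def estimate_counts(total_runs, failed_pct, match_pct):
    """Estimate (failed_runs, matching_runs) from the reported percentages."""
    failed = total_runs * failed_pct / 100
    matching = failed * match_pct / 100
    return round(failed), round(matching)


# release-openshift-ocp-installer-e2e-aws-fips-4.5: 41 runs, 27% failed, 36% match
print(estimate_counts(41, 27, 36))

# PR presubmits: 3943 runs, 64.09% failed, 6.49% of failing runs match
print(estimate_counts(3943, 64.09, 6.49))
```

Under that assumption, roughly 4 of the 41 FIPS runs and roughly 164 of the 3943 presubmit runs hit this error.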

Looks like this is another AWS-eventual-consistency vs. Terraform-provider bug, which is being tracked upstream in [4], but I don't see an upstream PR yet.  It's also possible that we could address it by raising the aws_route create timeout [5].
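
The general shape of an eventual-consistency workaround is to keep retrying the lookup until the route becomes visible or a timeout expires. The following is a minimal sketch of that pattern only, not the provider's actual implementation: `RouteNotFound`, `wait_for_route`, and `make_flaky_lookup` are hypothetical names, and the timeout values are illustrative.

```python
import time


class RouteNotFound(Exception):
    """Raised by the (hypothetical) lookup when the route is not visible yet."""


def wait_for_route(find_route, timeout=120.0, interval=1.0):
    """Call find_route() until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find_route()
        except RouteNotFound:
            if time.monotonic() >= deadline:
                raise  # give up: the route never became visible in time
            time.sleep(interval)


def make_flaky_lookup(misses):
    """Simulate an API that reports 'not found' for the first `misses` calls."""
    state = {"calls": 0}

    def lookup():
        state["calls"] += 1
        if state["calls"] <= misses:
            raise RouteNotFound("route not visible yet")
        return {"destination_cidr_block": "0.0.0.0/0"}

    return lookup


route = wait_for_route(make_flaky_lookup(3), timeout=5.0, interval=0.01)
print(route)
```

Raising the timeout widens the window in which a late-appearing route is still accepted, at the cost of slower failures when the route is genuinely missing.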

Whatever we do in this space, I'd consider backporting to 4.5.  I don't see anything in 4.4 or earlier, so the issue might be due to a 4.4 -> 4.5 Terraform pivot of some sort, although I haven't checked the installer codebase to see what we've done in that space.

[1]: https://search.svc.ci.openshift.org/?search=Unable+to+find+matching+route+for+Route+Table&maxAge=168h&context=1&type=junit&name=release-openshift-ocp&groupBy=job
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2273#1:build-log.txt%3A35
[3]: https://search.apps.build01.ci.devcluster.openshift.com/?search=Unable+to+find+matching+route+for+Route+Table&maxAge=168h&context=1&type=junit&name=%5Epull-ci-.*-e2e-aws&maxMatches=5&maxBytes=20971520&groupBy=job
[4]: https://github.com/terraform-providers/terraform-provider-aws/issues/13138
[5]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts

Comment 1 Brenton Leanhardt 2020-06-05 13:08:59 UTC
I agree we should backport this.  It seems like an upstream patch is being proposed.

Comment 2 Yu Qi Zhang 2020-06-11 17:17:42 UTC
Bumping priority a bit. Aside from release jobs, this has been showing up a lot in 4.5 PR jobs as well.

Comment 4 John Hixson 2020-06-30 02:16:15 UTC
This issue has been addressed in PR https://github.com/terraform-providers/terraform-provider-aws/pull/13747, which has been merged. It is available in version 2.67.0 of the terraform-provider-aws plugin, so we will need to update. Coming soon.
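
A quick way to check whether a given provider version already carries the fix is to compare it numerically against 2.67.0. This is a sketch with hypothetical helper names; the version strings other than 2.67.0 are examples, not versions the installer is known to have shipped.

```python
def parse_version(text):
    """Turn a dotted version such as '2.67.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# First terraform-provider-aws release containing the fix, per the comment above.
MIN_FIXED = parse_version("2.67.0")


def has_route_fix(provider_version):
    """Return True if the provider version is 2.67.0 or newer."""
    return parse_version(provider_version) >= MIN_FIXED


for candidate in ("2.49.0", "2.67.0", "2.100.0"):  # example inputs only
    print(candidate, has_route_fix(candidate))
```

Comparing integer tuples rather than raw strings keeps a version like 2.100.0 from sorting before 2.67.0.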

Comment 5 John Hixson 2020-07-06 19:09:25 UTC
PR: https://github.com/openshift/installer/pull/3837

Comment 8 Yunfei Jiang 2020-07-13 06:01:23 UTC
Verified. PASS.

After PR 3837 merged, no such error occurred on 4.6 over approximately 6 days.

Marking this bug as VERIFIED.

Comment 10 errata-xmlrpc 2020-10-27 16:05:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196