Bug 1844320 - AWS flake: level=error msg="Error: Unable to find matching route for Route Table (...) and destination CIDR block (0.0.0.0/0).
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: John Hixson
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks: 1853614
Reported: 2020-06-05 04:50 UTC by W. Trevor King
Modified: 2020-10-27 16:05 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1853614
Environment:
Last Closed: 2020-10-27 16:05:23 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 3837 | 0 | None | closed | Bug 1844320: Master update terraform provider aws 2.67.0 | 2021-01-15 18:26:52 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:05:49 UTC

Description W. Trevor King 2020-06-05 04:50:42 UTC
I've seen this a couple of times recently when looking at why PR presubmits failed.  Checking CI search [1] turns up alarming numbers like:

release-openshift-ocp-installer-e2e-aws-fips-4.5 - 41 runs, 27% failed, 36% of failures match

Picking a particular example to ground investigation [2]:

level=error msg="Error: Unable to find matching route for Route Table (rtb-0bd15d60486c91ec2) and destination CIDR block (0.0.0.0/0)."
level=error
level=error msg="  on ../tmp/openshift-install-106441498/vpc/vpc-private.tf line 14, in resource \"aws_route\" \"to_nat_gw\":"
level=error msg="  14: resource \"aws_route\" \"to_nat_gw\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

And looking at PR presubmits [3]:

  Across 3943 runs and 269 jobs (64.09% failed), matched 6.49% of failing runs and 25.65% of jobs in 112ms

Looks like this is another AWS-eventual-consistency vs. Terraform-provider bug, which is being tracked upstream in [4], but I don't see an upstream PR yet.  It's also possible that we could address it by raising the aws_route create timeout [5].
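
For reference, raising the create timeout mentioned above would look roughly like the sketch below. This is an illustration only, not the installer's actual vpc-private.tf: the two variables and the "20m" value are assumptions chosen for the example, and only the resource type, the resource name, and the 0.0.0.0/0 destination come from the error output.

```hcl
# Hypothetical inputs; the installer wires these IDs up differently.
variable "private_route_table_id" {
  type = string
}

variable "nat_gateway_id" {
  type = string
}

resource "aws_route" "to_nat_gw" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = var.nat_gateway_id

  # Give the provider longer to find the new route after creating it.
  # "20m" is an arbitrary illustrative value.
  timeouts {
    create = "20m"
  }
}
```

A longer timeout would only help if the provider keeps retrying the route lookup until the timeout expires, so it would be a workaround rather than a fix for the underlying provider behavior.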

Whatever we do in this space, I'd consider backporting to 4.5.  I don't see anything in 4.4 or earlier, so the issue might be due to a 4.4 -> 4.5 Terraform pivot of some sort, although I haven't checked the installer codebase to see what we've done in that space.

[1]: https://search.svc.ci.openshift.org/?search=Unable+to+find+matching+route+for+Route+Table&maxAge=168h&context=1&type=junit&name=release-openshift-ocp&groupBy=job
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/2273#1:build-log.txt%3A35
[3]: https://search.apps.build01.ci.devcluster.openshift.com/?search=Unable+to+find+matching+route+for+Route+Table&maxAge=168h&context=1&type=junit&name=%5Epull-ci-.*-e2e-aws&maxMatches=5&maxBytes=20971520&groupBy=job
[4]: https://github.com/terraform-providers/terraform-provider-aws/issues/13138
[5]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts

Comment 1 Brenton Leanhardt 2020-06-05 13:08:59 UTC
I agree we should backport this.  It seems like an upstream patch is being proposed.

Comment 2 Yu Qi Zhang 2020-06-11 17:17:42 UTC
Bumping priority a bit.  Aside from release jobs, this has also been showing up a lot in PR jobs for 4.5.

Comment 4 John Hixson 2020-06-30 02:16:15 UTC
This issue has been addressed in PR https://github.com/terraform-providers/terraform-provider-aws/pull/13747, which has been merged. It is available in version 2.67.0 of the terraform-provider-aws plugin, so we will need to update. Coming soon.
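
For a standalone Terraform configuration, requiring a provider release that contains the fix could be sketched as below. This is not a description of what the installer PR changes; it assumes Terraform 0.13 or later syntax, and only the 2.67.0 version number comes from this comment.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # 2.67.0 is the first release reported to contain the fix.
      version = ">= 2.67.0"
    }
  }
}
```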

Comment 5 John Hixson 2020-07-06 19:09:25 UTC
PR: https://github.com/openshift/installer/pull/3837

Comment 8 Yunfei Jiang 2020-07-13 06:01:23 UTC
Verified. PASS.

After PR 3837 merged, this error has not occurred on 4.6 (over approximately 6 days).

Marking this bug as VERIFIED.

Comment 10 errata-xmlrpc 2020-10-27 16:05:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

