Bug 1691516 - aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success'
Summary: aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.1.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-21 19:25 UTC by bpeterse
Modified: 2019-06-04 10:46 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:21 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (338.74 KB, image/svg+xml)
2019-03-22 06:31 UTC, W. Trevor King
no flags Details
CreateRouteTable calls from 2019-04-23 through 2019-04-29 (143.75 KB, image/png)
2019-04-29 17:33 UTC, W. Trevor King
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 1442 0 None closed Bug 1659970: terraform/exec/plugins/vendor: Bump terraform-provider-aws to v2.2.0 2020-06-24 15:06:41 UTC
Github openshift installer pull 1481 0 None closed pkg/asset/machines/worker: Structured workers 2020-06-24 15:06:40 UTC
Github openshift installer pull 1682 0 None closed data/aws: 20-minute create timeouts for routes and security groups 2020-06-24 15:06:40 UTC
Github openshift release pull 3285 0 None closed ci-operator/templates/openshift: Explicitly set AWS availability zones 2020-06-24 15:06:40 UTC
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:46:28 UTC

Description bpeterse 2019-03-21 19:25:52 UTC
Description of problem:

upgrade Failed release-openshift-origin-installer-e2e-aws-upgrade-4.0

A snippet of the error logs:

``````````````````````````````
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-21-152907
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route.to_nat_gw[1]: 1 error occurred:"
level=error msg="\t* aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)"
level=error
level=error
``````````````````````````````
as seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437

Version-Release number of the following components:
4.1

How reproducible:
Upgrade Test

Comment 1 Abhinav Dahiya 2019-03-21 22:40:12 UTC
Installer failed because of rate-limiting of AWS APIs

from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437/artifacts/e2e-aws-upgrade/installer/.openshift_install.log

There are a bunch of rate-limit exceeded errors
```
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.098Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Retrying Request ec2/CreateRoute, attempt 4"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: ---[ REQUEST POST-SIGN ]-----------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: POST / HTTP/1.1"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Host: ec2.us-east-1.amazonaws.com"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: User-Agent: aws-sdk-go/1.16.6 (go1.10.3; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.10"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 140"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Authorization: AWS4-HMAC-SHA256 Credential=AKIAJ7WKPAWT2BW5KB2A/20190321/us-east-1/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=c148d75dc7e79addd523dfaaafcb168ed64afb820efd9819f1134e75ba155d9d"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Type: application/x-www-form-urlencoded; charset=utf-8"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: X-Amz-Date: 20190321T174123Z"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Accept-Encoding: gzip"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Action=CreateRoute&DestinationCidrBlock=0.0.0.0%2F0&NatGatewayId=nat-0092145b40c62f90d&RouteTableId=rtb-0cc2b71acf7913565&Version=2016-11-15"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Date: Thu, 21 Mar 2019 17:41:23 GMT"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>4e4a313a-0f6a-44d2-a5f4-a30e85bc38e9</RequestID></Response>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateRoute failed, not retrying, error RequestLimitExceeded: Request limit exceeded."
```

Comment 2 W. Trevor King 2019-03-22 06:27:27 UTC
Previous discussion of AWS rate limiting vs. Terraform in bug 1690069.

Comment 3 W. Trevor King 2019-03-22 06:31:53 UTC
Created attachment 1546785 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 73 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'aws_route.to_nat_gw.* Error creating route: timeout while waiting for state'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 4 W. Trevor King 2019-03-28 22:14:07 UTC
Today we landed a few PRs to help mitigate this:

* installer#1442 bumped our Terraform AWS provider up to v2.2, which may have brought along some efficiency improvements.  I don't have anything specific here about what those might be, beyond thinking "surely they must have fixed some useful bugs since v1.52 went out on 2018-12-13 :p).
* installer#1481 and release#3285 together should have halved our per-zone resource needs, with an associated reduction in AWS API calls and likelyhood of a given cluster hitting an overly-throttled resource.

It's hard to measure the impact separately, since we didn't get a lot of soak time on the Terraform bump before the others landed, but grabbing the last 40 minutes of API consumers, we have:

$ agents-and-events.py <last-40m.csv | grep '^[0-9]'
13172	aws-sdk-go/1.14.12:
11062	terraform:
6828	destroyer:
6319	openshift-machine-api:
3096	cloud-credential-operator-iam-ro:
3051	aws-elasticloadbalancing:
2521	cloud-credential-operator:
...small fry...

That makes Kubernetes our largest AWS API consumer, with the installer (via 'terraform' for creation and 'destroyer' for deletion) seated firmly in second.  I think that's sufficent to mark this MODIFIED, although I'm not sure what the process is for getting it closed from there.  Do we need to get below a certain fraction of these errors over a certain duration of CI jobs?

Comment 6 W. Trevor King 2019-03-29 19:45:42 UTC
So checking rates today, there is only a small decrease in the rates of created NAT gateways:

$ grep 'CreateNatGateway\|DescribeNatGateways\|CreateVpc$' api-consumers-2019-03-2*
api-consumers-2019-03-25.txt:	15585	ec2.DescribeNatGateways
api-consumers-2019-03-25.txt:	1085	ec2.CreateNatGateway
api-consumers-2019-03-25.txt:	152	ec2.CreateVpc
api-consumers-2019-03-26.txt:	66645	ec2.DescribeNatGateways
api-consumers-2019-03-26.txt:	5370	ec2.CreateNatGateway
api-consumers-2019-03-26.txt:	978	ec2.CreateVpc
api-consumers-2019-03-29.txt:	61463	ec2.DescribeNatGateways
api-consumers-2019-03-29.txt:	4466	ec2.CreateNatGateway
api-consumers-2019-03-29.txt:	886	ec2.CreateVpc

with API-consumer files generated with [1].  That's 7.1, 5.5, and 5.0 CreateNatGateway calls per VPC for each of the days where I happened to sync down the nightly 24-hour report.  5.5 to 5.0 is not a particularly inspiring reduction, when I had expected 6 -> 3.  Spot-checking a recent installer PR, we are only creating three NAT gateways there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1483/pull-ci-openshift-installer-master-e2e-aws/4798/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T17:27:36Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-02e3e9699e4cda551)"
time="2019-03-29T17:27:57Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m56s (ID: nat-065ed68f2030ecf80)"
time="2019-03-29T17:28:08Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m7s (ID: nat-08101d95f8e9200f6)"

Same for origin release-promotion gates:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6317/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T18:30:25Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m58s (ID: nat-0851535a532052f00)"
time="2019-03-29T18:30:43Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 2m16s (ID: nat-0d79a329860f3df66)"
time="2019-03-29T18:30:44Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m16s (ID: nat-0123c2fbfdb2098de)"

I'm hunting around to see if I can figure out where all the unexpected CreateNatGateway calls are coming from.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events

Comment 7 W. Trevor King 2019-03-29 21:26:10 UTC
Ok, after we exclude errors, the step is down to ~4 CreateNatGateway calls per VPC [1].  I'm still not sure why that isn't three...

[1]: https://raw.githubusercontent.com/wking/openshift-release/91cde42964c0764985116dd447fe14a4861d0900/vpcs/calls-per-vpc.png

Comment 8 W. Trevor King 2019-03-29 22:13:53 UTC
Previous version of the plot script was confused by the stacked histogram.  Here's a fixed graph [1] generated with [2] that shows the drop is from 6 to 3.


[1]: https://raw.githubusercontent.com/wking/openshift-release/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105/vpcs/calls-per-vpc.png
[2]: https://github.com/wking/openshift-release/commit/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105

Comment 9 Johnny Liu 2019-04-01 06:26:12 UTC
Verified this bug with openshift-install v4.0.22-201903272149-dirty (extracted from 4.0.0-0.nightly-2019-03-29-144824) + 4.0.0-0.nightly-2019-03-28-030453 release payload, this issue is not reproduced, move this to 'VERIFIED'

Comment 10 W. Trevor King 2019-04-29 17:32:13 UTC
Also in this space, [1] landed Saturday with a bump to a 20 minute timeout.  That seems to have reduced the aws_route timeouts, but we still see them (with peak usage in the CI account ~1k successful CreateRouteTable calls with two-hour binning).

[1]: https://github.com/openshift/installer/pull/1682

Comment 11 W. Trevor King 2019-04-29 17:33:44 UTC
Created attachment 1559996 [details]
CreateRouteTable calls from 2019-04-23 through 2019-04-29

Comment 13 errata-xmlrpc 2019-06-04 10:46:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758


Note You need to log in before you can comment on or make changes to this bug.