Description of problem:

Upgrade failed in release-openshift-origin-installer-e2e-aws-upgrade-4.0. A snippet of the error logs:

```
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-21-152907
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route.to_nat_gw[1]: 1 error occurred:"
level=error msg="\t* aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)"
level=error
level=error
```

as seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437

Version-Release number of the following components: 4.1

How reproducible: Upgrade test
Installer failed because of rate-limiting of AWS APIs. In https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437/artifacts/e2e-aws-upgrade/installer/.openshift_install.log there are a bunch of rate-limit-exceeded errors:

```
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.098Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Retrying Request ec2/CreateRoute, attempt 4"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: ---[ REQUEST POST-SIGN ]-----------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: POST / HTTP/1.1"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Host: ec2.us-east-1.amazonaws.com"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: User-Agent: aws-sdk-go/1.16.6 (go1.10.3; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.10"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 140"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Authorization: AWS4-HMAC-SHA256 Credential=AKIAJ7WKPAWT2BW5KB2A/20190321/us-east-1/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=c148d75dc7e79addd523dfaaafcb168ed64afb820efd9819f1134e75ba155d9d"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Type: application/x-www-form-urlencoded; charset=utf-8"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: X-Amz-Date: 20190321T174123Z"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Accept-Encoding: gzip"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Action=CreateRoute&DestinationCidrBlock=0.0.0.0%2F0&NatGatewayId=nat-0092145b40c62f90d&RouteTableId=rtb-0cc2b71acf7913565&Version=2016-11-15"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Date: Thu, 21 Mar 2019 17:41:23 GMT"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>4e4a313a-0f6a-44d2-a5f4-a30e85bc38e9</RequestID></Response>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateRoute failed, not retrying, error RequestLimitExceeded: Request limit exceeded."
```
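The "Retrying Request ec2/CreateRoute, attempt 4" lines above are aws-sdk-go's exponential-backoff retry loop at work; it keeps retrying throttled calls with growing, jittered delays until the Terraform resource's own timeout (here 2m0s) expires. A minimal Python sketch of that backoff shape (the `backoff_delays` helper and its parameters are illustrative, not the SDK's actual implementation):

```python
import random


def backoff_delays(max_attempts, base=0.5, cap=20.0, seed=None):
    """Sketch of exponential backoff with full jitter, similar in
    spirit to what aws-sdk-go does on RequestLimitExceeded.

    Returns the list of sleep intervals (seconds) between retries.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts):
        # Each retry may wait up to base * 2^attempt seconds, capped,
        # with the actual delay drawn uniformly below that ceiling.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


print(backoff_delays(4, seed=42))
```

With a 2-minute resource timeout and delays capped like this, a badly throttled call simply runs out of attempts, which is the failure mode in the log above.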
Previous discussion of AWS rate limiting vs. Terraform in bug 1690069.
Created attachment 1546785 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 73 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

```
$ deck-build-log-plot 'aws_route.to_nat_gw.* Error creating route: timeout while waiting for state'
```

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
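The counting behind that plot is essentially "how many job logs match this regex". A rough Python sketch (the `count_matching_jobs` helper and its dict-of-logs input shape are hypothetical; the real deck-build-log script fetches logs from GCS):

```python
import re

# The same signature passed to deck-build-log-plot above.
PATTERN = re.compile(
    r"aws_route\.to_nat_gw.* Error creating route: "
    r"timeout while waiting for state"
)


def count_matching_jobs(build_logs):
    """Count how many build logs contain the timeout signature.

    build_logs: mapping of job name -> full build-log text.
    """
    return sum(1 for log in build_logs.values() if PATTERN.search(log))
```

Running that over every failed *-e2e-aws* log in the window gives the 73-of-861 figure quoted above.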
Today we landed a few PRs to help mitigate this:

* installer#1442 bumped our Terraform AWS provider up to v2.2, which may have brought along some efficiency improvements. I don't have anything specific here about what those might be, beyond thinking "surely they must have fixed some useful bugs since v1.52 went out on 2018-12-13" :p
* installer#1481 and release#3285 together should have halved our per-zone resource needs, with an associated reduction in AWS API calls and in the likelihood of a given cluster hitting an overly-throttled resource.

It's hard to measure the impact separately, since we didn't get a lot of soak time on the Terraform bump before the others landed, but grabbing the last 40 minutes of API consumers, we have:

```
$ agents-and-events.py <last-40m.csv | grep '^[0-9]'
13172 aws-sdk-go/1.14.12:
11062 terraform:
6828 destroyer:
6319 openshift-machine-api:
3096 cloud-credential-operator-iam-ro:
3051 aws-elasticloadbalancing:
2521 cloud-credential-operator:
...small fry...
```

That makes Kubernetes our largest AWS API consumer, with the installer (via 'terraform' for creation and 'destroyer' for deletion) seated firmly in second. I think that's sufficient to mark this MODIFIED, although I'm not sure what the process is for getting it closed from there. Do we need to get below a certain fraction of these errors over a certain duration of CI jobs?
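The per-agent tally above can be approximated from a CloudTrail-style CSV export along these lines (the `userAgent` column name and the first-token bucketing rule are assumptions for illustration; the real agents-and-events.py defines its own buckets):

```python
import collections
import csv
import io


def tally_agents(csv_text):
    """Tally CloudTrail-style event rows by a coarse user-agent bucket.

    Returns (bucket, count) pairs, most frequent first, matching the
    shape of the agents-and-events.py output quoted above.
    """
    counts = collections.Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Bucket by the first token of the user agent, so e.g. all
        # Terraform versions collapse into one 'terraform' bucket.
        bucket = row["userAgent"].split(" ")[0]
        counts[bucket] += 1
    return counts.most_common()
```

The dominance of the generic 'aws-sdk-go' bucket is what supports the "Kubernetes is our largest consumer" reading, since the kube cloud provider identifies itself that way.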
So checking rates today, there is only a small decrease in the rate of created NAT gateways:

```
$ grep 'CreateNatGateway\|DescribeNatGateways\|CreateVpc$' api-consumers-2019-03-2*
api-consumers-2019-03-25.txt: 15585 ec2.DescribeNatGateways
api-consumers-2019-03-25.txt: 1085 ec2.CreateNatGateway
api-consumers-2019-03-25.txt: 152 ec2.CreateVpc
api-consumers-2019-03-26.txt: 66645 ec2.DescribeNatGateways
api-consumers-2019-03-26.txt: 5370 ec2.CreateNatGateway
api-consumers-2019-03-26.txt: 978 ec2.CreateVpc
api-consumers-2019-03-29.txt: 61463 ec2.DescribeNatGateways
api-consumers-2019-03-29.txt: 4466 ec2.CreateNatGateway
api-consumers-2019-03-29.txt: 886 ec2.CreateVpc
```

with API-consumer files generated with [1]. That's 7.1, 5.5, and 5.0 CreateNatGateway calls per VPC for each of the days where I happened to sync down the nightly 24-hour report. 5.5 to 5.0 is not a particularly inspiring reduction, when I had expected 6 -> 3. Spot-checking a recent installer PR, we are only creating three NAT gateways there:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1483/pull-ci-openshift-installer-master-e2e-aws/4798/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T17:27:36Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-02e3e9699e4cda551)"
time="2019-03-29T17:27:57Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m56s (ID: nat-065ed68f2030ecf80)"
time="2019-03-29T17:28:08Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m7s (ID: nat-08101d95f8e9200f6)"
```

Same for origin release-promotion gates:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6317/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T18:30:25Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m58s (ID: nat-0851535a532052f00)"
time="2019-03-29T18:30:43Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 2m16s (ID: nat-0d79a329860f3df66)"
time="2019-03-29T18:30:44Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m16s (ID: nat-0123c2fbfdb2098de)"
```

I'm hunting around to see if I can figure out where all the unexpected CreateNatGateway calls are coming from.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
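The per-day figures quoted above are just the ratio of the two grep tallies. As a quick check (the `calls_per_vpc` helper is mine, but the numbers are straight from the api-consumers files):

```python
def calls_per_vpc(counts):
    """Compute CreateNatGateway calls per CreateVpc call for each day.

    counts: mapping of date -> (CreateNatGateway, CreateVpc) tallies,
    taken from the nightly api-consumers reports quoted above.
    """
    return {day: round(nat / vpc, 1) for day, (nat, vpc) in counts.items()}


print(calls_per_vpc({
    "2019-03-25": (1085, 152),
    "2019-03-26": (5370, 978),
    "2019-03-29": (4466, 886),
}))
# -> {'2019-03-25': 7.1, '2019-03-26': 5.5, '2019-03-29': 5.0}
```

Note this is an aggregate across the whole CI account, so failed installs and retries inflate the ratio above the three gateways a single healthy cluster creates.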
Ok, after we exclude errors, the rate is down to ~4 CreateNatGateway calls per VPC [1]. I'm still not sure why that isn't three...

[1]: https://raw.githubusercontent.com/wking/openshift-release/91cde42964c0764985116dd447fe14a4861d0900/vpcs/calls-per-vpc.png
The previous version of the plot script was confused by the stacked histogram. Here's a fixed graph [1], generated with [2], that shows the drop is from 6 to 3.

[1]: https://raw.githubusercontent.com/wking/openshift-release/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105/vpcs/calls-per-vpc.png
[2]: https://github.com/wking/openshift-release/commit/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105
Verified this bug with openshift-install v4.0.22-201903272149-dirty (extracted from 4.0.0-0.nightly-2019-03-29-144824) against the 4.0.0-0.nightly-2019-03-28-030453 release payload. The issue is not reproduced; moving this to VERIFIED.
Also in this space, [1] landed Saturday with a bump of the route-creation timeout to 20 minutes. That seems to have reduced the aws_route timeouts, but we still see them (with peak usage in the CI account around 1k successful CreateRouteTable calls per two-hour bin).

[1]: https://github.com/openshift/installer/pull/1682
Created attachment 1559996 [details] CreateRouteTable calls from 2019-04-23 through 2019-04-29
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758