The install fails due to a timeout waiting for a Route53 hosted zone to become available. Can we increase this timeout or do some better retry?

sample job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1394661240285958144

showing up a fair bit in CI: https://search.ci.openshift.org/?search=Error%3A+error+waiting+for+Route53+Hosted+Zone&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version: 4.8

Platform: AWS
Please specify: IPI

What happened?
The install failed due to a timeout waiting on Route53 setup.

INFO[2021-05-18T16:05:34Z] level=error
INFO[2021-05-18T16:05:34Z] level=error msg=Error: error waiting for Route53 Hosted Zone (Z04954351LS1QLYW61YZ4) creation: timeout while waiting for state to become 'INSYNC' (last state: 'PENDING', timeout: 15m0s)
INFO[2021-05-18T16:05:34Z] level=error
INFO[2021-05-18T16:05:34Z] level=error msg=  on ../tmp/openshift-install-577788877/route53/base.tf line 22, in resource "aws_route53_zone" "new_int":
INFO[2021-05-18T16:05:34Z] level=error msg=  22: resource "aws_route53_zone" "new_int" {
INFO[2021-05-18T16:05:34Z] level=error
INFO[2021-05-18T16:05:34Z] level=error
INFO[2021-05-18T16:05:34Z] level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
I don't know that there is anything we should do here. If creating a DNS zone took 15 minutes, then I don't think that waiting longer or retrying is going to help. The timeouts and retries come from the terraform provider, so any changes would need to be made in that provider.
If it takes AWS more than 15m to allocate a hosted zone, I think the solution is "open a case with AWS to complain", not "make the installer more relaxed".
1) Then perhaps we should open the ticket. It's not happening a lot, but it's certainly not zero and it's pretty consistent. Narrowing the search to only AWS jobs, 0.23% of all our AWS jobs have failed in this way in the last 2 weeks, and 0.28% in the last 2 days. So this is not "AWS had a bad day"; this is "normal" behavior that's causing 6 jobs a day to unnecessarily fail and be thrown away (and potentially cost someone time to look at the failure, decide it's "benign", and retest as needed).

2) This is another case where I also think our CI jobs need to be smarter. If our final conclusion on this is "nothing we can do / nothing to see here", then we need to find a way for our CI system to throw the job result away and rerun it, so that no one has to look at these failures, retest their PR, or have it treated as a failure to accept a payload.
This happens for vSphere CI jobs as well. We use Route53 for DNS in IPI (VIPs) and UPI.
I looked into this some more. From the events captured in CloudTrail, this looks to be a throttling issue.

I considered the cluster created in the following job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1404449526239989760

The hosted zone Z0156195ONMDQWXRHFYR was created at 2021-06-14T15:42:27Z. During the 20-minute time period starting at 2021-06-14T15:00:00Z, there were 592 total GetChange [1] requests. Of those, 453 were rejected by AWS due to throttling. There was never a successful GetChange request for the hosted zone in question.

[1] The AWS terraform provider uses GetChange to determine when the hosted zone has changed its status to INSYNC. The CreateHostedZone response includes a change ID that is then used in the subsequent GetChange requests. The terraform provider waits 30 seconds after the successful CreateHostedZone request and then polls GetChange for 15 minutes, using an exponential backoff starting at 2 seconds and capping at 10 seconds.
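For illustration, here is a minimal, self-contained Go sketch of the wait loop described in [1], written against aws-sdk-go v1. It is not the terraform provider's actual code, and the change ID in main() is a placeholder, but it mirrors the described behavior (30-second initial delay, 2s-10s exponential backoff, 15-minute deadline) and shows why sustained throttling of GetChange, as seen in the CloudTrail data above, can exhaust the whole 15-minute budget without the zone ever being observed as INSYNC.

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

// waitForZoneInsync polls GetChange for the given change ID until it reports
// INSYNC, using a 30s initial delay, a 2s backoff capped at 10s, and a 15m deadline.
func waitForZoneInsync(svc *route53.Route53, changeID string) error {
	time.Sleep(30 * time.Second) // delay after the CreateHostedZone call succeeds

	deadline := time.Now().Add(15 * time.Minute)
	backoff := 2 * time.Second

	for time.Now().Before(deadline) {
		out, err := svc.GetChange(&route53.GetChangeInput{Id: aws.String(changeID)})
		if err != nil {
			// A throttled GetChange lands here; this sketch simply keeps polling,
			// so sustained throttling burns the entire 15m budget.
			fmt.Printf("GetChange failed, will retry: %v\n", err)
		} else if aws.StringValue(out.ChangeInfo.Status) == route53.ChangeStatusInsync {
			return nil
		}

		time.Sleep(backoff)
		if backoff *= 2; backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
	}
	return fmt.Errorf("timeout while waiting for change %s to become INSYNC", changeID)
}

func main() {
	svc := route53.New(session.Must(session.NewSession()))
	// Placeholder change ID; in practice it comes from the CreateHostedZone response.
	if err := waitForZoneInsync(svc, "/change/C0123456789EXAMPLE"); err != nil {
		fmt.Println(err)
	}
}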
Thanks Matt. Is the implication then that the hosted zone actually did get created/ready, but we just never successfully made a GetChange request to see that status reflected?
(In reply to Ben Parees from comment #12)
> Thanks Matt. Is the implication then that the hosted zone actually did get
> created/ready, but we just never successfully made a GetChange request to
> see that status reflected?

Yes.
Great, in that case the account sharding work that Trevor mentioned should help here. I'm going to close this out and try to be patient on that :)