Bug 1961767 - Installer times out on route53
Summary: Installer times out on route53
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-18 16:31 UTC by Ben Parees
Modified: 2021-06-22 22:31 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-14 18:26:36 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2021-05-18 16:31:23 UTC
The install fails due to a timeout waiting on a Route53 hosted zone.  Can we increase this timeout or add better retry logic?

sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1394661240285958144

showing up a fair bit in CI:
https://search.ci.openshift.org/?search=Error%3A+error+waiting+for+Route53+Hosted+Zone&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Version:
4.8

Platform:
AWS

Please specify:
IPI

What happened?

The install failed due to a timeout waiting on Route53 setup.



INFO[2021-05-18T16:05:34Z] level=error                                  
INFO[2021-05-18T16:05:34Z] level=error msg=Error: error waiting for Route53 Hosted Zone (Z04954351LS1QLYW61YZ4) creation: timeout while waiting for state to become 'INSYNC' (last state: 'PENDING', timeout: 15m0s) 
INFO[2021-05-18T16:05:34Z] level=error                                  
INFO[2021-05-18T16:05:34Z] level=error msg=  on ../tmp/openshift-install-577788877/route53/base.tf line 22, in resource "aws_route53_zone" "new_int": 
INFO[2021-05-18T16:05:34Z] level=error msg=  22: resource "aws_route53_zone" "new_int" { 
INFO[2021-05-18T16:05:34Z] level=error                                  
INFO[2021-05-18T16:05:34Z] level=error                                  
INFO[2021-05-18T16:05:34Z] level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
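For reference, the "waiting for state to become 'INSYNC'" check above boils down to inspecting the status in a Route53 GetChange response. A minimal sketch of that check, using the response shape boto3's `route53.get_change` returns (the sample dicts and change ID below are illustrative, not data from the failing job):

```python
# Hypothetical helper: decide whether a Route53 GetChange response
# indicates the hosted zone change has finished propagating.
# The {"ChangeInfo": {"Status": ...}} shape matches boto3's
# route53.get_change response; the samples below are made up.

def change_is_insync(response: dict) -> bool:
    """Return True once the change status has moved from PENDING to INSYNC."""
    return response.get("ChangeInfo", {}).get("Status") == "INSYNC"

# With live credentials the response would come from something like:
#   boto3.client("route53").get_change(Id="/change/C123EXAMPLE")
sample_pending = {"ChangeInfo": {"Id": "/change/C123EXAMPLE", "Status": "PENDING"}}
sample_insync = {"ChangeInfo": {"Id": "/change/C123EXAMPLE", "Status": "INSYNC"}}

print(change_is_insync(sample_pending))  # False
print(change_is_insync(sample_insync))   # True
```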

Comment 1 Matthew Staebler 2021-05-19 02:46:58 UTC
I don't know that there is anything we should do here. If creating a DNS zone took 15 minutes, then I don't think that waiting longer or retrying is going to help.

The timeouts and retries come from the terraform provider. So any changes would need to be made in that provider.

Comment 2 W. Trevor King 2021-05-19 02:54:59 UTC
If it takes AWS more than 15m to allocate a hosted zone, I think the solution is "open a case with AWS to complain", not "make the installer more relaxed".

Comment 3 Ben Parees 2021-05-19 03:20:32 UTC
1) then perhaps we should open the ticket.  It's not happening a lot, but it's certainly not zero and it's pretty consistent.  Narrowing the search to only AWS jobs, 0.23% of all our AWS jobs have failed this way in the last 2 weeks, and 0.28% in the last 2 days.  So this is not "aws had a bad day"; this is "normal" behavior that's causing 6 jobs a day to unnecessarily fail and be a throwaway (and potentially cost someone time to look at the failure, decide it's "benign", and retest as needed).
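A back-of-envelope check of the numbers above (the daily AWS job total is inferred from these figures, not taken from CI data):

```python
# 6 failed jobs/day at a 0.28% failure rate implies roughly this many
# AWS CI jobs per day. Both inputs come from the comment above; the
# implied total is an inference, not a measured count.
failures_per_day = 6
failure_rate = 0.0028  # 0.28%
implied_jobs_per_day = failures_per_day / failure_rate
print(round(implied_jobs_per_day))  # ~2143
```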


2) this is another case where I also think our CI jobs need to be smarter.  If our final conclusion on this is "nothing we can do/nothing to see here", then we need to find a way for our CI system to throw the job result away and rerun it, so that no one has to look at these failures, retest their PR, or have it treated as a failure to accept a payload.

Comment 4 Joseph Callen 2021-05-20 16:57:20 UTC
This happens for vSphere CI jobs as well. We use Route53 for DNS in IPI (VIPs) and UPI.

Comment 11 Matthew Staebler 2021-06-14 18:14:20 UTC
I looked into this some more. From the events captured in CloudTrails, this looks to be a throttling issue.

I considered the cluster created in the following job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1404449526239989760
The hosted zone Z0156195ONMDQWXRHFYR was created at 2021-06-14T15:42:27Z.
During the 20-minute time period starting at 2021-06-14T15:00:00Z, there were 592 total GetChange [1] requests. Of those, 453 of them were rejected by AWS due to throttling. There was never a successful GetChange request for the hosted zone in question.

[1] The AWS terraform provider uses GetChange to determine when the hosted zone has changed its status to INSYNC. The CreateHostedZone response includes a change ID that is then used in the subsequent GetChange requests. The terraform provider waits 30 seconds after the successful CreateHostedZone request then polls GetChange for 15 minutes using an exponential backoff starting at 2 seconds and capping at 10 seconds.
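The polling behavior described in [1] can be sketched as follows. This is an illustrative model, not the provider's actual Go code; the 30 s initial delay, 15 m deadline, and 2 s→10 s exponential backoff are the figures from this comment, and `get_status` stands in for the GetChange call:

```python
import time

def wait_for_insync(get_status, initial_wait=30, timeout=15 * 60,
                    backoff_start=2, backoff_cap=10, sleep=time.sleep):
    """Poll get_status() until it returns 'INSYNC' or the deadline passes.

    Models the behavior described above: wait 30 s after CreateHostedZone,
    then poll GetChange with exponential backoff starting at 2 s and
    capped at 10 s, for up to 15 minutes total. get_status is a stand-in
    for a GetChange call; sleep is injectable so the loop can be
    simulated without real waiting.
    """
    sleep(initial_wait)
    elapsed = initial_wait
    deadline = initial_wait + timeout
    delay = backoff_start
    while elapsed < deadline:
        # In the real provider, a throttled GetChange would fail here,
        # burning the window without ever observing INSYNC.
        if get_status() == "INSYNC":
            return True
        sleep(delay)
        elapsed += delay
        delay = min(delay * 2, backoff_cap)
    return False

# Simulated run: the status flips to INSYNC on the third poll.
statuses = iter(["PENDING", "PENDING", "INSYNC"])
waited = []
ok = wait_for_insync(lambda: next(statuses), sleep=waited.append)
print(ok, waited)  # True [30, 2, 4]
```

Note that under heavy throttling the backoff caps at 10 s, so the provider keeps issuing (and getting rejected on) GetChange calls for the full 15 minutes, which matches the 592-requests/453-rejected pattern observed above.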

Comment 12 Ben Parees 2021-06-14 18:16:32 UTC
Thanks Matt.  Is the implication then that the hosted zone actually did get created/ready, but we just never successfully made a GetChange request to see that status reflected?

Comment 13 Matthew Staebler 2021-06-14 18:20:45 UTC
(In reply to Ben Parees from comment #12)
> Thanks Matt.  Is the implication then that the hosted zone actually did get
> created/ready, but we just never successfully made a GetChange request to
> see that status reflected?

Yes.

Comment 14 Ben Parees 2021-06-14 18:26:36 UTC
great, in that case the account sharding work that Trevor mentioned should help here.

i'm going to close this out and try to be patient on that :)

