openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available'

I believe this is a variation of Bug 2043080

$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify: IPI

What happened?
See log snip below

What did you expect to happen?
Installer creates the NAT gateway --> Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare

Flow seems to be:
1 Installer creates a thing
2 AWS creates it
3 AWS says it doesn't exist
4 Terraform dies

Log snip:
time="2022-01-19T17:24:57Z" level=debug msg="module.dns.aws_route53_record.api_internal_alias[0]: Creation complete after 57s [id=Z022855215ABGYKH12345_api-int.rosacluster1.abcd.p1.openshiftapps.com_A]"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg="Error: Error waiting for NAT Gateway (nat-0f7125846e6512345) to become available: unexpected state 'failed', wanted target 'available'. last error: %!s(<nil>)"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg=" on ../tmp/openshift-install-cluster-417654662/vpc/vpc-public.tf line 85, in resource \"aws_nat_gateway\" \"nat_gw\":"
time="2022-01-19T17:24:57Z" level=error msg=" 85: resource \"aws_nat_gateway\" \"nat_gw\" {"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2022-01-19T17:24:58Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=lx4gblk9
There are no occurrences of this in CI over the past 14 days. @gshereme Do you have any insight into whether the NAT gateway was successfully created in AWS? On the surface, this does not look like an eventual-consistency issue in the vein of the others in https://bugzilla.redhat.com/show_bug.cgi?id=2043080. It looks like it could be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1961767, which was a throttling issue where Terraform was never able to get a response from AWS confirming that the Route 53 zone had been created. Having the full install logs would help with the diagnosis (preferably with tracing on and accompanied by CloudTrail data, but that is a tall ask).
AWS is reporting that the NAT gateway actually failed. It would be nice if Terraform displayed the failure message that AWS returned. I don't know that there is anything the installer can do to avoid failing here, though.
Looking deeper into the AWS Terraform provider code, this appears to be a bug in the provider. The provider looks for a "NatGatewayNotFound" error, but AWS actually returns an empty response. So AWS is likely not reporting that the NAT gateway failed. There is a fix for this in the upstream provider. Additionally, there is an upstream fix so that the provider reports the failure message in the case where the NAT gateway actually does fail.
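To illustrate the two behaviours described above, here is a rough Python sketch of a waiter that treats an empty describe response as "not visible yet" rather than as a failure, and that includes the failure message when the gateway really does fail. This is not the provider's code (the provider is written in Go), and it is not necessarily how the upstream fix is implemented. The names `classify`, `wait_for_available` and `NatGatewayFailed` are hypothetical. The `describe` callable stands in for an EC2 DescribeNatGateways call and is assumed to return a dict with a `NatGateways` list whose entries carry `State`, `FailureCode` and `FailureMessage`.

```python
import time


class NatGatewayFailed(Exception):
    """Raised when AWS reports the NAT gateway in the 'failed' state."""


def classify(response):
    """Return (gateway, state) for a DescribeNatGateways-style response.

    An empty result is reported as 'pending' so the caller keeps polling
    instead of assuming the gateway failed or does not exist.
    """
    gateways = response.get("NatGateways") or []
    if not gateways:
        return None, "pending"
    gateway = gateways[0]
    return gateway, gateway["State"]


def wait_for_available(describe, nat_id, attempts=10, delay=5.0, sleep=time.sleep):
    """Poll describe(nat_id) until the gateway is available.

    `describe` is a hypothetical callable; the attempt count and delay
    are illustrative values, not the provider's actual timeouts.
    """
    for _ in range(attempts):
        gateway, state = classify(describe(nat_id))
        if state == "available":
            return gateway
        if state == "failed":
            # Surface what AWS said instead of a bare "unexpected state".
            code = gateway.get("FailureCode")
            message = gateway.get("FailureMessage")
            raise NatGatewayFailed(f"{nat_id} failed: {code}: {message}")
        sleep(delay)
    raise TimeoutError(f"{nat_id} not available after {attempts} attempts")
```

With a waiter shaped like this, step 3 of the flow in the report ("AWS says it doesn't exist") would lead to another poll rather than an immediate Terraform error, and a genuine failure would show the AWS failure message in place of `last error: %!s(<nil>)`.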
The error was not found in the recent CI logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069