Bug 2049108

Summary: openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available'
Product: OpenShift Container Platform Reporter: Greg Sheremeta <gshereme>
Component: Installer Assignee: Aditya Narayanaswamy <anarayan>
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: anarayan, padillon, sdodson
Version: 4.9   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Control plane nodes were sometimes created before the NAT gateway was up.
Consequence: Control plane nodes depend on the gateway and fail during installation if they don't find it.
Fix: Added a clause in Terraform to wait for the NAT gateway to come up.
Result: Successful installation with the NAT gateway created first.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:45:53 UTC Type: Bug

Description Greg Sheremeta 2022-02-01 14:49:25 UTC
openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available'

I believe this is a variation of Bug 2043080

$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify:
IPI

What happened?

See log snip below

What did you expect to happen?
Installer creates the NAT gateway --> Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare

The flow seems to be:
1. The installer asks AWS to create a resource
2. AWS creates it
3. AWS then reports that it doesn't exist
4. Terraform dies


Log snip:
time="2022-01-19T17:24:57Z" level=debug msg="module.dns.aws_route53_record.api_internal_alias[0]: Creation complete after 57s [id=Z022855215ABGYKH12345_api-int.rosacluster1.abcd.p1.openshiftapps.com_A]"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg="Error: Error waiting for NAT Gateway (nat-0f7125846e6512345) to become available: unexpected state 'failed', wanted target 'available'. last error: %!s(<nil>)"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg="  on ../tmp/openshift-install-cluster-417654662/vpc/vpc-public.tf line 85, in resource \"aws_nat_gateway\" \"nat_gw\":"
time="2022-01-19T17:24:57Z" level=error msg="  85: resource \"aws_nat_gateway\" \"nat_gw\" {"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2022-01-19T17:24:58Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=lx4gblk9

Comment 1 Matthew Staebler 2022-02-01 16:56:26 UTC
There are no occurrences of this in CI over the past 14 days.

@gshereme Do you have any insight into whether the NAT gateway was successfully created in AWS? On the surface, this does not look like an eventual-consistency issue in the vein of the others in https://bugzilla.redhat.com/show_bug.cgi?id=2043080. This looks like it could be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1961767, which was a throttling issue where Terraform was never able to get a response from AWS that the Route 53 zone had been created. Having the full install logs would help with the diagnosis (preferably with tracing on and accompanied by CloudTrail data, but that is a tall ask).

Comment 2 Matthew Staebler 2022-02-01 17:10:04 UTC
AWS is reporting that the NAT gateway actually failed. It would be nice if Terraform displayed the failure message returned. I don't know that there is anything the installer can do to avoid failing here, though.

Comment 4 Matthew Staebler 2022-02-02 23:53:54 UTC
Looking deeper into the AWS Terraform provider code, this appears to be a bug in the provider. The provider is looking for a "NatGatewayNotFound" error when AWS actually returns an empty response. So AWS is likely not actually reporting that the NAT gateway failed.

There is a fix for this in the upstream provider.

Additionally, there is a fix upstream for the provider reporting the failure message in the case where the NAT gateway actually does fail.
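
As a rough sketch of what reporting the failure message could look like, the Go snippet below polls a stubbed status function until the gateway is available. It is not the upstream fix: `gatewayStatus`, `errGatewayNotFound`, and `waitForAvailable` are hypothetical names. It assumes the states "pending", "available", and "failed", that a failed gateway carries a failure message, and that a not-found result is retried.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gatewayStatus is a hypothetical, simplified view of a NAT gateway.
type gatewayStatus struct {
	State          string // assumed values: "pending", "available", "failed"
	FailureMessage string // only meaningful when State is "failed"
}

// errGatewayNotFound is a hypothetical sentinel for "not visible yet".
var errGatewayNotFound = errors.New("NAT gateway not found")

// waitForAvailable polls until the gateway is available, retrying while it
// is pending or not yet visible, and surfacing the failure message if the
// gateway ends up in the "failed" state.
func waitForAvailable(poll func() (gatewayStatus, error), attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		status, err := poll()
		switch {
		case errors.Is(err, errGatewayNotFound):
			// Not visible yet; fall through to the sleep and retry.
		case err != nil:
			return err
		case status.State == "available":
			return nil
		case status.State == "failed":
			return fmt.Errorf("NAT gateway failed: %s", status.FailureMessage)
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("timed out after %d attempts", attempts)
}

func main() {
	// Stubbed poll function that always reports a failed gateway.
	failed := func() (gatewayStatus, error) {
		return gatewayStatus{State: "failed", FailureMessage: "example failure"}, nil
	}
	fmt.Println(waitForAvailable(failed, 3, 10*time.Millisecond))
}
```

Carrying the failure message into the returned error replaces the uninformative "last error: %!s(<nil>)" seen in the log with the cause AWS reported.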

Comment 10 Yunfei Jiang 2022-06-23 07:54:31 UTC
The error was not found in the recent CI logs.

Comment 11 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069