Bug 2049108 - openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available'
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: Aditya Narayanaswamy
QA Contact: Yunfei Jiang
Depends On:
Reported: 2022-02-01 14:49 UTC by Greg Sheremeta
Modified: 2022-08-10 10:46 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Control plane nodes were sometimes created before the NAT gateway was up.
Consequence: Control plane nodes depend on the gateway and fail during installation if they cannot find it.
Fix: Added a clause in Terraform to wait for the NAT gateway to come up.
Result: Installation succeeds because the NAT gateway is created before the control plane nodes.
Clone Of:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 6016 | 0 | None | open | Bug 2049108: Fix network interface not found error | 2022-06-16 13:00:19 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:46:06 UTC

Description Greg Sheremeta 2022-02-01 14:49:25 UTC
openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available'

I believe this is a variation of Bug 2043080.

$ openshift-install version

Platform: AWS -- OSD and ROSA, specifically

Please specify:

What happened?

See log snip below

What did you expect to happen?
Installer creates the NAT gateway --> Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare.

Flow seems to be:
1. Installer creates a thing
2. AWS creates it
3. AWS says it doesn't exist
4. Terraform dies

Log snip:
time="2022-01-19T17:24:57Z" level=debug msg="module.dns.aws_route53_record.api_internal_alias[0]: Creation complete after 57s [id=Z022855215ABGYKH12345_api-int.rosacluster1.abcd.p1.openshiftapps.com_A]"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg="Error: Error waiting for NAT Gateway (nat-0f7125846e6512345) to become available: unexpected state 'failed', wanted target 'available'. last error: %!s(<nil>)"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error msg="  on ../tmp/openshift-install-cluster-417654662/vpc/vpc-public.tf line 85, in resource \"aws_nat_gateway\" \"nat_gw\":"
time="2022-01-19T17:24:57Z" level=error msg="  85: resource \"aws_nat_gateway\" \"nat_gw\" {"
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=error
time="2022-01-19T17:24:57Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2022-01-19T17:24:58Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=lx4gblk9
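
For anyone triaging install logs for this failure, the signature above can be matched mechanically. The following is a minimal sketch, not part of the installer: the function name find_nat_gateway_failure and the regular expression are hypothetical, and the pattern assumes the error message keeps the exact wording shown in the log snip.

```python
import re
from typing import Optional

# Pattern assumed from the log line in this report; other provider
# versions may word the message differently.
NAT_GW_ERROR = re.compile(
    r"Error waiting for NAT Gateway \((?P<id>nat-[0-9a-z]+)\) to become available: "
    r"unexpected state '(?P<state>[^']+)', wanted target '(?P<target>[^']+)'"
)


def find_nat_gateway_failure(line: str) -> Optional[dict]:
    """Return the gateway id, observed state and wanted state, or None if the line does not match."""
    match = NAT_GW_ERROR.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        r'''time="2022-01-19T17:24:57Z" level=error msg="Error: Error waiting for '''
        r'''NAT Gateway (nat-0f7125846e6512345) to become available: unexpected state '''
        r''''failed', wanted target 'available'. last error: %!s(<nil>)"'''
    )
    print(find_nat_gateway_failure(sample))
```

Running this helper over each line of an install log gives a quick way to count occurrences of the failure and to collect the affected gateway IDs.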

Comment 1 Matthew Staebler 2022-02-01 16:56:26 UTC
There are no occurrences of this in CI over the past 14 days.

@gshereme Do you have any insight into whether the NAT gateway was successfully created in AWS? On the surface, this does not look like an eventual-consistency issue in the vein of the others in https://bugzilla.redhat.com/show_bug.cgi?id=2043080. This looks like it could be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1961767, which was a throttling issue where Terraform was never able to get a response from AWS confirming that the Route 53 zone had been created. Having the full install logs would help with the diagnosis (preferably with tracing on and accompanied by CloudTrail data, but that is a tall ask).

Comment 2 Matthew Staebler 2022-02-01 17:10:04 UTC
AWS is reporting that the NAT gateway actually failed. It would be nice if terraform displayed the failure message returned. I don't know that there is anything that the installer can do to not fail here, though.

Comment 4 Matthew Staebler 2022-02-02 23:53:54 UTC
Looking deeper into the AWS Terraform provider code, this appears to be a bug in the provider. The provider is looking for a "NatGatewayNotFound" error when AWS actually returns an empty response. AWS is likely not actually reporting that the NAT gateway failed.

There is a fix for this in the upstream provider.

Additionally, there is a fix upstream for the provider reporting the failure message in the case where the NAT gateway actually does fail.
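
The behaviour described in this comment can be illustrated with a small sketch. This is not the provider's code (the provider is written in Go); it is a hypothetical Python model, assuming that a describe call may either raise a "NatGatewayNotFound" error or return an empty list while the gateway is still being created. All names here (NatGateway, refresh_state, wait_until_available, the describe callable) are invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class NatGatewayNotFound(Exception):
    """Hypothetical stand-in for the AWS 'NatGatewayNotFound' API error."""


@dataclass
class NatGateway:
    state: str  # e.g. "pending", "available", "failed"
    failure_message: Optional[str] = None


def refresh_state(
    describe: Callable[[str], List[NatGateway]], gateway_id: str
) -> Tuple[Optional[NatGateway], str]:
    """Return (gateway, state), treating an empty response like a not-found error."""
    try:
        gateways = describe(gateway_id)
    except NatGatewayNotFound:
        return None, "not_found"
    if not gateways:
        # Assumption drawn from this comment: AWS can answer with an empty
        # result instead of raising NatGatewayNotFound.
        return None, "not_found"
    return gateways[0], gateways[0].state


def wait_until_available(describe, gateway_id: str, attempts: int = 10, delay: float = 0.0) -> NatGateway:
    """Poll until the gateway is available; include the failure message if it fails."""
    for _ in range(attempts):
        gateway, state = refresh_state(describe, gateway_id)
        if state == "available":
            return gateway
        if state == "failed":
            reason = gateway.failure_message or "no failure message returned"
            raise RuntimeError(f"NAT Gateway ({gateway_id}) failed: {reason}")
        # "not_found" and "pending" are both treated as "keep waiting".
        time.sleep(delay)
    raise TimeoutError(f"NAT Gateway ({gateway_id}) did not become available")
```

The sketch covers both points raised here: an empty response is handled the same way as the not-found error rather than being mistaken for a failure, and a genuine failure carries the message returned by AWS instead of a bare state name.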

Comment 10 Yunfei Jiang 2022-06-23 07:54:31 UTC
The error was not found in the recent CI logs.

Comment 11 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

