Bug 2049108
| Summary: | openshift-installer intermittent failure on AWS with 'Error: Error waiting for NAT Gateway (nat-xxxxx) to become available' | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Greg Sheremeta <gshereme> |
| Component: | Installer | Assignee: | Aditya Narayanaswamy <anarayan> |
| Installer sub component: | openshift-installer | QA Contact: | Yunfei Jiang <yunjiang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | anarayan, padillon, sdodson |
| Version: | 4.9 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Control plane nodes were sometimes created before the NAT gateway was up.
Consequence: Control plane nodes depend on the gateway and fail during installation if it is not available.
Fix: Added a clause in Terraform to wait for the NAT gateway to come up.
Result: Installation succeeds because the NAT gateway is created first.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-10 10:45:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Greg Sheremeta
2022-02-01 14:49:25 UTC
There are no occurrences of this in CI over the past 14 days. @gshereme Do you have any insight into whether the NAT gateway was successfully created in AWS? On the surface, this does not look like an eventual-consistency issue in the vein of the others in https://bugzilla.redhat.com/show_bug.cgi?id=2043080. This looks like it could be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1961767, which was a throttling issue where Terraform was never able to get a response from AWS that the Route 53 zone had been created. Having the full install logs would help with the diagnosis (preferably with tracing on and accompanied by CloudTrail data, but that is a tall ask).

AWS is reporting that the NAT gateway actually failed. It would be nice if Terraform displayed the failure message returned. I don't know that there is anything that the installer can do to not fail here, though.

Looking deeper into the AWS Terraform provider code, this appears to be a bug in the provider. The provider is looking for a "NatGatewayNotFound" error when AWS actually returns an empty response. AWS is likely not returning that the NAT gateway failed. There is a fix for this in the upstream provider. Additionally, there is a fix upstream for the provider reporting the failure message in the case where the NAT gateway actually does fail.

The error was not found in the recent CI logs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069
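
For anyone reading this later, the provider behavior discussed above (looking for a "NatGatewayNotFound" error while AWS returns an empty response, and not surfacing the failure message) can be illustrated with a minimal Python sketch. This is not the provider's Go code and not the upstream fix. `NatGateway`, `refresh_state`, `wait_for_available`, and the state strings are hypothetical names chosen for illustration, and `describe` stands in for whatever call lists the gateway. It shows one plausible handling: treat an empty response the same as a not-found error and keep polling, and include the failure message when the gateway reports a failed state.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class NatGatewayNotFound(Exception):
    """Hypothetical stand-in for the "NatGatewayNotFound" API error."""


@dataclass
class NatGateway:
    # Hypothetical model; the state strings are assumptions for this sketch.
    state: str
    failure_message: Optional[str] = None


def refresh_state(
    describe: Callable[[], List[NatGateway]],
) -> Tuple[Optional[NatGateway], str]:
    """Return (gateway, state) for one poll of the describe call."""
    try:
        gateways = describe()
    except NatGatewayNotFound:
        return None, "not_found"
    if not gateways:
        # Empty response: handle it like the not-found error instead of
        # assuming the API always raises when the gateway is not visible.
        return None, "not_found"
    gateway = gateways[0]
    return gateway, gateway.state


def wait_for_available(
    describe: Callable[[], List[NatGateway]],
    attempts: int = 10,
    delay: float = 0.0,
) -> NatGateway:
    """Poll until the gateway is available, failing loudly on 'failed'."""
    for _ in range(attempts):
        gateway, state = refresh_state(describe)
        if state == "available":
            return gateway
        if state == "failed":
            # Surface the message returned with the failed gateway.
            raise RuntimeError(f"NAT gateway failed: {gateway.failure_message}")
        time.sleep(delay)
    raise TimeoutError("NAT gateway did not become available in time")
```

In this sketch a caller would pass a function that wraps the real describe call. An empty list, a `NatGatewayNotFound`, and a pending state all lead to another poll, so only a failed state or exhausted attempts ends the wait with an error.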