Bug 2019977 - Installer doesn't validate region causing binary to hang with a 60 minute timeout
Summary: Installer doesn't validate region causing binary to hang with a 60 minute tim...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.10.0
Assignee: Aditya Narayanaswamy
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-11-03 18:37 UTC by Will Gordon
Modified: 2022-03-10 16:25 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
If the given region for AWS is invalid, the installer tries to validate the availability zones in the given region and has no timeout causing the installer to stall for 60 minutes. Added a check for validating the region and also the service endpoints before the availability zones to reduce the time taken by the installer binary to report the error.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:24:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Pull secret has been sanitized (769 bytes, text/plain)
2021-11-03 18:37 UTC, Will Gordon
no flags Details
install log (4.22 KB, text/plain)
2021-11-08 23:16 UTC, Will Gordon
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 5432 0 None open Bug 2019977: Validate region provided in install config 2021-11-30 16:25:10 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:25:11 UTC

Description Will Gordon 2021-11-03 18:37:05 UTC
Created attachment 1839717 [details]
Pull secret has been sanitized

Version:

$ openshift-install version
./openshift-install 4.9.4
built from commit 6e5b992ba719dd4ea2d0c2a8b08ecad45179e553
release image quay.io/openshift-release-dev/ocp-release@sha256:3d5800990dee7cd4727d3fe238a97e2d2976d3808fc925ada29c559a47e2e1ef
release architecture amd64

Platform: aws (likely equally valid for the other cloud platforms)

Please specify: IPI



What happened?

Providing an invalid AWS region for `openshift-install`, e.g., us-xyz-1, causes the installer to hang for 60 minutes until finally timing out.



What did you expect to happen?

The installer should be able to detect an invalid region, fail fast, and warn the user accordingly



How to reproduce it (as minimally and precisely as possible)?

(refer to the attached install-config.yaml for what to include in the "test" directory)

$ ./openshift-install create manifests --dir test



Anything else we need to know?

Output from above command with --log-level debug, with no indication that the command is hanging due to the region

```
DEBUG OpenShift Installer 4.9.4
DEBUG Built from commit 6e5b992ba719dd4ea2d0c2a8b08ecad45179e553
DEBUG Fetching Master Machines...
DEBUG Loading Master Machines...
DEBUG   Loading Cluster ID...
DEBUG     Loading Install Config...
DEBUG       Loading SSH Key...
DEBUG       Using SSH Key loaded from state file
DEBUG       Loading Base Domain...
DEBUG         Loading Platform...
DEBUG         Using Platform loaded from state file
DEBUG       Using Base Domain loaded from state file
DEBUG       Loading Cluster Name...
DEBUG         Loading Base Domain...
DEBUG         Loading Platform...
DEBUG       Using Cluster Name loaded from state file
DEBUG       Loading Networking...
DEBUG         Loading Platform...
DEBUG       Using Networking loaded from state file
DEBUG       Loading Pull Secret...
DEBUG       Using Pull Secret loaded from state file
DEBUG       Loading Platform...
DEBUG     Loading Install Config from both state file and target directory
DEBUG     Using Install Config loaded from target directory
DEBUG   Loading Platform Credentials Check...
DEBUG     Loading Install Config...
DEBUG   Loading Install Config...
DEBUG   Loading Image...
DEBUG     Loading Install Config...
DEBUG   Loading Master Ignition Config...
DEBUG     Loading Install Config...
DEBUG     Loading Root CA...
DEBUG   Fetching Cluster ID...
DEBUG     Fetching Install Config...
DEBUG     Reusing previously-fetched Install Config
DEBUG   Generating Cluster ID...
DEBUG   Fetching Platform Credentials Check...
DEBUG     Fetching Install Config...
DEBUG     Reusing previously-fetched Install Config
DEBUG   Generating Platform Credentials Check...
INFO Credentials loaded from the "default" profile in file "/Users/wgordon/.aws/credentials"
DEBUG   Fetching Install Config...
DEBUG   Reusing previously-fetched Install Config
DEBUG   Fetching Image...
DEBUG     Fetching Install Config...
DEBUG     Reusing previously-fetched Install Config
DEBUG   Generating Image...
DEBUG   Fetching Master Ignition Config...
DEBUG     Fetching Install Config...
DEBUG     Reusing previously-fetched Install Config
DEBUG     Fetching Root CA...
DEBUG     Generating Root CA...
DEBUG   Generating Master Ignition Config...
DEBUG Generating Master Machines...
```

Comment 3 Matthew Staebler 2021-11-08 17:36:49 UTC
A possible resolution for this would be to fetch the regions from AWS in order to perform validation that the region chosen is actually an AWS region. The only value in that would be to avoid the 1-minute DNS resolution timeout. I suspect that we will never get around to making this fix.

Comment 4 Matthew Staebler 2021-11-08 17:39:00 UTC
(In reply to Matthew Staebler from comment #3)
> A possible resolution for this would be to fetch the regions from AWS in
> order to perform validation that the region chosen is actually an AWS
> region. The only value in that would be to avoid the 1-minute DNS resolution
> timeout. I suspect that we will never get around to making this fix.

I misread the title. This is a 60 *minute* timeout? Could you please attach the logs from the installer? It seems like there may be retries happening that should not be.

Comment 5 Matthew Staebler 2021-11-08 19:38:34 UTC
The length of the timeout is due to the installer allowing for 25 retries before failing. Presumably there is some backoff with each retry. Changing the max retries to 3 allows the installer to fail in 1 second.

Comment 6 Will Gordon 2021-11-08 23:16:40 UTC
Created attachment 1840769 [details]
install log

Comment 8 Yunfei Jiang 2021-12-27 06:47:51 UTC
verified. PASS.
OCP Version: 4.10.0-0.nightly-2021-12-25-025639

time="2021-12-27T01:33:25-05:00" level=debug msg="OpenShift Installer 4.10.0-0.nightly-2021-12-25-025639"
time="2021-12-27T01:33:25-05:00" level=debug msg="Built from commit 37c09290190e5dd00446b603c32792c06d205b62"
time="2021-12-27T01:33:25-05:00" level=debug msg="Fetching Metadata..."
time="2021-12-27T01:33:25-05:00" level=debug msg="Loading Metadata..."
time="2021-12-27T01:33:25-05:00" level=debug msg="  Loading Cluster ID..."
time="2021-12-27T01:33:25-05:00" level=debug msg="    Loading Install Config..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading SSH Key..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading Base Domain..."
time="2021-12-27T01:33:25-05:00" level=debug msg="        Loading Platform..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading Cluster Name..."
time="2021-12-27T01:33:25-05:00" level=debug msg="        Loading Base Domain..."
time="2021-12-27T01:33:25-05:00" level=debug msg="        Loading Platform..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading Networking..."
time="2021-12-27T01:33:25-05:00" level=debug msg="        Loading Platform..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading Pull Secret..."
time="2021-12-27T01:33:25-05:00" level=debug msg="      Loading Platform..."
time="2021-12-27T01:33:25-05:00" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/cloud-user/.aws/credentials\""
time="2021-12-27T01:33:25-05:00" level=fatal msg="failed to fetch Metadata: failed to load asset \"Install Config\": platform.aws.serviceEndpoints.region: Invalid value: \"us-xyzz-2\": dial tcp: lookup ec2.us-xyzz-2.amazonaws.com on 10.11.5.19:53: no such host"

Comment 12 errata-xmlrpc 2022-03-10 16:24:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056


Note You need to log in before you can comment on or make changes to this bug.