Description of problem:
If an unreachable endpoint is provided in install-config.yaml, the fatal message appears only after approximately 1 hour, which is too long.

Install-config:

    <—snip—>
    platform:
      aws:
        serviceEndpoints:
        - name: ec2
          url: https://unreachable.us-gov-west-1.amazonaws.com
    <—snip—>

    time="2020-09-14T06:22:46Z" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/ec2-user/.aws/credentials\""
    time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
    time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-210224

How reproducible:
100%

Steps to Reproduce:
1. Create install-config, set region to us-gov-west-1, and add an invalid service endpoint:

    <—snip—>
    platform:
      aws:
        serviceEndpoints:
        - name: ec2
          url: https://unreachable.us-gov-west-1.amazonaws.com
    <—snip—>

2. Create manifests
3.

Actual results:

    time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
    time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"

Expected results:
Report the error within a reasonable time, maybe 5 minutes.

Additional info:
The installer relies on the AWS SDK to retry failing requests. The SDK treats most network failures as retryable, and we can't fail fast because there are definitely cases where DNS might be glitching. So I do not think this is a bug that needs to be fixed.

If users are setting up invalid endpoints, testing for _invalidity_ is a rabbit hole, though we can decide to validate certain cases. For example: is the endpoint reachable at all, i.e. can you at least resolve and connect to it? Anything more substantial, like checking whether it is actually an ec2 endpoint, is not really possible.

Secondly, AWS service endpoints can be set up such that access is allowed or rejected based on certain parameters in the requests, like source network or constraint tags. So this is not as clear-cut.

Tracking in https://issues.redhat.com/browse/CORS-1546
*** Bug 1948036 has been marked as a duplicate of this bug. ***
[QA Summary]

[Installer Version]
~~~
17:11:34 ./openshift-install 4.8.0-0.ci-2021-04-30-142055
17:11:34 built from commit da0ed4e925c093ed6e049c11d2bc68d562cc8d54
17:11:34 release image registry.ci.openshift.org/ocp/release@sha256:d2a3cc0c58f5ae31e27f8182037aee1b0ab5f91a2e2256ffbb50db20546bf97d
~~~
NOTE: The latest accepted nightly version "4.8.0-0.nightly-2021-04-30-102231" doesn't contain the commit "da0ed4e92" yet, therefore I've used "4.8.0-0.ci-2021-04-30-142055" for testing.

[Installer Parameters]
~~~
serviceEndpoints:
- name: ec2
  url: https://unreachable.us-gov-west-1.amazonaws.com/
~~~

[Results]
As expected, the installation now aborts early, after first checking the endpoints:
~~~
17:58:18 [INFO] Generating manifests files.....
17:58:18 level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": platform.aws.serviceEndpoints[0].url: Invalid value: "https://unreachable.us-gov-west-1.amazonaws.com/": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 172.30.0.10:53: no such host
~~~

Best Regards.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438