Bug 1878655 - [aws-custom-region] creating manifests take too much time when custom endpoint is unreachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jeremiah Stuever
QA Contact: Pedro Amoedo
URL:
Whiteboard:
Duplicates: 1948036 (view as bug list)
Depends On:
Blocks:
Reported: 2020-09-14 09:56 UTC by Yunfei Jiang
Modified: 2021-07-27 22:33 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:55 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4805 0 None open Bug 1878655: aws installconfig: endpoint validation should be before others 2021-04-19 18:16:36 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:33:19 UTC

Description Yunfei Jiang 2020-09-14 09:56:39 UTC
Description of problem:

If an unreachable endpoint is provided in install-config.yaml, the fatal error message appears only after approximately 1 hour, which is far too long.

Install-config:
<—snip—>
platform:
  aws:
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com
<—snip—>
 

time="2020-09-14T06:22:46Z" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/ec2-user/.aws/credentials\""
time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-210224

How reproducible:
100%

Steps to Reproduce:
1. Create an install-config, set the region to us-gov-west-1, and add an invalid service endpoint:
<—snip—>
platform:
  aws:
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com
<—snip—>
2. Create manifests

Actual results:
time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"

Expected results:
Report error within a reasonable time, maybe 5 minutes.

Additional info:

Comment 1 Abhinav Dahiya 2020-09-14 19:09:42 UTC
The installer uses the AWS SDK, which retries failing requests. The SDK treats most network failures as retryable, and we can't fail fast because there are definitely cases where DNS might be glitching.
So I do not think this is a bug that needs to be fixed.

If users are setting up invalid endpoints, testing for _invalidity_ is a rabbit hole, but we can decide to validate certain cases.
- For example, is the endpoint reachable at all: can you at least resolve and connect? Anything more substantial, like whether it is actually an EC2 endpoint, is not really possible.

Secondly, AWS service endpoints can be set up such that access is allowed or rejected based on certain parameters in the request, like source network or constraint tags. So this is not as clear-cut.

tracking in https://issues.redhat.com/browse/CORS-1546

Comment 2 Jeremiah Stuever 2021-04-19 18:10:35 UTC
*** Bug 1948036 has been marked as a duplicate of this bug. ***

Comment 5 Pedro Amoedo 2021-04-30 16:16:23 UTC
[QA Summary]

[Installer Version]

~~~
17:11:34  ./openshift-install 4.8.0-0.ci-2021-04-30-142055
17:11:34  built from commit da0ed4e925c093ed6e049c11d2bc68d562cc8d54
17:11:34  release image registry.ci.openshift.org/ocp/release@sha256:d2a3cc0c58f5ae31e27f8182037aee1b0ab5f91a2e2256ffbb50db20546bf97d
~~~

NOTE: Latest accepted nightly version "4.8.0-0.nightly-2021-04-30-102231" doesn't contain the commit "da0ed4e92" yet, therefore, I've used "4.8.0-0.ci-2021-04-30-142055" for testing.

[Installer Parameters]

~~~
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com/
~~~

[Results]

As expected, the installation now aborts early, after checking the endpoints first:

~~~
17:58:18  [INFO] Generating manifests files.....
17:58:18  level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": platform.aws.serviceEndpoints[0].url: Invalid value: "https://unreachable.us-gov-west-1.amazonaws.com/": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 172.30.0.10:53: no such host
~~~

Best Regards.

Comment 8 errata-xmlrpc 2021-07-27 22:32:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

