Bug 1878655

Summary: [aws-custom-region] creating manifests take too much time when custom endpoint is unreachable
Product: OpenShift Container Platform Reporter: Yunfei Jiang <yunjiang>
Component: Installer Assignee: Jeremiah Stuever <jstuever>
Installer sub component: openshift-installer QA Contact: Pedro Amoedo <pamoedom>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: adahiya, jstuever, mstaeble, pamoedom
Version: 4.6 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:32:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yunfei Jiang 2020-09-14 09:56:39 UTC
Description of problem:

If an unreachable endpoint is provided in install-config.yaml, the fatal message appears only after approximately 1 hour, which is far too long.

Install-config:
<—snip—>
platform:
  aws:
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com
<—snip—>
 

time="2020-09-14T06:22:46Z" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/ec2-user/.aws/credentials\""
time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-04-210224

How reproducible:
100%

Steps to Reproduce:
1. Create install-config, set the region to us-gov-west-1, and add an invalid service endpoint:
<—snip—>
platform:
  aws:
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com
<—snip—>
2. Create manifests
3.

Actual results:
time="2020-09-14T06:22:46Z" level=debug msg="resolved AWS service ec2 (us-gov-west-1) to \"https://unreachable.us-gov-west-1.amazonaws.com\""
time="2020-09-14T07:17:42Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": platform.aws.subnets: Invalid value: []string{\"subnet-0a03cf26376582e75\", \"subnet-08d190dab7258fc19\"}: describing subnets: RequestError: send request failed\ncaused by: Post \"https://unreachable.us-gov-west-1.amazonaws.com/\": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 10.0.0.2:53: no such host"

Expected results:
Report error within a reasonable time, maybe 5 minutes.

Additional info:

Comment 1 Abhinav Dahiya 2020-09-14 19:09:42 UTC
The installer uses the AWS SDK to retry failing requests. The SDK treats most network failures as retryable, and we can't fail fast because there are definitely cases where DNS might be glitching.
So I do not think this is a bug that needs to be fixed.
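
To illustrate how retried requests can add up to a long wait, here is a minimal Go sketch of capped exponential backoff. The function name totalBackoff and every number in it are hypothetical, chosen only for the arithmetic; they are not the AWS SDK's actual retry settings, which this report does not state.

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff returns the worst-case time spent sleeping between attempts
// when the delay doubles after each failure and is capped at maxDelay.
// All parameters are hypothetical; they are not the AWS SDK's defaults.
func totalBackoff(base, maxDelay time.Duration, retries int) time.Duration {
	var total time.Duration
	delay := base
	for i := 0; i < retries; i++ {
		if delay > maxDelay {
			delay = maxDelay
		}
		total += delay
		delay *= 2
	}
	return total
}

func main() {
	// Hypothetical values: 1s base delay, 8s cap, 5 retries.
	// The sleeps are 1s + 2s + 4s + 8s + 8s.
	fmt.Println(totalBackoff(time.Second, 8*time.Second, 5)) // prints 23s
}
```

The sketch counts only the sleep time between attempts. Each attempt's own connect or lookup time comes on top of that, and the total repeats for every API call that is retried.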

If users are setting up invalid endpoints, testing for _invalidity_ is a rabbit hole, but we can decide to validate certain cases.
- For example: is the endpoint reachable at all, i.e. can we at least resolve it and connect? Anything more substantial, such as checking whether it is actually an EC2 endpoint, is not really possible.
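
A resolve-and-connect check of the kind described above could look roughly like the following Go sketch. This is not the installer's actual validation code; the names hostPort and checkReachable and the 5-second timeout are assumptions made for illustration. It only confirms that the host resolves and accepts a TCP connection, not that it is an EC2 API.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// hostPort extracts "host:port" from an endpoint URL, defaulting the port
// from the scheme when the URL does not name one.
func hostPort(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("endpoint %q has no host", rawURL)
	}
	port := u.Port()
	if port == "" {
		switch u.Scheme {
		case "https":
			port = "443"
		case "http":
			port = "80"
		default:
			return "", fmt.Errorf("endpoint %q has unsupported scheme %q", rawURL, u.Scheme)
		}
	}
	return net.JoinHostPort(u.Hostname(), port), nil
}

// checkReachable resolves the endpoint host and opens a TCP connection,
// giving up after timeout. It does not verify that the peer is an EC2 API.
func checkReachable(rawURL string, timeout time.Duration) error {
	addr, err := hostPort(rawURL)
	if err != nil {
		return err
	}
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Hypothetical timeout; the endpoint is the one from this report.
	err := checkReachable("https://unreachable.us-gov-west-1.amazonaws.com", 5*time.Second)
	fmt.Println(err)
}
```

A single attempt with a short timeout fails quickly on a bad hostname, at the cost of also failing on a transient DNS glitch, which is the trade-off raised in this comment.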

Secondly, the AWS service endpoints can be set up such that access is allowed or rejected based on certain parameters in the requests, like source network or constraint tags. So this is not as clear-cut.

Tracking in https://issues.redhat.com/browse/CORS-1546

Comment 2 Jeremiah Stuever 2021-04-19 18:10:35 UTC
*** Bug 1948036 has been marked as a duplicate of this bug. ***

Comment 5 Pedro Amoedo 2021-04-30 16:16:23 UTC
[QA Summary]

[Installer Version]

~~~
17:11:34  ./openshift-install 4.8.0-0.ci-2021-04-30-142055
17:11:34  built from commit da0ed4e925c093ed6e049c11d2bc68d562cc8d54
17:11:34  release image registry.ci.openshift.org/ocp/release@sha256:d2a3cc0c58f5ae31e27f8182037aee1b0ab5f91a2e2256ffbb50db20546bf97d
~~~

NOTE: Latest accepted nightly version "4.8.0-0.nightly-2021-04-30-102231" doesn't contain the commit "da0ed4e92" yet, therefore, I've used "4.8.0-0.ci-2021-04-30-142055" for testing.

[Installer Parameters]

~~~
    serviceEndpoints:
    - name: ec2
      url: https://unreachable.us-gov-west-1.amazonaws.com/
~~~

[Results]

As expected, the installation now aborts early, after first checking the endpoints:

~~~
17:58:18  [INFO] Generating manifests files.....
17:58:18  level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": platform.aws.serviceEndpoints[0].url: Invalid value: "https://unreachable.us-gov-west-1.amazonaws.com/": dial tcp: lookup unreachable.us-gov-west-1.amazonaws.com on 172.30.0.10:53: no such host
~~~

Best Regards.

Comment 8 errata-xmlrpc 2021-07-27 22:32:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438