Bug 2065510 - [AWS] failed to create cluster on ap-southeast-3
Summary: [AWS] failed to create cluster on ap-southeast-3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joel Speed
QA Contact: Huali Liu
URL:
Whiteboard:
Duplicates: 2064723
Depends On:
Blocks: 2109124
Reported: 2022-03-18 02:51 UTC by Yunfei Jiang
Modified: 2022-08-10 10:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The AWS SDK contains a list of known regions. We were using a strict mode within the SDK that validated that the region was known and returned an error otherwise.
Consequence: As new regions were added, they could not be used until the SDK was updated to contain the new region information.
Fix: We now use a less strict setting and warn the user when we do not recognise the region.
Result: New regions can now be used immediately, though they may cause spurious warnings.
Clone Of:
Clones: 2109124
Environment:
Last Closed: 2022-08-10 10:54:38 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-api-provider-aws pull 29 | 0 | None | open | Bug 2065510: Update AWS SDK to 1.43.20 | 2022-03-18 12:15:34 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:55:04 UTC

Description Yunfei Jiang 2022-03-18 02:51:50 UTC
The AWS Jakarta (ap-southeast-3) region is now open [1], but installing OCP in this region fails:
 
> oc logs -n openshift-machine-api machine-api-controllers-7f945cbc56-8hjvl  -c machine-controller

E0318 02:00:28.875146       1 controller.go:303] yunjiang-se2-42qt8-worker-ap-southeast-3b-pt655: failed to check if machine exists: yunjiang-se2-42qt8-worker-ap-southeast-3b-pt655: failed to create scope for machine: failed to create aws client: region "ap-southeast-3" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "ap-southeast-3"
E0318 02:00:28.884895       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="yunjiang-se2-42qt8-worker-ap-southeast-3b-pt655: failed to create scope for machine: failed to create aws client: region \"ap-southeast-3\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"ap-southeast-3\"" "name"="yunjiang-se2-42qt8-worker-ap-southeast-3b-pt655" "namespace"="openshift-machine-api"

[1] https://aws.amazon.com/blogs/aws/now-open-aws-asia-pacific-jakarta-region/
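
The error in the log above matches the behaviour described in the Doc Text field: a strict lookup against a region list bundled with the SDK. The following is a minimal sketch of that idea in Python, not the provider's actual code (which is written in Go against aws-sdk-go). KNOWN_REGIONS, resolve_endpoint, and the strict flag are hypothetical names chosen for illustration, and the fallback URL pattern is an assumption based on the endpoint URLs quoted later in this bug.

```python
import warnings

# Hypothetical stand-in for the region list bundled with one SDK release.
# A newly opened region is missing until the SDK is updated.
KNOWN_REGIONS = {"us-east-1", "ap-southeast-1", "ap-southeast-2"}


class UnknownEndpointError(Exception):
    """Raised in strict mode when the region is not in the bundled list."""


def resolve_endpoint(service: str, region: str, strict: bool) -> str:
    """Return an endpoint URL, validating the region only in strict mode."""
    if region not in KNOWN_REGIONS:
        if strict:
            # Old behaviour: unknown regions are rejected outright.
            raise UnknownEndpointError(
                f'could not resolve endpoint: service "{service}", region "{region}"'
            )
        # New behaviour: warn, then fall back to the conventional URL pattern.
        warnings.warn(f'region "{region}" is not recognised; using the default URL pattern')
    return f"https://{service}.{region}.amazonaws.com"
```

With strict=True, a call for ap-southeast-3 raises UnknownEndpointError; with strict=False it emits a warning and still returns a usable URL, which is the trade-off the Doc Text describes (new regions work immediately, at the cost of possibly spurious warnings).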


Version-Release number of the following components: 
4.10.5

How reproducible: 
Always 
 

Steps to Reproduce: 
1. Create an IPI cluster on ap-southeast-3

Actual results: 
Compute nodes cannot be created; the install fails.

Expected results:
Cluster is successfully installed on ap-southeast-3.

Additional info:

Comment 1 dmoiseev 2022-03-18 14:25:19 UTC
*** Bug 2064723 has been marked as a duplicate of this bug. ***

Comment 4 Okky Hendriansyah Tri Firgantoro 2022-03-20 02:18:36 UTC
Hi Team, 

I have tried to install OCP 4.10.3 IPI on ap-southeast-3 and it was successful.

Initially, with the default installation approach (assuming the region was already supported), the installation failed at worker node creation. It turned out that ap-southeast-3 support in the AWS SDK was released [1] a day after the region was officially announced by AWS [2]. Since openshift-installer vendors aws-sdk, it seems the specific commit that includes the region was not yet included in the OCP 4.10 GA timeframe, so the installer has no visibility of the ap-southeast-3 region.

After learning that the AWS service endpoints can be overridden [3], and which services the installation requires [4], I successfully created the cluster using this approach:
1. Defining custom AWS Service Endpoints for ap-southeast-3
2. Upload the RHCOS 4.10.3 AWS VMDK image to my AWS S3 bucket
3. Register the RHCOS 4.10.3 image as an AMI in the region (without this, MachineSet scaling does not work and fails with an error saying the amiID was not found)
4. Update the install-config.yaml

I have also briefly tested the cluster's functionality. The following works:
1. Installation finished around normal duration (30-40 minutes)
2. Registry bucket created and S2I builds are successful (pushing/pulling to/from registry)
3. MachineSet scaling
4. AWS EBS CSI and built-in volume consumption tested using CrunchyData workload
5. Google login authentication (using @redhat.com email)
6. Patch release cluster upgrade from 4.10.3 to 4.10.4 using fast-4.10 channel

What I haven't tested yet, and am going to test, is a minor upgrade (e.g. 4.8 to 4.9) with only the OCP 4.8 AMI uploaded and specified. This will hopefully replicate the situation where a customer later wants to upgrade the cluster to 4.11.

Here is the snippet in install-config.yaml that I used:

...
platform:
  aws:
    region: ap-southeast-3
    userTags:
      adminContact: otrifirg
      costCenter: 420
      activity: poc
      customer: ptbc
    amiID: ami-0db1bdfbb0592c159
    serviceEndpoints: 
      - name: ec2
        url: https://ec2.ap-southeast-3.amazonaws.com
      - name: elasticloadbalancing
        url: https://elasticloadbalancing.ap-southeast-3.amazonaws.com
      - name: s3
        url: https://s3.ap-southeast-3.amazonaws.com
      - name: autoscaling
        url: https://autoscaling.ap-southeast-3.amazonaws.com
      - name: servicequotas
        url: https://servicequotas.ap-southeast-3.amazonaws.com
      - name: sts
        url: https://sts.ap-southeast-3.amazonaws.com
      - name: kms
        url: https://kms.ap-southeast-3.amazonaws.com
...
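
All serviceEndpoints entries in the snippet above follow the same URL pattern, so the list can be generated rather than typed by hand. A minimal sketch, assuming the pattern https://<service>.<region>.amazonaws.com holds for every listed service in the target region (true for the URLs shown above, but not guaranteed for other services or partitions). The function names service_endpoints and to_yaml are hypothetical helpers, not part of any installer tooling.

```python
# Services taken from the install-config.yaml snippet above.
SERVICES = [
    "ec2",
    "elasticloadbalancing",
    "s3",
    "autoscaling",
    "servicequotas",
    "sts",
    "kms",
]


def service_endpoints(region, services=SERVICES):
    """Build serviceEndpoints entries using the assumed regional URL pattern."""
    return [
        {"name": name, "url": f"https://{name}.{region}.amazonaws.com"}
        for name in services
    ]


def to_yaml(entries, indent=6):
    """Render the entries as YAML list items at the given indentation."""
    pad = " " * indent
    lines = []
    for entry in entries:
        lines.append(f"{pad}- name: {entry['name']}")
        lines.append(f"{pad}  url: {entry['url']}")
    return "\n".join(lines)


# Prints a block that can be pasted under platform.aws.serviceEndpoints.
print(to_yaml(service_endpoints("ap-southeast-3")))
```

Each generated URL should still be checked against the AWS endpoint documentation for the region before use.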

[1] https://github.com/aws/aws-sdk-go/blob/v1.42.23/models/endpoints/endpoints.json
[2] https://aws.amazon.com/blogs/aws/now-open-aws-asia-pacific-jakarta-region/
[3] https://docs.openshift.com/container-platform/4.8/installing/installing_aws/installing-aws-account.html#nw-endpoint-route53_installing-aws-account
[4] https://docs.openshift.com/container-platform/4.8/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
[5] https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/installing/installing-on-aws#installation-aws-user-infra-rhcos-ami_installing-aws-user-infra
[6] https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/installing/installing-on-aws#installation-aws-upload-custom-rhcos-ami_installing-aws-user-infra

Comment 5 Okky Hendriansyah Tri Firgantoro 2022-03-20 07:42:37 UTC
Hi,

An update on the region. Using the same approach as before, I managed to deploy OCP 4.8.14 with RHCOS 4.8.14. After that, I upgraded the cluster to OCP 4.9.23 using the stable-4.9 channel; the upgrade of 3 masters and 3 workers completed after almost 2 hours. This was done without first uploading and registering the RHCOS 4.9.x AMI in the ap-southeast-3 region. MachineSet scaling also still works.

Comment 6 Huali Liu 2022-03-22 05:12:17 UTC
Reproduced the issue on 4.10.5.
Steps:
1. Create an IPI cluster on ap-southeast-3
2. Cluster install fails; check the cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          65m     Unable to apply 4.10.5: some cluster operators have not yet rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                              PHASE   TYPE   REGION   ZONE   AGE
huliu-aws114-s46zx-master-0                                                      72m
huliu-aws114-s46zx-master-1                                                      72m
huliu-aws114-s46zx-master-2                                                      72m
huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz                                  63m
huliu-aws114-s46zx-worker-ap-southeast-3b-hzl8h                                  63m
huliu-aws114-s46zx-worker-ap-southeast-3c-tn5rn                                  63m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset 
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
huliu-aws114-s46zx-worker-ap-southeast-3a   1         1                             72m
huliu-aws114-s46zx-worker-ap-southeast-3b   1         1                             72m
huliu-aws114-s46zx-worker-ap-southeast-3c   1         1                             72m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-138-63.ap-southeast-3.compute.internal    Ready    master   69m   v1.23.3+e419edf
ip-10-0-161-100.ap-southeast-3.compute.internal   Ready    master   69m   v1.23.3+e419edf
ip-10-0-203-113.ap-southeast-3.compute.internal   Ready    master   69m   v1.23.3+e419edf
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-7f945cbc56-ssb26 -c machine-controller
...
E0322 03:24:51.944980       1 controller.go:303] huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to check if machine exists: huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to create scope for machine: failed to create aws client: region "ap-southeast-3" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "ap-southeast-3"
E0322 03:24:51.952173       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to create scope for machine: failed to create aws client: region \"ap-southeast-3\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"ap-southeast-3\"" "name"="huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz" "namespace"="openshift-machine-api" 


Verified on 4.11.0-0.nightly-2022-03-20-160505: compute nodes were created successfully and there is no "failed to create scope for machine" error in the machine-controller log. The cluster install still failed, but that is due to https://bugzilla.redhat.com/show_bug.cgi?id=2065552, so I am moving this to Verified.
1. Create an IPI cluster on ap-southeast-3
2. Cluster install fails; check the cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          59m     Unable to apply 4.11.0-0.nightly-2022-03-20-160505: the cluster operator image-registry has not yet successfully rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                              PHASE     TYPE        REGION           ZONE              AGE
huliu-aws116-ftcts-master-0                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3a   60m
huliu-aws116-ftcts-master-1                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3b   60m
huliu-aws116-ftcts-master-2                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3c   60m
huliu-aws116-ftcts-worker-ap-southeast-3a-bf859   Running   m5.large    ap-southeast-3   ap-southeast-3a   51m
huliu-aws116-ftcts-worker-ap-southeast-3b-9zxmz   Running   m5.large    ap-southeast-3   ap-southeast-3b   51m
huliu-aws116-ftcts-worker-ap-southeast-3c-plhv9   Running   m5.large    ap-southeast-3   ap-southeast-3c   51m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
huliu-aws116-ftcts-worker-ap-southeast-3a   1         1         1       1           60m
huliu-aws116-ftcts-worker-ap-southeast-3b   1         1         1       1           60m
huliu-aws116-ftcts-worker-ap-southeast-3c   1         1         1       1           60m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-129-157.ap-southeast-3.compute.internal   Ready    worker   43m   v1.23.3+02aefbf
ip-10-0-143-131.ap-southeast-3.compute.internal   Ready    master   57m   v1.23.3+02aefbf
ip-10-0-161-81.ap-southeast-3.compute.internal    Ready    worker   43m   v1.23.3+02aefbf
ip-10-0-165-16.ap-southeast-3.compute.internal    Ready    master   57m   v1.23.3+02aefbf
ip-10-0-219-146.ap-southeast-3.compute.internal   Ready    master   57m   v1.23.3+02aefbf
ip-10-0-221-0.ap-southeast-3.compute.internal     Ready    worker   45m   v1.23.3+02aefbf
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-659f4bdc66-5qtvh  -c machine-controller |grep "failed to create scope for machine"
liuhuali@Lius-MacBook-Pro huali-test %
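
The grep check above can also be expressed as a small script that reports which regions the machine-controller failed to resolve. A minimal sketch, assuming the log wording matches the lines quoted in this bug (region "..." not resolved, with or without backslash-escaped quotes). The function name unresolved_regions is hypothetical.

```python
import re

# Matches both the plain form   region "ap-southeast-3" not resolved
# and the escaped form          region \"ap-southeast-3\" not resolved
UNRESOLVED = re.compile(r'region \\?"([a-z0-9-]+)\\?" not resolved')


def unresolved_regions(log_text):
    """Return the sorted, de-duplicated regions reported as unresolved."""
    return sorted(set(UNRESOLVED.findall(log_text)))
```

Saving the output of the oc logs command to a file and passing its contents to unresolved_regions returns an empty list on a fixed cluster, and a list containing ap-southeast-3 on an affected one.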

Comment 8 errata-xmlrpc 2022-08-10 10:54:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

