Bug 2065510
Summary: [AWS] failed to create cluster on ap-southeast-3

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yunfei Jiang <yunjiang> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Huali Liu <huliu> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | high | CC: | dmoiseev, ffathurr, kbater, otrifirg |
| Version: | 4.10 | Target Release: | 4.11.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | Bug Fix | Type: | Bug |
| Last Closed: | 2022-08-10 10:54:38 UTC | Bug Blocks: | 2109124 (view as bug list) |

Doc Text (Bug Fix):

- Cause: AWS maintains a list of known regions within its SDK. We were using a strict mode within the SDK that validated that the region was known and threw an error otherwise.
- Consequence: As new regions were added, they could not be used until the vendored SDK was updated to contain the new region information.
- Fix: We now use a less strict setting and warn the user when we do not recognise the region.
- Result: New regions can now be used immediately, though they may cause spurious warnings.
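The Doc Text above turns on the difference between strict and relaxed region matching in aws-sdk-go's endpoints package. The following is a minimal illustrative sketch, not code from the machine-api provider, showing how strict matching fails with UnknownEndpointError when the vendored SDK does not know a region, while the relaxed default derives an endpoint from the region name and leaves room to log a warning instead:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	resolver := endpoints.DefaultResolver()

	// Strict matching: resolution fails when the vendored endpoint metadata
	// does not contain the region; this is the "could not resolve endpoint"
	// error reported against ap-southeast-3 in this bug.
	if _, err := resolver.EndpointFor(endpoints.Ec2ServiceID, "ap-southeast-3",
		endpoints.StrictMatchingOption); err != nil {
		fmt.Println("strict matching:", err)
	}

	// Relaxed (default) matching: the SDK derives an endpoint from the region
	// name, so a newly launched region resolves; a caller can warn that the
	// region is unrecognised instead of refusing to create the client.
	if ep, err := resolver.EndpointFor(endpoints.Ec2ServiceID, "ap-southeast-3"); err == nil {
		fmt.Println("relaxed matching:", ep.URL)
	}
}
```

Note that against an SDK version that already ships ap-southeast-3 metadata, both calls succeed; the difference only matters when the region is newer than the vendored endpoints list.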
Description
Yunfei Jiang
2022-03-18 02:51:50 UTC
*** Bug 2064723 has been marked as a duplicate of this bug. ***

Hi Team,

I have tried to install OCP 4.10.3 IPI on ap-southeast-3 and it was successful. Initially, with the default installation approach (assuming the region was already supported), the installation failed on worker node creation. It turned out that ap-southeast-3 support in aws-sdk-go was only released [1] a day after the region was officially announced by AWS [2]. Since openshift-installer vendors aws-sdk, the specific commit including the region was not yet included in the OCP 4.10 GA timeframe, so the installer has no visibility of the ap-southeast-3 region.

After learning about the possibility of overriding the AWS Service Endpoints [3] and the full set of services required during installation [4], I successfully created the cluster using this approach:

1. Define custom AWS Service Endpoints for ap-southeast-3.
2. Upload the RHCOS 4.10.3 AWS VMDK image to my AWS S3 bucket.
3. Register the RHCOS 4.10.3 AMI in the region (without this, MachineSet scaling fails with an error saying the amiID is not found) [5][6].
4. Update install-config.yaml.

I also briefly tested the cluster functionality; the following works:

1. Installation finished within the normal duration (30-40 minutes).
2. The registry bucket was created and S2I builds succeed (pushing to and pulling from the registry).
3. MachineSet scaling.
4. AWS EBS CSI and built-in volume consumption, tested using a CrunchyData workload.
5. Google login authentication (using an @redhat.com email).
6. Patch release cluster upgrade from 4.10.3 to 4.10.4 using the fast-4.10 channel.

What I haven't tested yet, and am going to test, is a minor upgrade (e.g. 4.8 to 4.9) with only the OCP 4.8 AMI uploaded and specified. This hopefully replicates the situation where a customer wants to upgrade the cluster to 4.11 in the future.

Here is the snippet of the install-config.yaml that I used:

```yaml
...
platform:
  aws:
    region: ap-southeast-3
    userTags:
      adminContact: otrifirg
      costCenter: 420
      activity: poc
      customer: ptbc
    amiID: ami-0db1bdfbb0592c159
    serviceEndpoints:
    - name: ec2
      url: https://ec2.ap-southeast-3.amazonaws.com
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.ap-southeast-3.amazonaws.com
    - name: s3
      url: https://s3.ap-southeast-3.amazonaws.com
    - name: autoscaling
      url: https://autoscaling.ap-southeast-3.amazonaws.com
    - name: servicequotas
      url: https://servicequotas.ap-southeast-3.amazonaws.com
    - name: sts
      url: https://sts.ap-southeast-3.amazonaws.com
    - name: kms
      url: https://kms.ap-southeast-3.amazonaws.com
...
```

[1] https://github.com/aws/aws-sdk-go/blob/v1.42.23/models/endpoints/endpoints.json
[2] https://aws.amazon.com/blogs/aws/now-open-aws-asia-pacific-jakarta-region/
[3] https://docs.openshift.com/container-platform/4.8/installing/installing_aws/installing-aws-account.html#nw-endpoint-route53_installing-aws-account
[4] https://docs.openshift.com/container-platform/4.8/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
[5] https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/installing/installing-on-aws#installation-aws-user-infra-rhcos-ami_installing-aws-user-infra
[6] https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/installing/installing-on-aws#installation-aws-upload-custom-rhcos-ami_installing-aws-user-infra
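The workaround above succeeds because the serviceEndpoints overrides in install-config.yaml give every client an explicit URL, so endpoint resolution never has to consult the SDK's built-in region list. As a rough sketch of that mechanism (illustrative only; the overrides map and fallback logic here are assumptions for demonstration, not the installer's or machine-api's actual implementation):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

// overrides mirrors a few entries of the serviceEndpoints list shown above.
var overrides = map[string]string{
	"ec2": "https://ec2.ap-southeast-3.amazonaws.com",
	"s3":  "https://s3.ap-southeast-3.amazonaws.com",
	"sts": "https://sts.ap-southeast-3.amazonaws.com",
}

func main() {
	resolver := endpoints.ResolverFunc(func(service, region string, opts ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
		// Prefer a configured URL, so resolution works even for a region the
		// vendored SDK has never heard of.
		if url, ok := overrides[service]; ok {
			return endpoints.ResolvedEndpoint{URL: url, SigningRegion: region}, nil
		}
		// Fall back to the SDK's default resolution for any other service.
		return endpoints.DefaultResolver().EndpointFor(service, region, opts...)
	})

	sess, err := session.NewSession(&aws.Config{
		Region:           aws.String("ap-southeast-3"),
		EndpointResolver: resolver,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("client region:", aws.StringValue(sess.Config.Region))
}
```

In a real cluster these overrides are carried in the install configuration and surfaced to in-cluster components, which is why MachineSet scaling works once the custom endpoints (and a registered AMI) are in place.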
Hi,

An update on the region. Using the same approach as before, I managed to deploy OCP 4.8.14 with RHCOS 4.8.14. After that, I upgraded the cluster to OCP 4.9.23 using the stable-4.9 channel; the upgrade completed after almost 2 hours of upgrading 3 masters and 3 workers. This was without first uploading and registering the RHCOS 4.9.x AMI in the ap-southeast-3 region. MachineSet scaling also still works.

Steps to reproduce the issue on 4.10.5:

1. Create an IPI cluster on ap-southeast-3.
2. The cluster install fails; check the cluster:

```console
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          65m     Unable to apply 4.10.5: some cluster operators have not yet rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                              PHASE   TYPE   REGION   ZONE   AGE
huliu-aws114-s46zx-master-0                                                      72m
huliu-aws114-s46zx-master-1                                                      72m
huliu-aws114-s46zx-master-2                                                      72m
huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz                                  63m
huliu-aws114-s46zx-worker-ap-southeast-3b-hzl8h                                  63m
huliu-aws114-s46zx-worker-ap-southeast-3c-tn5rn                                  63m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
huliu-aws114-s46zx-worker-ap-southeast-3a   1         1                             72m
huliu-aws114-s46zx-worker-ap-southeast-3b   1         1                             72m
huliu-aws114-s46zx-worker-ap-southeast-3c   1         1                             72m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-138-63.ap-southeast-3.compute.internal    Ready    master   69m   v1.23.3+e419edf
ip-10-0-161-100.ap-southeast-3.compute.internal   Ready    master   69m   v1.23.3+e419edf
ip-10-0-203-113.ap-southeast-3.compute.internal   Ready    master   69m   v1.23.3+e419edf
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-7f945cbc56-ssb26 -c machine-controller
...
E0322 03:24:51.944980       1 controller.go:303] huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to check if machine exists: huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to create scope for machine: failed to create aws client: region "ap-southeast-3" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "ap-southeast-3"
E0322 03:24:51.952173       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz: failed to create scope for machine: failed to create aws client: region \"ap-southeast-3\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"ap-southeast-3\"" "name"="huliu-aws114-s46zx-worker-ap-southeast-3a-cvhxz" "namespace"="openshift-machine-api"
```

Verified on 4.11.0-0.nightly-2022-03-20-160505: compute nodes were created successfully and there is no "failed to create scope for machine" error in the machine-controller log. The cluster install still failed, but that is due to https://bugzilla.redhat.com/show_bug.cgi?id=2065552, so moving this to Verified.

1. Create an IPI cluster on ap-southeast-3.
2. The cluster install failed; check the cluster:
```console
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          59m     Unable to apply 4.11.0-0.nightly-2022-03-20-160505: the cluster operator image-registry has not yet successfully rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                              PHASE     TYPE        REGION           ZONE              AGE
huliu-aws116-ftcts-master-0                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3a   60m
huliu-aws116-ftcts-master-1                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3b   60m
huliu-aws116-ftcts-master-2                       Running   m5.xlarge   ap-southeast-3   ap-southeast-3c   60m
huliu-aws116-ftcts-worker-ap-southeast-3a-bf859   Running   m5.large    ap-southeast-3   ap-southeast-3a   51m
huliu-aws116-ftcts-worker-ap-southeast-3b-9zxmz   Running   m5.large    ap-southeast-3   ap-southeast-3b   51m
huliu-aws116-ftcts-worker-ap-southeast-3c-plhv9   Running   m5.large    ap-southeast-3   ap-southeast-3c   51m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
huliu-aws116-ftcts-worker-ap-southeast-3a   1         1         1       1           60m
huliu-aws116-ftcts-worker-ap-southeast-3b   1         1         1       1           60m
huliu-aws116-ftcts-worker-ap-southeast-3c   1         1         1       1           60m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-129-157.ap-southeast-3.compute.internal   Ready    worker   43m   v1.23.3+02aefbf
ip-10-0-143-131.ap-southeast-3.compute.internal   Ready    master   57m   v1.23.3+02aefbf
ip-10-0-161-81.ap-southeast-3.compute.internal    Ready    worker   43m   v1.23.3+02aefbf
ip-10-0-165-16.ap-southeast-3.compute.internal    Ready    master   57m   v1.23.3+02aefbf
ip-10-0-219-146.ap-southeast-3.compute.internal   Ready    master   57m   v1.23.3+02aefbf
ip-10-0-221-0.ap-southeast-3.compute.internal     Ready    worker   45m   v1.23.3+02aefbf
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-659f4bdc66-5qtvh -c machine-controller |grep "failed to create scope for machine"
liuhuali@Lius-MacBook-Pro huali-test %
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069