Bug 1949626
| Summary: | machine-api fails to create AWS client in new regions | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Staebler <mstaeble> | ||||
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> | ||||
| Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | high | ||||||
| Priority: | unspecified | CC: | mgugino, mimccune, yunjiang | ||||
| Version: | 4.8 | ||||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.8.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | No Doc Update | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1976192 1976198 (view as bug list) | Environment: | |||||
| Last Closed: | 2021-07-27 23:00:48 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1976192 | ||||||
| Attachments: |
|
||||||
just starting to take a look at this

@mstaeble do you happen to have a log for the machine controller? i see a few places in the code where this might be coming from, but if you still have the logs handy it would make it easier. if not, i will re-create the error condition.

actually, i don't think it will be needed. i think this just requires us to update the aws-sdk-go that the actuator is using. i see that "ap-northeast-3" only appears in a special fips version in the code we are running, but the new version has much wider support for that region. i am updating the sdk now and running tests.

Updating the aws-sdk-go will work for this particular region, but it does not solve the long-term issue of needing to support new regions as they come out without needing to update the code.

Created attachment 1774149 [details]
machine-controller logs
Here are the logs from the machine-controller.
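The long-term concern raised above (a region missing from the SDK's compiled-in table cannot be resolved until the SDK is updated) can be illustrated with a small standalone sketch. This is not the aws-sdk-go code or the machine-api actuator code; `knownEC2Endpoints`, `regionPattern`, `resolveStrict`, and `resolveWithFallback` are hypothetical names, and the sketch assumes that a new commercial region follows the usual `ec2.<region>.amazonaws.com` endpoint naming.

```go
package main

import (
	"fmt"
	"regexp"
)

// knownEC2Endpoints stands in for a region table compiled into an SDK release.
// Hypothetical data: ap-northeast-3 is deliberately absent to mimic an older table.
var knownEC2Endpoints = map[string]string{
	"us-east-1":      "https://ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "https://ec2.ap-northeast-1.amazonaws.com",
}

// regionPattern is an assumed shape for commercial-partition region names,
// for example "ap-northeast-3". It intentionally rejects names such as
// "us-gov-west-1", which belong to other partitions.
var regionPattern = regexp.MustCompile(`^[a-z]{2}-[a-z]+-\d+$`)

// resolveStrict only accepts regions present in the compiled-in table.
func resolveStrict(region string) (string, error) {
	if ep, ok := knownEC2Endpoints[region]; ok {
		return ep, nil
	}
	return "", fmt.Errorf("region %q not resolved", region)
}

// resolveWithFallback tries, in order: an operator-supplied custom endpoint,
// the compiled-in table, and finally an endpoint derived from the region name.
func resolveWithFallback(region string, custom map[string]string) (string, error) {
	if ep, ok := custom[region]; ok {
		return ep, nil
	}
	if ep, err := resolveStrict(region); err == nil {
		return ep, nil
	}
	if regionPattern.MatchString(region) {
		return "https://ec2." + region + ".amazonaws.com", nil
	}
	return "", fmt.Errorf("region %q not resolved and does not match the assumed naming pattern", region)
}

func main() {
	region := "ap-northeast-3"
	if _, err := resolveStrict(region); err != nil {
		fmt.Println("strict lookup failed:", err)
	}
	ep, err := resolveWithFallback(region, nil)
	if err != nil {
		fmt.Println("fallback lookup failed:", err)
		return
	}
	fmt.Println("fallback endpoint:", ep)
}
```

The ordering puts an explicitly configured endpoint first, so an operator override always wins over anything derived from the region name.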
(In reply to Matthew Staebler from comment #4)
> Updating the aws-sdk-go will work for this particular region, but it does
> not solve the long-term issue of needing to support new regions as they come
> out without needing to update the code.

Expanding on this, we will want to be able to support new regions in OpenShift 4.6. And, ideally, we would not need to wait for a new z release to add support.

i will need to dig into the aws-sdk-go code a little more, but my first impression is that it does validation internally for these region names. if that is the case, then our choices will be limited for how we handle this. perhaps there is a way to make the client pass through the region without validation, or maybe we could propose an upstream change, unless there is another option that i'm just not thinking about. i will continue to investigate the code, happy to entertain suggestions though. =)

i talked with Matthew on slack, there is another option here presented by the sdk. i am going to talk with the team about switching our implementation to use the other client methods so that we can avoid the update/backport cycle.

We already have the ability to specify custom endpoints for a region; we had to implement this for govcloud. Whatever the instructions are for setting up govcloud will work for new regions as far as I can tell.

Matt, i have converted our conversation about improving this functionality into a jira card. our team will plan this work for the next cycle, thanks for your suggestions around how to make it better =)
https://issues.redhat.com/browse/OCPCLOUD-1159

Validated on: 4.8.0-0.nightly-2021-05-06-210840
Based on the results below, moving to VERIFIED.
Steps: Installed an IPI AWS cluster with the region set as 'AWS_REGION=ap-northeast-3'; all nodes and machines were as expected.
[miyadav@miyadav ~]$ oc get machines -A -o json | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-05-07T05:59:03Z'
status: 'True'
type: InstanceExists
.
.
[miyadav@miyadav ~]$ oc get machines
NAME PHASE TYPE REGION ZONE AGE
miyadav-07-dl4cf-master-0 Running m5.xlarge ap-northeast-3 ap-northeast-3a 78m
miyadav-07-dl4cf-master-1 Running m5.xlarge ap-northeast-3 ap-northeast-3b 78m
miyadav-07-dl4cf-master-2 Running m5.xlarge ap-northeast-3 ap-northeast-3c 78m
miyadav-07-dl4cf-worker-ap-northeast-3a-kmzmp Running m5.large ap-northeast-3 ap-northeast-3a 69m
miyadav-07-dl4cf-worker-ap-northeast-3b-lglrd Running m5.large ap-northeast-3 ap-northeast-3b 69m
miyadav-07-dl4cf-worker-ap-northeast-3c-q2tpf Running m5.large ap-northeast-3 ap-northeast-3c 69m
.
.
[miyadav@miyadav ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-139-223.ap-northeast-3.compute.internal Ready master 74m v1.21.0-rc.0+291e731
ip-10-0-141-96.ap-northeast-3.compute.internal Ready worker 64m v1.21.0-rc.0+291e731
ip-10-0-162-197.ap-northeast-3.compute.internal Ready master 74m v1.21.0-rc.0+291e731
ip-10-0-178-142.ap-northeast-3.compute.internal Ready worker 64m v1.21.0-rc.0+291e731
ip-10-0-206-140.ap-northeast-3.compute.internal Ready master 74m v1.21.0-rc.0+291e731
ip-10-0-217-179.ap-northeast-3.compute.internal Ready worker 64m v1.21.0-rc.0+291e731
[miyadav@miyadav ~]$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |
Description of problem:
When using a new region that is unknown to the AWS SDK, the machine-api fails to create an AWS client.

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.8 cluster in the ap-northeast-3 region.
2. See that the Machines report InstanceExists failures.

Actual results:
oc get machines -A -oyaml | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-04-14T16:11:24Z'
  message: "Failed to check if machine exists: mstaeble-mgw8b-worker-ap-northeast-3a-c2b5q:\
    \ failed to create scope for machine: failed to create aws client: region \"ap-northeast-3\"\
    \ not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition:\
    \ \"all partitions\", service: \"ec2\", region: \"ap-northeast-3\""
  reason: ErrorCheckingProvider
  status: Unknown
  type: InstanceExists

Expected results:
The worker machines are created successfully.

Additional info:
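The check in step 2 could also be scripted across all machines instead of filtering by name with yq. The following is a minimal sketch, assuming the JSON has the same shape the yq query above walks (`.items[].metadata.name` and `.status.conditions`); `failingMachines` and the sample machine names are hypothetical, and the embedded JSON is made-up illustration data rather than output from a real cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the fields shown in the InstanceExists condition above.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// machineList covers only the parts of the machine list used by this check.
type machineList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []condition `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

// failingMachines returns one "name: reason" entry for every machine whose
// InstanceExists condition is present with a status other than "True".
// Machines that have no InstanceExists condition at all are not reported.
func failingMachines(data []byte) ([]string, error) {
	var list machineList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, fmt.Errorf("decoding machine list: %w", err)
	}
	var failing []string
	for _, m := range list.Items {
		for _, c := range m.Status.Conditions {
			if c.Type == "InstanceExists" && c.Status != "True" {
				failing = append(failing, m.Metadata.Name+": "+c.Reason)
			}
		}
	}
	return failing, nil
}

// sample is hypothetical input shaped like `oc get machines -A -o json` output.
const sample = `{"items":[
 {"metadata":{"name":"worker-a"},
  "status":{"conditions":[{"type":"InstanceExists","status":"Unknown","reason":"ErrorCheckingProvider"}]}},
 {"metadata":{"name":"worker-b"},
  "status":{"conditions":[{"type":"InstanceExists","status":"True"}]}}
]}`

func main() {
	failing, err := failingMachines([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range failing {
		fmt.Println(f)
	}
}
```

In practice the sample constant would be replaced by the saved output of `oc get machines -A -o json`; an empty result would correspond to the expected behavior described above.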