Description of problem:
When using a new region that is unknown to the AWS SDK, the machine-api fails to create an AWS client.

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.8 cluster in the ap-northeast-3 region.
2. See that the Machines report InstanceExists failures.

Actual results:
oc get machines -A -oyaml | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-04-14T16:11:24Z'
  message: "Failed to check if machine exists: mstaeble-mgw8b-worker-ap-northeast-3a-c2b5q:\
    \ failed to create scope for machine: failed to create aws client: region \"ap-northeast-3\"\
    \ not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition:\
    \ \"all partitions\", service: \"ec2\", region: \"ap-northeast-3\""
  reason: ErrorCheckingProvider
  status: Unknown
  type: InstanceExists

Expected results:
The worker machines are created successfully.

Additional info:
Just starting to take a look at this.
@mstaeble do you happen to have a log for the machine controller? I see a few places in the code where this might be coming from, and having the logs handy would make it easier to pin down. If not, I will re-create the error condition.
Actually, I don't think that will be needed. I think this just requires us to update the aws-sdk-go that the actuator is using. "ap-northeast-3" appears only in a special FIPS variant in the SDK version we are running, but the new version has much wider support for that region. I am updating the SDK now and running tests.
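For context, the SDK's resolver works from a static, generated table of partitions and endpoints, so a region that the vendored aws-sdk-go predates fails to resolve before any API call is made. A stdlib-only sketch of that behavior (the table and error text here are illustrative, not the SDK's actual data structures):

```go
package main

import "fmt"

// knownRegions mimics the SDK's generated endpoint table. A vendored
// aws-sdk-go release that predates ap-northeast-3 simply has no entry
// for it, so client creation fails with an unresolved-endpoint error.
var knownRegions = map[string]string{
	"us-east-1":      "ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "ec2.ap-northeast-1.amazonaws.com",
	"ap-northeast-2": "ec2.ap-northeast-2.amazonaws.com",
	// ap-northeast-3 is missing, as in the older SDK release.
}

// resolveEC2Endpoint looks a region up in the static table and fails
// for anything the table does not contain.
func resolveEC2Endpoint(region string) (string, error) {
	if host, ok := knownRegions[region]; ok {
		return "https://" + host, nil
	}
	return "", fmt.Errorf("could not resolve endpoint: service: %q, region: %q", "ec2", region)
}

func main() {
	for _, r := range []string{"ap-northeast-2", "ap-northeast-3"} {
		if ep, err := resolveEC2Endpoint(r); err != nil {
			fmt.Printf("%s -> error: %v\n", r, err)
		} else {
			fmt.Printf("%s -> %s\n", r, ep)
		}
	}
}
```

Updating the vendored SDK adds the missing table entry, which is why the bump fixes this particular region.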
Updating the aws-sdk-go will work for this particular region, but it does not solve the long-term issue of needing to support new regions as they come out without needing to update the code.
Created attachment 1774149 [details]
machine-controller logs

Here are the logs from the machine-controller.
(In reply to Matthew Staebler from comment #4)
> Updating the aws-sdk-go will work for this particular region, but it does
> not solve the long-term issue of needing to support new regions as they come
> out without needing to update the code.

Expanding on this, we will want to be able to support new regions in OpenShift 4.6. And, ideally, we would not need to wait for a new z-stream release to add support.
I will need to dig into the aws-sdk-go code a little more, but my first impression is that it validates these region names internally. If that is the case, our choices for handling this will be limited: perhaps there is a way to make the client pass the region through without validation, or maybe we could propose an upstream change, unless there is another option that I'm just not thinking of. I will continue to investigate the code; happy to entertain suggestions though. =)
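The "pass-through without validation" idea amounts to falling back to the conventional hostname when the region is not in the static table. If I recall correctly, aws-sdk-go v1 exposes an opt-in for exactly this on its endpoints resolver (a ResolveUnknownEndpoints option), but the exact name and semantics depend on the SDK version. A stdlib-only sketch of the fallback idea, with an assumed naming convention:

```go
package main

import "fmt"

// Static table, as generated into the SDK at build time.
var knownEC2Endpoints = map[string]string{
	"us-east-1":      "ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "ec2.ap-northeast-1.amazonaws.com",
}

// resolveWithFallback consults the static table first and, when the
// region is unknown, synthesizes the conventional hostname instead of
// returning an error. Assumption: commercial-partition regions follow
// the <service>.<region>.amazonaws.com convention, which would not
// hold for other partitions (GovCloud, China).
func resolveWithFallback(service, region string) string {
	if host, ok := knownEC2Endpoints[region]; ok && service == "ec2" {
		return "https://" + host
	}
	return fmt.Sprintf("https://%s.%s.amazonaws.com", service, region)
}

func main() {
	// A region missing from the table still yields a usable endpoint.
	fmt.Println(resolveWithFallback("ec2", "ap-northeast-3"))
}
```

The trade-off is that a typo in a region name now produces a plausible-looking but nonexistent endpoint instead of a fast validation error, which is presumably why the SDK keeps this behavior opt-in.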
I talked with Matthew on Slack; there is another option here presented by the SDK. I am going to talk with the team about switching our implementation to use the other client methods so that we can avoid the update/backport cycle.
We already have the ability to specify custom endpoints for a region; we had to implement this for GovCloud. Whatever the instructions are for setting up GovCloud should work for new regions as well, as far as I can tell.
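For reference, the custom-endpoint mechanism is driven by a serviceEndpoints list in the AWS platform configuration. A rough install-config fragment sketching the idea (the exact field path and supported service names are from memory and may differ by release; check the GovCloud install docs for the authoritative shape):

```yaml
# install-config.yaml fragment (illustrative): point specific AWS
# services at explicit endpoint URLs for a region the vendored SDK
# does not yet know about.
platform:
  aws:
    region: ap-northeast-3
    serviceEndpoints:
    - name: ec2
      url: https://ec2.ap-northeast-3.amazonaws.com
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.ap-northeast-3.amazonaws.com
```

With explicit URLs supplied, the client never needs the SDK's built-in endpoint table for those services, which is what makes this a workaround independent of SDK version.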
Matt, I have converted our conversation about improving this functionality into a Jira card. Our team will plan this work for the next cycle. Thanks for your suggestions on how to make it better =)
https://issues.redhat.com/browse/OCPCLOUD-1159
Validated on: 4.8.0-0.nightly-2021-05-06-210840

Based on the results below, moving to VERIFIED.

Steps:
Installed an IPI AWS cluster with 'AWS_REGION=ap-northeast-3'; all nodes and machines were as expected.

[miyadav@miyadav ~]$ oc get machines -A -o json | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-05-07T05:59:03Z'
  status: 'True'
  type: InstanceExists
...

[miyadav@miyadav ~]$ oc get machines
NAME                                            PHASE     TYPE        REGION           ZONE              AGE
miyadav-07-dl4cf-master-0                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3a   78m
miyadav-07-dl4cf-master-1                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3b   78m
miyadav-07-dl4cf-master-2                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3c   78m
miyadav-07-dl4cf-worker-ap-northeast-3a-kmzmp   Running   m5.large    ap-northeast-3   ap-northeast-3a   69m
miyadav-07-dl4cf-worker-ap-northeast-3b-lglrd   Running   m5.large    ap-northeast-3   ap-northeast-3b   69m
miyadav-07-dl4cf-worker-ap-northeast-3c-q2tpf   Running   m5.large    ap-northeast-3   ap-northeast-3c   69m

[miyadav@miyadav ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-139-223.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-141-96.ap-northeast-3.compute.internal    Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-162-197.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-178-142.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-206-140.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-217-179.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438