Bug 1949626 - machine-api fails to create AWS client in new regions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1976192
 
Reported: 2021-04-14 17:09 UTC by Matthew Staebler
Modified: 2021-07-27 23:01 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1976192 1976198 (view as bug list)
Environment:
Last Closed: 2021-07-27 23:00:48 UTC
Target Upstream Version:
Embargoed:


Attachments
machine-controller logs (477.15 KB, text/plain)
2021-04-21 21:37 UTC, Matthew Staebler


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 403 0 None open Bug 1949626: update aws-sdk-go to v1.38.25 2021-04-21 20:57:19 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:01:03 UTC

Description Matthew Staebler 2021-04-14 17:09:38 UTC
Description of problem:
When using a new region that is unknown to the AWS SDK, the machine-api fails to create an AWS client.


How reproducible:
100%


Steps to Reproduce:
1. Install a 4.8 cluster in the ap-northeast-3 region.
2. See that the Machines report InstanceExists failures.

Actual results:
oc get machines -A -oyaml | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-04-14T16:11:24Z'
  message: "Failed to check if machine exists: mstaeble-mgw8b-worker-ap-northeast-3a-c2b5q:\
    \ failed to create scope for machine: failed to create aws client: region \"ap-northeast-3\"\
    \ not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition:\
    \ \"all partitions\", service: \"ec2\", region: \"ap-northeast-3\""
  reason: ErrorCheckingProvider
  status: Unknown
  type: InstanceExists



Expected results:
The worker machines are created successfully.


Additional info:
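The failure above comes from the SDK's compiled-in endpoint table: the client refuses to build for a region the table does not contain. The following stdlib-only Go sketch (illustrative names, not the actual aws-sdk-go code) models why a binary vendored before a region launch cannot resolve it:

```go
package main

import "fmt"

// knownRegions models the static endpoint table compiled into an SDK
// binary. A region launched after the SDK was vendored is simply absent.
var knownRegions = map[string]string{
	"us-east-1":      "https://ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "https://ec2.ap-northeast-1.amazonaws.com",
	// "ap-northeast-3" is missing: the vendored table predates it.
}

// resolveEndpoint mimics strict resolution: an unknown region is an error
// rather than a guessed URL, which is what surfaces as UnknownEndpointError.
func resolveEndpoint(service, region string) (string, error) {
	if url, ok := knownRegions[region]; ok {
		return url, nil
	}
	return "", fmt.Errorf("could not resolve endpoint: service: %q, region: %q", service, region)
}

func main() {
	_, err := resolveEndpoint("ec2", "ap-northeast-3")
	fmt.Println(err)
}
```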

Comment 1 Michael McCune 2021-04-21 20:36:55 UTC
just starting to take a look at this

Comment 2 Michael McCune 2021-04-21 20:41:00 UTC
@mstaeble do you happen to have a log for the machine controller?

i see a few places in the code where this might be coming from, but if you still have the logs handy it would make it easier. if not, i will re-create the error condition.

Comment 3 Michael McCune 2021-04-21 20:55:12 UTC
actually, i don't think it will be needed. i think this just requires us to update the aws-sdk-go that the actuator is using. i see the "ap-northeast-3" only appears in a special fips version in the code we are running, but the new version has much wider support for that zone. i am updating the sdk now and running tests.

Comment 4 Matthew Staebler 2021-04-21 21:36:11 UTC
Updating the aws-sdk-go will work for this particular region, but it does not solve the long-term issue of needing to support new regions as they come out without needing to update the code.

Comment 5 Matthew Staebler 2021-04-21 21:37:04 UTC
Created attachment 1774149 [details]
machine-controller logs

Here are the logs from the machine-controller.

Comment 6 Matthew Staebler 2021-04-21 21:38:41 UTC
(In reply to Matthew Staebler from comment #4)
> Updating the aws-sdk-go will work for this particular region, but it does
> not solve the long-term issue of needing to support new regions as they come
> out without needing to update the code.

Expanding on this, we will want to be able to support new regions in OpenShift 4.6. And, ideally, we would not need to wait for a new z release to add support.

Comment 7 Michael McCune 2021-04-21 23:08:20 UTC
i will need to dig into the aws-sdk-go code a little more, but my first impression is that it does validation internally for these region names. if that is the case, then our choices will be limited for how we handle this, perhaps there is a way to make the client pass-through the region without validation, or maybe we could propose an upstream change. unless there is another option that i'm just not thinking about.

i will continue to investigate the code, happy to entertain suggestions though. =)

Comment 8 Michael McCune 2021-04-21 23:15:14 UTC
i talked with Matthew on slack, there is another option here presented by the sdk. i am going to talk with the team about switching our implementation to use the other client methods so that we can avoid the update/backport cycle.
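The direction discussed here amounts to a resolver that falls back to the standard endpoint pattern for regions the static table does not know (aws-sdk-go v1 exposes this kind of hook via a custom endpoint resolver on the client config). A stdlib-only sketch of that fallback, with illustrative names:

```go
package main

import "fmt"

// fallbackResolve sketches a pass-through resolver: known regions come
// from the static table, and anything else is assumed to follow the
// standard commercial-partition pattern <service>.<region>.amazonaws.com.
// This trades strict validation for working in regions newer than the SDK.
func fallbackResolve(table map[string]string, service, region string) string {
	if url, ok := table[region]; ok {
		return url
	}
	return fmt.Sprintf("https://%s.%s.amazonaws.com", service, region)
}

func main() {
	table := map[string]string{
		"us-east-1": "https://ec2.us-east-1.amazonaws.com",
	}
	fmt.Println(fallbackResolve(table, "ec2", "ap-northeast-3"))
}
```

Note the caveat: the fallback pattern only holds for the commercial partition; GovCloud and China regions use different domain suffixes, which is where the explicit endpoint overrides from the next comment come in.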

Comment 9 Michael Gugino 2021-04-22 00:13:43 UTC
We already have the ability to specify custom endpoints for a region, we had to implement this for govcloud.  Whatever the instructions are for setting up govcloud will work for new regions as far as I can tell.
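The mechanism referred to here is the per-service endpoint override carried in the install config. A sketch of the stanza, assuming the standard `platform.aws.serviceEndpoints` field; the URLs below are illustrative:

```yaml
# install-config.yaml (sketch; URLs are illustrative)
platform:
  aws:
    region: ap-northeast-3
    serviceEndpoints:
    - name: ec2
      url: https://ec2.ap-northeast-3.amazonaws.com
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.ap-northeast-3.amazonaws.com
```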

Comment 11 Michael McCune 2021-05-06 13:23:16 UTC
Matt, i have converted our conversation about improving this functionality into a jira card. our team will plan this work for the next cycle, thanks for your suggestions around how to make it better =)

https://issues.redhat.com/browse/OCPCLOUD-1159

Comment 12 Milind Yadav 2021-05-07 07:10:05 UTC
Validated on: 4.8.0-0.nightly-2021-05-06-210840


Based on the below, moving to VERIFIED.

Steps: Installed an IPI AWS cluster with AWS_REGION=ap-northeast-3; all nodes and machines were as expected.

[miyadav@miyadav ~]$ oc get machines -A -o json | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-05-07T05:59:03Z'
  status: 'True'
  type: InstanceExists
.
.
[miyadav@miyadav ~]$ oc get machines
NAME                                            PHASE     TYPE        REGION           ZONE              AGE
miyadav-07-dl4cf-master-0                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3a   78m
miyadav-07-dl4cf-master-1                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3b   78m
miyadav-07-dl4cf-master-2                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3c   78m
miyadav-07-dl4cf-worker-ap-northeast-3a-kmzmp   Running   m5.large    ap-northeast-3   ap-northeast-3a   69m
miyadav-07-dl4cf-worker-ap-northeast-3b-lglrd   Running   m5.large    ap-northeast-3   ap-northeast-3b   69m
miyadav-07-dl4cf-worker-ap-northeast-3c-q2tpf   Running   m5.large    ap-northeast-3   ap-northeast-3c   69m
.
.

 [miyadav@miyadav ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-139-223.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-141-96.ap-northeast-3.compute.internal    Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-162-197.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-178-142.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-206-140.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-217-179.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
[miyadav@miyadav ~]$

Comment 15 errata-xmlrpc 2021-07-27 23:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

