Description of problem:
When using a new region that is unknown to the AWS SDK, the machine-api fails to create an AWS client.

How reproducible:
100%

Steps to Reproduce:
1. Install a 4.8 cluster in the ap-northeast-3 region.
2. See that the Machines report InstanceExists failures.

Actual results:
oc get machines -A -oyaml | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-04-14T16:11:24Z'
  message: "Failed to check if machine exists: mstaeble-mgw8b-worker-ap-northeast-3a-c2b5q:\
    \ failed to create scope for machine: failed to create aws client: region \"ap-northeast-3\"\
    \ not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition:\
    \ \"all partitions\", service: \"ec2\", region: \"ap-northeast-3\""
  reason: ErrorCheckingProvider
  status: Unknown
  type: InstanceExists

Expected results:
The worker machines are created successfully.

Additional info:
Just starting to take a look at this.
@mstaeble do you happen to have a log for the machine controller? I see a few places in the code where this might be coming from, and having the logs handy would make it easier to pin down. If not, I will re-create the error condition.
Actually, I don't think that will be needed. I think this just requires us to update the aws-sdk-go that the actuator is using. "ap-northeast-3" appears only in a special FIPS variant in the SDK version we are running, but the new version has much wider support for that region. I am updating the SDK now and running tests.
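For context, the SDK's resolver works from a static, generated table of partitions and endpoints, so a region that the vendored aws-sdk-go predates fails to resolve before any API call is made. A stdlib-only sketch of that behavior (the table and error text here are illustrative, not the SDK's actual data structures):

```go
package main

import "fmt"

// knownRegions mimics the SDK's generated endpoint table. A vendored
// aws-sdk-go release that predates ap-northeast-3 simply has no entry
// for it, so client creation fails with an unresolved-endpoint error.
var knownRegions = map[string]string{
	"us-east-1":      "ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "ec2.ap-northeast-1.amazonaws.com",
	"ap-northeast-2": "ec2.ap-northeast-2.amazonaws.com",
	// ap-northeast-3 is missing, as in the older SDK release.
}

// resolveEC2Endpoint looks a region up in the static table and fails
// for anything the table does not contain.
func resolveEC2Endpoint(region string) (string, error) {
	if host, ok := knownRegions[region]; ok {
		return "https://" + host, nil
	}
	return "", fmt.Errorf("could not resolve endpoint: service: %q, region: %q", "ec2", region)
}

func main() {
	for _, r := range []string{"ap-northeast-2", "ap-northeast-3"} {
		if ep, err := resolveEC2Endpoint(r); err != nil {
			fmt.Printf("%s -> error: %v\n", r, err)
		} else {
			fmt.Printf("%s -> %s\n", r, ep)
		}
	}
}
```

Updating the vendored SDK adds the missing table entry, which is why the bump fixes this particular region.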
Updating the aws-sdk-go will work for this particular region, but it does not solve the long-term issue of needing to support new regions as they come out without needing to update the code.
Created attachment 1774149 [details]
machine-controller logs

Here are the logs from the machine-controller.
(In reply to Matthew Staebler from comment #4)
> Updating the aws-sdk-go will work for this particular region, but it does
> not solve the long-term issue of needing to support new regions as they come
> out without needing to update the code.

Expanding on this, we will want to be able to support new regions in OpenShift 4.6. And, ideally, we would not need to wait for a new z-stream release to add support.
I will need to dig into the aws-sdk-go code a little more, but my first impression is that it validates these region names internally. If that is the case, our choices for handling this will be limited: perhaps there is a way to make the client pass the region through without validation, or maybe we could propose an upstream change, unless there is another option that I'm just not thinking of. I will continue to investigate the code; happy to entertain suggestions though. =)
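The "pass-through without validation" idea amounts to falling back to the conventional hostname when the region is not in the static table. If I recall correctly, aws-sdk-go v1 exposes an opt-in for exactly this on its endpoints resolver (a ResolveUnknownEndpoints option), but the exact name and semantics depend on the SDK version. A stdlib-only sketch of the fallback idea, with an assumed naming convention:

```go
package main

import "fmt"

// Static table, as generated into the SDK at build time.
var knownEC2Endpoints = map[string]string{
	"us-east-1":      "ec2.us-east-1.amazonaws.com",
	"ap-northeast-1": "ec2.ap-northeast-1.amazonaws.com",
}

// resolveWithFallback consults the static table first and, when the
// region is unknown, synthesizes the conventional hostname instead of
// returning an error. Assumption: commercial-partition regions follow
// the <service>.<region>.amazonaws.com convention, which would not
// hold for other partitions (GovCloud, China).
func resolveWithFallback(service, region string) string {
	if host, ok := knownEC2Endpoints[region]; ok && service == "ec2" {
		return "https://" + host
	}
	return fmt.Sprintf("https://%s.%s.amazonaws.com", service, region)
}

func main() {
	// A region missing from the table still yields a usable endpoint.
	fmt.Println(resolveWithFallback("ec2", "ap-northeast-3"))
}
```

The trade-off is that a typo in a region name now produces a plausible-looking but nonexistent endpoint instead of a fast validation error, which is presumably why the SDK keeps this behavior opt-in.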
I talked with Matthew on Slack; there is another option here presented by the SDK. I am going to talk with the team about switching our implementation to use the other client methods so that we can avoid the update/backport cycle.
We already have the ability to specify custom endpoints for a region; we had to implement this for GovCloud. Whatever the instructions are for setting up GovCloud should work for new regions as well, as far as I can tell.
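For reference, the custom-endpoint mechanism is driven by a serviceEndpoints list in the AWS platform configuration. A rough install-config fragment sketching the idea (the exact field path and supported service names are from memory and may differ by release; check the GovCloud install docs for the authoritative shape):

```yaml
# install-config.yaml fragment (illustrative): point specific AWS
# services at explicit endpoint URLs for a region the vendored SDK
# does not yet know about.
platform:
  aws:
    region: ap-northeast-3
    serviceEndpoints:
    - name: ec2
      url: https://ec2.ap-northeast-3.amazonaws.com
    - name: elasticloadbalancing
      url: https://elasticloadbalancing.ap-northeast-3.amazonaws.com
```

With explicit URLs supplied, the client never needs the SDK's built-in endpoint table for those services, which is what makes this a workaround independent of SDK version.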
Matt, I have converted our conversation about improving this functionality into a Jira card. Our team will plan this work for the next cycle. Thanks for your suggestions on how to make it better =)
https://issues.redhat.com/browse/OCPCLOUD-1159
Validated on: 4.8.0-0.nightly-2021-05-06-210840

Based on the results below, moving to VERIFIED.

Steps:
Installed an IPI AWS cluster with 'AWS_REGION=ap-northeast-3'; all nodes and machines were as expected.

[miyadav@miyadav ~]$ oc get machines -A -o json | yq -y '.items[] | select(.metadata.name | contains("worker-ap-northeast-3a")) | .status.conditions'
- lastTransitionTime: '2021-05-07T05:59:03Z'
  status: 'True'
  type: InstanceExists
...

[miyadav@miyadav ~]$ oc get machines
NAME                                            PHASE     TYPE        REGION           ZONE              AGE
miyadav-07-dl4cf-master-0                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3a   78m
miyadav-07-dl4cf-master-1                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3b   78m
miyadav-07-dl4cf-master-2                       Running   m5.xlarge   ap-northeast-3   ap-northeast-3c   78m
miyadav-07-dl4cf-worker-ap-northeast-3a-kmzmp   Running   m5.large    ap-northeast-3   ap-northeast-3a   69m
miyadav-07-dl4cf-worker-ap-northeast-3b-lglrd   Running   m5.large    ap-northeast-3   ap-northeast-3b   69m
miyadav-07-dl4cf-worker-ap-northeast-3c-q2tpf   Running   m5.large    ap-northeast-3   ap-northeast-3c   69m

[miyadav@miyadav ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-139-223.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-141-96.ap-northeast-3.compute.internal    Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-162-197.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-178-142.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
ip-10-0-206-140.ap-northeast-3.compute.internal   Ready    master   74m   v1.21.0-rc.0+291e731
ip-10-0-217-179.ap-northeast-3.compute.internal   Ready    worker   64m   v1.21.0-rc.0+291e731
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438