Bug 1906740 - [aws]Machine should be "Failed" when creating a machine with invalid region
Summary: [aws]Machine should be "Failed" when creating a machine with invalid region
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Alexander Demicev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-11 10:47 UTC by sunzhaohua
Modified: 2021-07-27 22:35 UTC
CC List: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:35:01 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-api-provider-aws pull 380 (closed): Bug 1906740: Ensure the region is valid when creating AWS client (last updated 2021-02-08 14:01:21 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:35:27 UTC)

Description sunzhaohua 2020-12-11 10:47:22 UTC
Description of problem:
Machine should be "Failed" when creating a machine with invalid region

How reproducible:
always

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-09-112139

Steps to Reproduce:
1. Create a machine with an invalid region, for example:
    spec:
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-0c0b5ea0a4ebe87d5
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: zhsunaws111-5486g-worker-profile
          instanceType: m5.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: us-east-2b
            region: us-east-2-invalid
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - zhsunaws111-5486g-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - zhsunaws111-5486g-private-us-east-2b
          tags:
          - name: kubernetes.io/cluster/zhsunaws111-5486g
            value: owned
          userDataSecret:
            name: worker-user-data
2. Check the machines and the machine-controller logs (example commands below).
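
For reference, the machine phases and the machine-controller logs shown below can typically be gathered with commands like the following (the deployment and container names assume a default openshift-machine-api installation):

$ oc get machine -n openshift-machine-api
$ oc logs deployment/machine-api-controllers -c machine-controller -n openshift-machine-api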

Actual results:
The machine's phase is never set; it does not transition to "Failed".
 
$ oc get machine
NAME                                        PHASE      TYPE        REGION      ZONE         AGE
zhsunaws111-5486g-master-0                  Running    m5.xlarge   us-east-2   us-east-2a   169m
zhsunaws111-5486g-master-1                  Running    m5.xlarge   us-east-2   us-east-2b   169m
zhsunaws111-5486g-master-2                  Running    m5.xlarge   us-east-2   us-east-2c   169m
zhsunaws111-5486g-worker-us-east-2a-2dpxm   Running    m5.large    us-east-2   us-east-2a   158m
zhsunaws111-5486g-worker-us-east-2b-cdbkw                                                   52m


I1211 09:31:57.486020       1 controller.go:170] zhsunaws111-5486g-worker-us-east-2b-cdbkw: reconciling Machine
I1211 09:31:57.486036       1 actuator.go:100] zhsunaws111-5486g-worker-us-east-2b-cdbkw: actuator checking if machine exists
E1211 09:31:57.774436       1 reconciler.go:236] zhsunaws111-5486g-worker-us-east-2b-cdbkw: error getting existing instances: RequestError: send request failed
caused by: Post "https://ec2.us-east-2-invalid.amazonaws.com/": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host
E1211 09:31:57.774471       1 controller.go:273] zhsunaws111-5486g-worker-us-east-2b-cdbkw: failed to check if machine exists: RequestError: send request failed
caused by: Post "https://ec2.us-east-2-invalid.amazonaws.com/": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host
E1211 09:31:57.774539       1 controller.go:237] controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2-invalid.amazonaws.com/\": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host" "controller"="machine_controller" "name"="zhsunaws111-5486g-worker-us-east-2b-cdbkw" "namespace"="openshift-machine-api" 

Expected results:
The machine phase is set to "Failed".

Additional info:
When a machine is created with an invalid availability zone, it does go into the Failed phase.

Comment 1 Joel Speed 2020-12-16 13:43:03 UTC
So I had a look into this and came up with a small change that we could introduce into the AWS actuator [1].

To make this work, we would need to change the machine controller to include a check after it calls `Exists` to determine whether the error indicates an invalid configuration or not.

We have historically not wanted to change the way `Exists` works such that invalid credentials would mark a Machine as Failed. If we did, there's a high chance we could leak instances if, for example, credentials became invalid.

My suggestion here was that we should instead reject the update (via webhooks) to prevent users from setting an invalid region in the first place.

[1]: https://github.com/openshift/cluster-api-provider-aws/commit/b13fa74126a84b246dbed7c9d0fd5418c23a8353
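
For illustration only, here is a minimal, self-contained sketch of the kind of controller-side check described above; the error type and helper names (invalidConfigError, isInvalidMachineConfiguration) are assumptions made for this sketch, not the machine-api-operator's actual API:

package main

import (
    "errors"
    "fmt"
)

// invalidConfigError is a hypothetical terminal error an actuator could
// return from Exists when the providerSpec itself is wrong (for example an
// unknown region), as opposed to a transient cloud or credentials failure.
type invalidConfigError struct{ reason string }

func (e *invalidConfigError) Error() string {
    return "invalid machine configuration: " + e.reason
}

// isInvalidMachineConfiguration reports whether err, or anything it wraps,
// is the terminal configuration error defined above.
func isInvalidMachineConfiguration(err error) bool {
    var ice *invalidConfigError
    return errors.As(err, &ice)
}

func main() {
    // Simulate the kind of error Exists might return for an unresolvable region.
    err := fmt.Errorf("failed to create aws client: %w",
        &invalidConfigError{reason: `region "us-east-2-invalid" not resolved`})

    if isInvalidMachineConfiguration(err) {
        fmt.Println("would set machine phase to Failed:", err)
    } else {
        fmt.Println("transient error, requeue:", err)
    }
}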

Comment 2 Michael Gugino 2020-12-16 13:49:20 UTC
We do this explicitly for GCP already: https://github.com/openshift/cluster-api-provider-gcp/blob/master/pkg/cloud/gcp/actuators/machine/reconciler.go#L327-L332

This is because GCP throws the same 404 for both /compute/good-project/my-instance and /compute/bad-project/my-instance.

Let's not involve webhooks here.

Comment 3 Joel Speed 2020-12-16 14:10:44 UTC
We don't check for invalid configuration after `Exists`, so the code you've linked doesn't actually do anything right now: https://github.com/openshift/machine-api-operator/blob/d9e48bb9fc0b1bc069279fb30db9a57c6096f9fa/pkg/controller/machine/controller.go#L272-L276

This is why we currently have a bug https://bugzilla.redhat.com/show_bug.cgi?id=1906742 for that feature as well.

Would this not be better handled in webhooks? It would be safer to prevent MachineSet/Machine specs from being updated to include invalid regions/zones than to fail machines when `Exists` fails, wouldn't it? I fear that if we ever fail a machine after `Exists`, we risk leaking machines.

Comment 4 Michael Gugino 2020-12-16 14:31:42 UTC
We check it right here during the deletion flow: https://github.com/openshift/machine-api-operator/blob/d9e48bb9fc0b1bc069279fb30db9a57c6096f9fa/pkg/controller/machine/controller.go#L225-L231

IMO, the proper thing to do is to see what error AWS is throwing us and go failed if it's something that is equivalent to 'invalid configuration.'  I don't want to try to validate every bit of information a user passes us that might contain a typo.  Other fields that would have this same problem are:

instance type
subnet id
security group
placement
ami
block devices

Literally the entire providerSpec.

All in all, I'm not really interested in trying to account for any of this, including going failed. When you create a new MachineSet, scale it up to at least 1 and make sure it works. The UX we need is to ensure that whatever error AWS gives us makes it into the status field of the machine object, so nobody needs to look at the logs to deduce what's going on.
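
For illustration, one way such a classification could look, assuming aws-sdk-go v1's awserr package; the set of error codes treated as "invalid configuration" here is an assumption, not an exhaustive or authoritative list:

package main

import (
    "errors"
    "fmt"

    "github.com/aws/aws-sdk-go/aws/awserr"
)

// configErrorCodes is an assumed, non-exhaustive set of EC2 error codes that
// point at a bad providerSpec rather than a transient failure.
var configErrorCodes = map[string]bool{
    "InvalidAMIID.NotFound":    true,
    "InvalidSubnetID.NotFound": true,
    "InvalidGroup.NotFound":    true,
    "InvalidParameterValue":    true,
}

// isConfigError reports whether err carries an AWS error code that we would
// treat as equivalent to "invalid configuration".
func isConfigError(err error) bool {
    var aerr awserr.Error
    if errors.As(err, &aerr) {
        return configErrorCodes[aerr.Code()]
    }
    return false
}

func main() {
    err := awserr.New("InvalidAMIID.NotFound", "the image id does not exist", nil)
    fmt.Println(isConfigError(err)) // true
}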

Comment 5 Joel Speed 2020-12-16 14:51:52 UTC
I agree that we don't want to try to validate every single thing in the providerSpec; that doesn't scale well and won't be maintainable.

However, most of the things in your list return an invalid-configuration error when we call create anyway, so we don't need to validate those (i.e. AWS RunInstances just returns a 400 because we gave it bad parameters).

The region, however, is a little more complex and needs its own special treatment to prevent the controller from continuously cycling while trying to resolve an endpoint that doesn't exist.

> IMO, the proper thing to do is to see what error AWS is throwing us and go failed if it's something that is equivalent to 'invalid configuration.'

I'm going to raise a PR that changes the current behaviour slightly: it checks whether the endpoint is known to the SDK's resolver, rather than relying on the current DNS lookup error. This should make it more obvious that the region is invalid, and then we can continue the discussion about whether this should make the machine go failed or not.
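
For reference, a minimal sketch of the kind of region check being described, assuming aws-sdk-go v1's endpoints resolver with strict matching (the helper name validateRegion is illustrative, not the code in the PR):

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws/endpoints"
    "github.com/aws/aws-sdk-go/service/ec2"
)

// validateRegion asks the SDK's endpoint resolver, with strict matching,
// whether it knows an EC2 endpoint for the given region, instead of letting
// a later API call fail on a DNS lookup.
func validateRegion(region string) error {
    _, err := endpoints.DefaultResolver().EndpointFor(
        ec2.ServiceName, region, endpoints.StrictMatchingOption)
    if err != nil {
        // With strict matching this is an endpoints.UnknownEndpointError for
        // regions the SDK has never heard of.
        return fmt.Errorf("region %q not resolved: %v", region, err)
    }
    return nil
}

func main() {
    fmt.Println(validateRegion("us-east-2"))         // <nil>
    fmt.Println(validateRegion("us-east-2-invalid")) // region "us-east-2-invalid" not resolved: ...
}

With strict matching the resolver refuses to guess an endpoint for an unknown region, which matches the UnknownEndpointError seen in the later logs.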

Comment 6 Joel Speed 2020-12-16 17:38:47 UTC
@zhsun Could you validate that the change that has been merged improves the error messages coming out in the logs? It won't solve the issue of going failed yet, but I would like to see this change verified before we make any more changes, just to be sure it isn't going to break other things.

Comment 9 sunzhaohua 2020-12-18 05:32:05 UTC
@Joel Speed The change improves the error messages coming out in the logs.

clusterversion: 4.7.0-0.nightly-2020-12-17-201522

I1218 05:24:46.825618       1 controller.go:171] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: reconciling Machine
I1218 05:24:46.825646       1 actuator.go:100] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: actuator checking if machine exists
E1218 05:24:46.826578       1 controller.go:274] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to check if machine exists: zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region "us-east-2-invalid" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "us-east-2-invalid"
E1218 05:24:46.826635       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region \"us-east-2-invalid\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"us-east-2-invalid\"" "controller"="machine_controller" "name"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx" "namespace"="openshift-machine-api"

Comment 10 Joel Speed 2020-12-18 10:02:46 UTC
Thanks @zhsun. I think we could probably reuse the same client constructor (or something similar) to raise the same errors in the webhooks, highlighting this to users earlier. I will take a look at this in the new year if no one on the team picks it up before then.

Comment 12 Joel Speed 2021-02-08 14:10:52 UTC
I will try to come back to this during this cycle and surface the warning earlier in the flow.

Comment 13 Joel Speed 2021-03-24 15:26:41 UTC
To whoever picks this up: if you check the attached PR, I updated the client to check that the region is valid; this is done as part of `NewValidatedClient`. In the webhooks we already check that the client credentials exist; we could use this client builder to check that the credentials are valid and add a warning if we think they aren't.
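
As a rough illustration of the credentials idea only (not the webhook's actual code), a webhook-side helper could attempt the cheapest authenticated AWS call and turn a failure into a warning; the checkCredentials helper and the warning text are assumptions:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sts"
)

// checkCredentials is a hypothetical helper a webhook could call to verify
// that the referenced credentials actually work, using the cheapest
// authenticated call available (STS GetCallerIdentity). The caller would
// surface a non-nil error as a warning rather than rejecting the Machine.
func checkCredentials(sess *session.Session) error {
    _, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
    return err
}

func main() {
    sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-2")})
    if err != nil {
        fmt.Println("warning: could not build an AWS session:", err)
        return
    }
    if err := checkCredentials(sess); err != nil {
        fmt.Println("warning: AWS credentials do not appear to be valid:", err)
    }
}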

Comment 14 Michael Gugino 2021-03-24 17:42:29 UTC
I think we should close these types of issues as WONTFIX. This is similar to creating a pod spec with an invalid image.

At best, we can set failed, but we need to ensure that the failure is permanent and the API will never give us this class of error if the region does exist.

Comment 15 Joel Speed 2021-03-25 15:50:58 UTC
We had some discussion out of band; we've now improved the log messages to the point where it's obvious that the region is invalid.
We decided it's not worth verifying the client in the webhook as it doesn't add that much value for the complexity it introduces.

@zhsun Could we check that the log message is working and mark this verified? This is the only fix we feel is appropriate for this issue right now.

Comment 16 sunzhaohua 2021-03-29 06:49:44 UTC
Verified.
clusterversion: 4.8.0-0.nightly-2021-03-29-000904

I1218 05:24:46.825618       1 controller.go:171] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: reconciling Machine
I1218 05:24:46.825646       1 actuator.go:100] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: actuator checking if machine exists
E1218 05:24:46.826578       1 controller.go:274] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to check if machine exists: zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region "us-east-2-invalid" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "us-east-2-invalid"
E1218 05:24:46.826635       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region \"us-east-2-invalid\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"us-east-2-invalid\"" "controller"="machine_controller" "name"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx" "namespace"="openshift-machine-api"

Comment 19 errata-xmlrpc 2021-07-27 22:35:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

