Description of problem:
Machine should be "Failed" when creating a machine with an invalid region.

How reproducible:
Always

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-09-112139

Steps to Reproduce:
1. Create a machine with an invalid region in the spec:

spec:
  metadata: {}
  providerSpec:
    value:
      ami:
        id: ami-0c0b5ea0a4ebe87d5
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          encrypted: true
          iops: 0
          kmsKey:
            arn: ""
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: zhsunaws111-5486g-worker-profile
      instanceType: m5.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-2b
        region: us-east-2-invalid
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - zhsunaws111-5486g-worker-sg
      subnet:
        filters:
        - name: tag:Name
          values:
          - zhsunaws111-5486g-private-us-east-2b
      tags:
      - name: kubernetes.io/cluster/zhsunaws111-5486g
        value: owned
      userDataSecret:
        name: worker-user-data

2. Check machines and logs.

Actual results:
Machine has no status.
$ oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
zhsunaws111-5486g-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   169m
zhsunaws111-5486g-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   169m
zhsunaws111-5486g-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   169m
zhsunaws111-5486g-worker-us-east-2a-2dpxm   Running   m5.large    us-east-2   us-east-2a   158m
zhsunaws111-5486g-worker-us-east-2b-cdbkw                                                  52m

I1211 09:31:57.486020       1 controller.go:170] zhsunaws111-5486g-worker-us-east-2b-cdbkw: reconciling Machine
I1211 09:31:57.486036       1 actuator.go:100] zhsunaws111-5486g-worker-us-east-2b-cdbkw: actuator checking if machine exists
E1211 09:31:57.774436       1 reconciler.go:236] zhsunaws111-5486g-worker-us-east-2b-cdbkw: error getting existing instances: RequestError: send request failed
caused by: Post "https://ec2.us-east-2-invalid.amazonaws.com/": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host
E1211 09:31:57.774471       1 controller.go:273] zhsunaws111-5486g-worker-us-east-2b-cdbkw: failed to check if machine exists: RequestError: send request failed
caused by: Post "https://ec2.us-east-2-invalid.amazonaws.com/": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host
E1211 09:31:57.774539       1 controller.go:237] controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2-invalid.amazonaws.com/\": dial tcp: lookup ec2.us-east-2-invalid.amazonaws.com on 172.30.0.10:53: no such host" "controller"="machine_controller" "name"="zhsunaws111-5486g-worker-us-east-2b-cdbkw" "namespace"="openshift-machine-api"

Expected results:
The machine phase is set to "Failed".

Additional info:
When creating a machine with an invalid zone, the machine goes to the Failed phase.
So I had a look into this and came up with a small change that we could introduce into the AWS actuator [1]. To make this work, we would need to change the machine controller to include a check after it calls `Exists` to determine whether the error is an invalid-configuration error or not.

We have historically not wanted to change the way `Exists` works such that a Machine would go Failed if, for example, credentials became invalid: if we did that, there's a high chance we could leak instances.

My suggestion here is that we should instead raise an error on update (via webhooks) to prevent users from setting an invalid region in the first place.

[1]: https://github.com/openshift/cluster-api-provider-aws/commit/b13fa74126a84b246dbed7c9d0fd5418c23a8353
We do this explicitly for GCP already: https://github.com/openshift/cluster-api-provider-gcp/blob/master/pkg/cloud/gcp/actuators/machine/reconciler.go#L327-L332 This is because GCP will throw the same 404 for both /compute/good-project/my-instance and /compute/bad-project/my-instance. Let's not involve webhooks here.
We don't check for the invalid configuration on exists, so the code you've linked doesn't actually do anything right now: https://github.com/openshift/machine-api-operator/blob/d9e48bb9fc0b1bc069279fb30db9a57c6096f9fa/pkg/controller/machine/controller.go#L272-L276 This is why we currently have a bug for that feature as well: https://bugzilla.redhat.com/show_bug.cgi?id=1906742

Would this not be better handled in webhooks? It would be safer to prevent MachineSet/Machine specs from being updated to include invalid regions/zones than to fail machines when exists errors, wouldn't it? I fear that if we ever fail a machine after `Exists`, we risk leaking machines.
We check it right here during the deletion flow: https://github.com/openshift/machine-api-operator/blob/d9e48bb9fc0b1bc069279fb30db9a57c6096f9fa/pkg/controller/machine/controller.go#L225-L231

IMO, the proper thing to do is to look at what error AWS is throwing us and go Failed if it's something equivalent to 'invalid configuration'. I don't want to try to validate every bit of information a user passes us that might contain a typo. Other fields that would have this same problem are:

- instance type
- subnet id
- security group
- placement
- ami
- block devices

Literally the entire providerSpec. All in all, I'm not really interested in trying to account for any of this, including going Failed. When you create a new MachineSet, scale it up to at least 1 and make sure it works. The UX we need is to ensure that whatever error AWS is giving us makes it into the status field of the Machine object, so nobody needs to look at the logs to deduce what's going on.
I agree that we don't want to try to validate every single thing in the providerSpec; that doesn't scale well and won't be maintainable. However, most of the things in your list of fields you don't want to check return an invalid-configuration error when we call create anyway, so we don't need to validate those (i.e. AWS RunInstances just returns a 400 because we gave it bad params). The region, however, is a little more complex and needs its own special treatment to prevent continuously cycling on trying to resolve an endpoint that doesn't exist.

> IMO, the proper thing to do is to see what error AWS is throwing us and go failed if it's something that is equivalent to 'invalid configuration.'

I'm going to raise a PR that changes the behaviour of what we are doing slightly. It checks whether the endpoint is known to the client, rather than the current behaviour of just surfacing a DNS lookup error. This should make it more obvious that the region is invalid, and then we can continue the discussion about whether this should make the machine go Failed or not.
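As a rough illustration of the endpoint check described above (not the actual PR): the idea is to refuse to construct an EC2 client for a region that isn't in the SDK's known endpoint data, instead of letting DNS lookups against ec2.<region>.amazonaws.com fail forever. The `knownRegions` map and `resolveEC2Endpoint` function here are hypothetical stand-ins for the AWS SDK's partition/endpoint metadata and resolver:

```go
package main

import "fmt"

// knownRegions is a hypothetical stand-in for the AWS SDK's endpoint
// partition data; the real change consults the SDK's endpoint resolver
// with strict matching instead of a hand-written map.
var knownRegions = map[string]bool{
	"us-east-1": true,
	"us-east-2": true,
	"eu-west-1": true,
}

// resolveEC2Endpoint refuses to build a client for a region the SDK does
// not know about, surfacing the typo immediately with a clear error
// rather than a repeated "no such host" DNS failure.
func resolveEC2Endpoint(region string) (string, error) {
	if !knownRegions[region] {
		return "", fmt.Errorf("region %q not resolved: could not resolve endpoint", region)
	}
	return fmt.Sprintf("https://ec2.%s.amazonaws.com", region), nil
}

func main() {
	if _, err := resolveEC2Endpoint("us-east-2-invalid"); err != nil {
		fmt.Println("error:", err)
	}
	ep, _ := resolveEC2Endpoint("us-east-2")
	fmt.Println(ep)
}
```

The benefit is that the failure moves from "dial tcp: lookup ... no such host" (which looks like a networking problem) to an explicit unknown-region error at client-construction time, which matches the improved log output verified later in this bug.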
@zhsun Could you validate that the change that has been merged improves the error messages coming out in the logs? It won't solve the issue of going Failed yet, but I would like to see this change verified before we make any more changes, just to be sure it isn't going to break other things.
@Joel Speed The change improves the error messages coming out in the logs.

clusterversion: 4.7.0-0.nightly-2020-12-17-201522

I1218 05:24:46.825618       1 controller.go:171] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: reconciling Machine
I1218 05:24:46.825646       1 actuator.go:100] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: actuator checking if machine exists
E1218 05:24:46.826578       1 controller.go:274] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to check if machine exists: zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region "us-east-2-invalid" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "us-east-2-invalid"
E1218 05:24:46.826635       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region \"us-east-2-invalid\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"us-east-2-invalid\"" "controller"="machine_controller" "name"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx" "namespace"="openshift-machine-api"
Thanks @zhsun. I think we could probably reuse the same client constructor (or something similar) to give the same errors in the webhooks, highlighting this to users earlier. I will take a look at this in the new year if no one on the team picks it up before then.
Will try to come back to this and surface the warning earlier in the flow during this cycle.
To whoever picks this up: if you check the attached PR, I updated the client to perform a check that the region is valid; this is done as part of `NewValidatedClient`. In the webhooks we check that the client credentials exist; we could use this client constructor to check that the credentials are valid, and add a warning if we think that they aren't.
I think we should close these types of issues as WONTFIX. This is similar to creating a PodSpec with an invalid image. At best we can set Failed, but then we need to be sure that the failure is permanent and that the API will never give us this class of error when the region does exist.
We had some discussion out of band; we've improved the log messages now to a point where it's obvious that the region is invalid. We decided it's not worth verifying the client in the webhook, as it doesn't add much value for the complexity it introduces. @zhsun Could we verify that the log message is working and mark this verified? This is the only fix we feel is appropriate for this issue right now.
Verified.

clusterversion: 4.8.0-0.nightly-2021-03-29-000904

I1218 05:24:46.825618       1 controller.go:171] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: reconciling Machine
I1218 05:24:46.825646       1 actuator.go:100] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: actuator checking if machine exists
E1218 05:24:46.826578       1 controller.go:274] zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to check if machine exists: zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region "us-east-2-invalid" not resolved: UnknownEndpointError: could not resolve endpoint
	partition: "all partitions", service: "ec2", region: "us-east-2-invalid"
E1218 05:24:46.826635       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx: failed to create scope for machine: failed to create aws client: region \"us-east-2-invalid\" not resolved: UnknownEndpointError: could not resolve endpoint\n\tpartition: \"all partitions\", service: \"ec2\", region: \"us-east-2-invalid\"" "controller"="machine_controller" "name"="zhsun18aws-lrb8x-worker-us-east-2c-2bnrx" "namespace"="openshift-machine-api"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438