Bug 2060068 - machine-api-provider-aws creates EC2 instances with the default security group when no matching security group is found
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Mike Fedosin
QA Contact: sunzhaohua
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-02 16:02 UTC by Michael Shen
Modified: 2023-01-17 19:48 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Machine API provider for AWS did not verify that the security group defined in the machine specification exists. Instead of returning an error in this case, it used a default security group, which should not be used for {product-title} machines, and successfully created a machine without informing the user that the default group was used. With this release, the Machine API returns an error when users set either incorrect or empty security group names in the machine specification. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2060068[*BZ#2060068*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:48 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-actuator-pkg pull 232 | 0 | None | open | Bug 2060068: add security groups to the minimal AWS prover spec | 2022-06-09 13:00:07 UTC
Github | openshift machine-api-provider-aws pull 41 | 0 | None | open | Bug 2060068: return error when no security group found | 2022-06-02 14:16:35 UTC
Github | openshift machine-api-provider-aws pull 44 | 0 | None | open | Bug 2060068: check securityGroupIDs for emptiness | 2022-06-28 13:45:25 UTC
Red Hat Product Errata | RHSA-2022:7399 | 0 | None | None | None | 2023-01-17 19:48:02 UTC

Description Michael Shen 2022-03-02 16:02:36 UTC
Description of problem:
The bug is in the AWS machine-api-provider (https://github.com/openshift/machine-api-provider-aws). It occurred on a cluster running OpenShift 4.8.29, but should also be relevant to more recent versions. In OSD/ROSA, machinesets are created with .spec.template.spec.providerSpec.value containing a security group filter like this:

```
securityGroups:
- filters:
  - name: tag:Name
    values:
    - clustername-worker-sg
```

When no security group matches that filter, the Machine API controller does not mark this as a failure. As a result, the provisioned EC2 instance uses the default security group (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/default-custom-security-groups.html#default-security-group), which has insufficient security group rules for a healthy cluster.

How reproducible:
100%

Steps to Reproduce:
1. Have an OSD/ROSA CCS cluster (CCS is important to reproduce step 2)
2. Modify the worker security group tag to anything different
3. Have the machine-api controller provision a new machine by any means

Actual results:
A new machine/node will be provisioned and join the cluster, but will experience networking issues. In the AWS console the EC2 instance backing that machine/node will have the default security group for its VPC assigned to it.

Expected results:
The machine-api-controller marks the machine provisioning as failed when the desired security group is not found instead of continuing on.

Additional info:
* Info message printed to logs when no matching security groups are found: https://github.com/openshift/machine-api-provider-aws/blob/70df0a59c894c4970ad6352885a904ed3c478d0c/pkg/actuators/machine/instances.go#L98-L100
* Since there's no error, an empty list of security groups is passed along: https://github.com/openshift/machine-api-provider-aws/blob/70df0a59c894c4970ad6352885a904ed3c478d0c/pkg/actuators/machine/instances.go#L294-L312
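
To illustrate the expected behavior, here is a minimal Go sketch of a guard that refuses to build a launch request when no security group IDs were resolved. It is not the provider's actual code: `LaunchRequest`, `ErrNoSecurityGroups`, and `buildLaunchRequest` are hypothetical names used only for illustration, and the real provider works with AWS SDK types instead.

```go
package main

import (
	"errors"
	"fmt"
)

// LaunchRequest is a hypothetical stand-in for the parameters that would be
// sent to the cloud API when creating an instance.
type LaunchRequest struct {
	MachineName      string
	SecurityGroupIDs []string
}

// ErrNoSecurityGroups is a hypothetical sentinel error for an empty result.
var ErrNoSecurityGroups = errors.New("no security groups resolved")

// buildLaunchRequest returns an error instead of a request when the resolved
// security group list is empty or contains an empty ID, so the caller can
// mark provisioning as failed rather than fall back to the default group.
func buildLaunchRequest(machineName string, sgIDs []string) (*LaunchRequest, error) {
	if len(sgIDs) == 0 {
		return nil, fmt.Errorf("machine %q: %w", machineName, ErrNoSecurityGroups)
	}
	for i, id := range sgIDs {
		if id == "" {
			return nil, fmt.Errorf("machine %q: security group ID at index %d is empty", machineName, i)
		}
	}
	return &LaunchRequest{MachineName: machineName, SecurityGroupIDs: sgIDs}, nil
}

func main() {
	// An empty list models the case where the tag filter matched nothing.
	if _, err := buildLaunchRequest("worker-a", nil); err != nil {
		fmt.Println("provisioning should fail:", err)
	}
}
```

Returning a wrapped sentinel error lets a caller distinguish this case with `errors.Is` and surface it on the machine status.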

Comment 1 Joel Speed 2022-05-05 13:47:19 UTC
Looks like the classic AWSism of returning the entire list of resources when the filter matches none. We take the first item from the list (the default in this case) so we need to make sure we guard against this and make sure that the item we are taking does actually match the filter.
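
A rough Go sketch of the guard described above: each returned item is checked against the filter before it is used, rather than trusting the first element of the response. The `Filter` and `SecurityGroup` types and both functions are hypothetical simplifications, not the provider's or the AWS SDK's types, and only `tag:<key>` filters are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Filter mirrors the name/values pair from the providerSpec (hypothetical type).
type Filter struct {
	Name   string
	Values []string
}

// SecurityGroup is a hypothetical, simplified view of a returned group.
type SecurityGroup struct {
	ID   string
	Name string
	Tags map[string]string
}

// matchesFilter reports whether g satisfies a "tag:<key>" filter.
// Filters that are not tag filters never match in this sketch.
func matchesFilter(g SecurityGroup, f Filter) bool {
	if !strings.HasPrefix(f.Name, "tag:") {
		return false
	}
	got, ok := g.Tags[strings.TrimPrefix(f.Name, "tag:")]
	if !ok {
		return false
	}
	for _, v := range f.Values {
		if v == got {
			return true
		}
	}
	return false
}

// firstVerified returns the first group that really matches f, or an error
// when none of the returned groups does.
func firstVerified(groups []SecurityGroup, f Filter) (SecurityGroup, error) {
	for _, g := range groups {
		if matchesFilter(g, f) {
			return g, nil
		}
	}
	return SecurityGroup{}, fmt.Errorf("no security group matches filter %s=%v", f.Name, f.Values)
}

func main() {
	// Models a response that contains only the VPC default group.
	groups := []SecurityGroup{{ID: "sg-default", Name: "default"}}
	f := Filter{Name: "tag:Name", Values: []string{"clustername-worker-sg"}}
	if _, err := firstVerified(groups, f); err != nil {
		fmt.Println("refusing to use an unverified group:", err)
	}
}
```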

Mike is going to look into this next week.

Comment 4 sunzhaohua 2022-06-22 10:51:58 UTC
Verify failed
clusterversion:4.11.0-0.nightly-2022-06-22-015220
Created a new machineset with an invalid security group. The machine is created successfully and joins the cluster; in the AWS console, the instance uses the default security group.
 
$ oc edit machine oadp-12470-92zh8-worker-us-east-2cc-r6pch
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - oadp-12470-92zh8-worker-sg-invalid

$ oc get machine                                                                                                               
NAME                                        PHASE     TYPE         REGION      ZONE         AGE
oadp-12470-92zh8-worker-us-east-2cc-r6pch   Running   m5.xlarge    us-east-2   us-east-2c   23m

I0622 10:26:41.589784       1 instances.go:78] Describing security groups based on filters
I0622 10:26:41.917203       1 instances.go:129] Describing subnets based on filters
I0622 10:26:43.316756       1 reconciler.go:109] Created Machine oadp-12470-92zh8-worker-us-east-2cc-r6pch
I0622 10:26:43.316771       1 machine_scope.go:167] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Updating status
I0622 10:26:43.445181       1 machine_scope.go:193] oadp-12470-92zh8-worker-us-east-2cc-r6pch: finished calculating AWS status
I0622 10:26:43.445212       1 machine_scope.go:90] oadp-12470-92zh8-worker-us-east-2cc-r6pch: patching machine
I0622 10:26:43.445274       1 logr.go:252] events "msg"="Normal"  "message"="Created Machine oadp-12470-92zh8-worker-us-east-2cc-r6pch" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"oadp-12470-92zh8-worker-us-east-2cc-r6pch","uid":"ecee8ccb-8165-4389-95ce-87588ba43f45","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"59264"} "reason"="Create"
I0622 10:26:43.456313       1 controller.go:384] oadp-12470-92zh8-worker-us-east-2cc-r6pch: created instance, requeuing
I0622 10:26:43.456364       1 controller.go:179] oadp-12470-92zh8-worker-us-east-2cc-r6pch: reconciling Machine
I0622 10:26:43.456377       1 actuator.go:107] oadp-12470-92zh8-worker-us-east-2cc-r6pch: actuator checking if machine exists
I0622 10:26:43.531451       1 reconciler.go:479] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Found instance by id: i-00c292eaad9b9fa73
I0622 10:26:43.531471       1 controller.go:305] oadp-12470-92zh8-worker-us-east-2cc-r6pch: reconciling machine triggers idempotent update
I0622 10:26:43.531475       1 actuator.go:124] oadp-12470-92zh8-worker-us-east-2cc-r6pch: actuator updating machine
I0622 10:26:43.531855       1 reconciler.go:176] oadp-12470-92zh8-worker-us-east-2cc-r6pch: updating machine
I0622 10:26:43.571992       1 reconciler.go:479] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Found instance by id: i-00c292eaad9b9fa73
I0622 10:26:43.572030       1 reconciler.go:407] oadp-12470-92zh8-worker-us-east-2cc-r6pch: ProviderID set at machine spec: aws:///us-east-2c/i-00c292eaad9b9fa73

Comment 7 sunzhaohua 2022-07-18 16:21:15 UTC
Verified
clusterversion: 4.12.0-0.nightly-2022-07-17-215842

$ oc edit machineset huliu-aws412-945hh-worker-us-east-2c                      
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - huliu-aws412-945hh-worker-sg-invalid
machineset.machine.openshift.io/huliu-aws412-945hh-worker-us-east-2c edited

$ oc delete machine huliu-aws412-945hh-worker-us-east-2c-t98h4                  
machine.machine.openshift.io "huliu-aws412-945hh-worker-us-east-2c-t98h4" deleted

$ oc get machine                                                                                  
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws412-945hh-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   106m
huliu-aws412-945hh-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   106m
huliu-aws412-945hh-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   106m
huliu-aws412-945hh-worker-us-east-2a-4gwxb   Running   m6i.xlarge   us-east-2   us-east-2a   13m
huliu-aws412-945hh-worker-us-east-2a-cndjb   Running   m6i.xlarge   us-east-2   us-east-2a   13m
huliu-aws412-945hh-worker-us-east-2b-t2rvp   Running   m6i.xlarge   us-east-2   us-east-2b   12m
huliu-aws412-945hh-worker-us-east-2c-tw4tz   Failed                                          2m11s
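
The check above can be scripted. The following Go sketch scans `oc get machine` table output and returns the names of machines whose PHASE column reads `Failed`. The function name `failedMachines` is hypothetical, and the parsing assumes every row has a non-empty PHASE as its second whitespace-separated column.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// failedMachines returns the names of machines whose second column is "Failed".
// Assumption: NAME is the first column and PHASE the second; a row with an
// empty PHASE would be misread, so treat this as a sketch only.
func failedMachines(output string) []string {
	var failed []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] == "NAME" {
			continue // skip blank lines and the header row
		}
		if fields[1] == "Failed" {
			failed = append(failed, fields[0])
		}
	}
	return failed
}

func main() {
	// Two rows taken from the verification output in this comment.
	out := `NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws412-945hh-worker-us-east-2b-t2rvp   Running   m6i.xlarge   us-east-2   us-east-2b   12m
huliu-aws412-945hh-worker-us-east-2c-tw4tz   Failed                                          2m11s
`
	fmt.Println(failedMachines(out))
}
```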

Comment 10 errata-xmlrpc 2023-01-17 19:47:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399