Description of problem:

The bug relates to the AWS machine-api-provider (https://github.com/openshift/machine-api-provider-aws) and occurred with a cluster on OpenShift 4.8.29, but should be relevant to more recent versions.

In OSD/ROSA, machinesets get created with .spec.template.spec.providerSpec.value containing a security group filter like this:

```
securityGroups:
- filters:
  - name: tag:Name
    values:
    - clustername-worker-sg
```

When no matches are found for that security group, the Machine API controller does not mark this as a failure; as a result, the provisioned EC2 instance uses the default security group (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/default-custom-security-groups.html#default-security-group), which has insufficient security group rules for a healthy cluster.

How reproducible:
100%

Steps to Reproduce:
1. Have an OSD/ROSA CCS cluster (CCS is important to reproduce step 2)
2. Modify the worker security group tag to anything different
3. Have the machine-api controller provision a new machine by any means

Actual results:
A new machine/node will be provisioned and join the cluster, but will experience networking issues. In the AWS console, the EC2 instance backing that machine/node will have the default security group for its VPC assigned to it.

Expected results:
The machine-api-controller marks the machine provisioning as failed when the desired security group is not found, instead of continuing on.

Additional info:
* Info message printed to logs when no matching security groups are found: https://github.com/openshift/machine-api-provider-aws/blob/70df0a59c894c4970ad6352885a904ed3c478d0c/pkg/actuators/machine/instances.go#L98-L100
* Since there's no error, an empty list of security groups is passed along: https://github.com/openshift/machine-api-provider-aws/blob/70df0a59c894c4970ad6352885a904ed3c478d0c/pkg/actuators/machine/instances.go#L294-L312
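For illustration only, here is a minimal sketch of the guard that the expected results call for, shaped after the lookup linked in the additional info. The function name, signature, and plain fmt.Errorf are assumptions for this sketch, not the shipped patch; the real fix may instead return the provider's machine-configuration error type so the machine is marked Failed:

```
// Minimal sketch, assuming hypothetical names: fail the lookup when security
// group filters were supplied but matched nothing, so the Machine ends up
// Failed instead of launching with the VPC default security group.
package sgcheck

import (
	"fmt"

	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// getSecurityGroupIDs loosely mirrors the lookup linked above; the name and
// error handling here are assumptions for this sketch.
func getSecurityGroupIDs(client ec2iface.EC2API, filters []*ec2.Filter) ([]*string, error) {
	out, err := client.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{Filters: filters})
	if err != nil {
		return nil, fmt.Errorf("error describing security groups: %w", err)
	}
	// The guard: previously an empty result was only logged at info level and
	// the empty slice flowed on toward the RunInstances call.
	if len(filters) > 0 && len(out.SecurityGroups) == 0 {
		return nil, fmt.Errorf("no security groups found matching filters %v", filters)
	}
	ids := make([]*string, 0, len(out.SecurityGroups))
	for _, sg := range out.SecurityGroups {
		ids = append(ids, sg.GroupId)
	}
	return ids, nil
}
```

With a guard like this, machine creation surfaces an error instead of silently selecting the default group, which matches the Failed phase seen in the 4.12 verification below.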
Looks like the classic AWSism of returning the entire list of resources when the filter matches none. We take the first item from that list (the default security group, in this case), so we need to guard against this and verify that the item we take actually matches the filter. Mike is going to look into this next week.
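To make that concrete, a check along those lines might look like the following. This is a hypothetical helper, not the actual change; only the tag:Name convention comes from the report above:

```
// Hypothetical helper sketching the check proposed above: before taking a
// group out of a DescribeSecurityGroups result, confirm it really carries
// the tag:Name value the filter asked for. The "Name" key and the helper's
// name are assumptions based on the filter shown in the report.
package sgcheck

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func matchesNameTag(sg *ec2.SecurityGroup, want string) bool {
	for _, tag := range sg.Tags {
		if aws.StringValue(tag.Key) == "Name" && aws.StringValue(tag.Value) == want {
			return true
		}
	}
	return false
}
```

Only a group that passes such a check would be taken from the result list, so an unfiltered or empty API response could no longer silently select the VPC default group.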
Verification failed on clusterversion 4.11.0-0.nightly-2022-06-22-015220.

Created a new machineset with an invalid security group: the machine is still created successfully and joins the cluster, and in the AWS console the instance uses the default security group.

$ oc edit machine oadp-12470-92zh8-worker-us-east-2cc-r6pch
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - oadp-12470-92zh8-worker-sg-invalid

$ oc get machine
NAME                                        PHASE     TYPE        REGION      ZONE         AGE
oadp-12470-92zh8-worker-us-east-2cc-r6pch   Running   m5.xlarge   us-east-2   us-east-2c   23m

Machine controller logs:

I0622 10:26:41.589784       1 instances.go:78] Describing security groups based on filters
I0622 10:26:41.917203       1 instances.go:129] Describing subnets based on filters
I0622 10:26:43.316756       1 reconciler.go:109] Created Machine oadp-12470-92zh8-worker-us-east-2cc-r6pch
I0622 10:26:43.316771       1 machine_scope.go:167] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Updating status
I0622 10:26:43.445181       1 machine_scope.go:193] oadp-12470-92zh8-worker-us-east-2cc-r6pch: finished calculating AWS status
I0622 10:26:43.445212       1 machine_scope.go:90] oadp-12470-92zh8-worker-us-east-2cc-r6pch: patching machine
I0622 10:26:43.445274       1 logr.go:252] events "msg"="Normal" "message"="Created Machine oadp-12470-92zh8-worker-us-east-2cc-r6pch" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"oadp-12470-92zh8-worker-us-east-2cc-r6pch","uid":"ecee8ccb-8165-4389-95ce-87588ba43f45","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"59264"} "reason"="Create"
I0622 10:26:43.456313       1 controller.go:384] oadp-12470-92zh8-worker-us-east-2cc-r6pch: created instance, requeuing
I0622 10:26:43.456364       1 controller.go:179] oadp-12470-92zh8-worker-us-east-2cc-r6pch: reconciling Machine
I0622 10:26:43.456377       1 actuator.go:107] oadp-12470-92zh8-worker-us-east-2cc-r6pch: actuator checking if machine exists
I0622 10:26:43.531451       1 reconciler.go:479] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Found instance by id: i-00c292eaad9b9fa73
I0622 10:26:43.531471       1 controller.go:305] oadp-12470-92zh8-worker-us-east-2cc-r6pch: reconciling machine triggers idempotent update
I0622 10:26:43.531475       1 actuator.go:124] oadp-12470-92zh8-worker-us-east-2cc-r6pch: actuator updating machine
I0622 10:26:43.531855       1 reconciler.go:176] oadp-12470-92zh8-worker-us-east-2cc-r6pch: updating machine
I0622 10:26:43.571992       1 reconciler.go:479] oadp-12470-92zh8-worker-us-east-2cc-r6pch: Found instance by id: i-00c292eaad9b9fa73
I0622 10:26:43.572030       1 reconciler.go:407] oadp-12470-92zh8-worker-us-east-2cc-r6pch: ProviderID set at machine spec: aws:///us-east-2c/i-00c292eaad9b9fa73
Verified on clusterversion 4.12.0-0.nightly-2022-07-17-215842.

$ oc edit machineset huliu-aws412-945hh-worker-us-east-2c
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - huliu-aws412-945hh-worker-sg-invalid
machineset.machine.openshift.io/huliu-aws412-945hh-worker-us-east-2c edited

$ oc delete machine huliu-aws412-945hh-worker-us-east-2c-t98h4
machine.machine.openshift.io "huliu-aws412-945hh-worker-us-east-2c-t98h4" deleted

$ oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws412-945hh-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   106m
huliu-aws412-945hh-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   106m
huliu-aws412-945hh-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   106m
huliu-aws412-945hh-worker-us-east-2a-4gwxb   Running   m6i.xlarge   us-east-2   us-east-2a   13m
huliu-aws412-945hh-worker-us-east-2a-cndjb   Running   m6i.xlarge   us-east-2   us-east-2a   13m
huliu-aws412-945hh-worker-us-east-2b-t2rvp   Running   m6i.xlarge   us-east-2   us-east-2b   12m
huliu-aws412-945hh-worker-us-east-2c-tw4tz   Failed                                          2m11s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399