Description of problem:
When scaling out a machineset fails in the provisioning phase (e.g. because of a misconfiguration in the machineset), the OSP port created as part of provisioning is not cleaned up.

Version-Release number of selected component (if applicable):
4.9.27

How reproducible:
100%

Steps to Reproduce:

1. Verify the existing ports on a new cluster

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------+--------+
| Name                                                                | Status |
+---------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                        | DOWN   |
| mycluster-2w5xs-master-0                                            | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
|                                                                     | DOWN   |
| mycluster-2w5xs-master-2                                            | ACTIVE |
| mycluster-2w5xs-master-1                                            | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
|                                                                     | DOWN   |
| mycluster-2w5xs-api-port                                            | DOWN   |
+---------------------------------------------------------------------+--------+

2. Create a machineset with a bogus serverGroupID

$ oc get machineset mycluster-2w5xs-worker-0 -o yaml > /tmp/machineset.yaml
# rename mycluster-2w5xs-worker-0 to mycluster-2w5xs-worker-0-bogus-servergroup
# remove status and version fields
# decrease replicas 3 to 1
# introduce bogus serverGroupID
$ vi /tmp/machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < /tmp/machineset.yaml
abcd-1234
$ oc apply -f /tmp/machineset.yaml
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup created

3. Confirm that the new machine enters the Provisioning phase

$ oc get machines
NAME                                               PHASE          TYPE          REGION      ZONE   AGE
mycluster-2w5xs-master-0                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-1                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-2                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-worker-0-8tslk                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q   Provisioning                                    40s   <<<--
mycluster-2w5xs-worker-0-hskzz                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-wxbvl                     Running        ocp4.master   regionOne   nova   51m

4. Confirm that provisioning fails due to the invalid serverGroupID

$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller
<...>
E0408 08:21:21.263391 1 actuator.go:574] Machine error mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: error creating Openstack instance: Group must be a UUID
W0408 08:21:21.263504 1 controller.go:366] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: failed to create machine: error creating Openstack instance: Group must be a UUID
E0408 08:21:21.263596 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q" "namespace"="openshift-machine-api"
<...>

5. Confirm that a new port has been created

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------------------------+--------+
| Name                                                                                  | Status |
+---------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                          | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | DOWN   | <<<--
| mycluster-2w5xs-master-0                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-master-2                                                              | ACTIVE |
| mycluster-2w5xs-master-1                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-api-port                                                              | DOWN   |
+---------------------------------------------------------------------------------------+--------+

6. Scale down the machineset

$ oc scale machineset mycluster-2w5xs-worker-0-bogus-servergroup --replicas=0
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup scaled
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller | tail -1
I0408 08:25:20.241238 1 controller.go:270] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: machine deletion successful
$ oc get machines
NAME                             PHASE     TYPE          REGION      ZONE   AGE
mycluster-2w5xs-master-0         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-1         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-2         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-worker-0-8tslk   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-hskzz   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-wxbvl   Running   ocp4.master   regionOne   nova   57m

7. Confirm that the port is still present

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------------------------+--------+
| Name                                                                                  | Status |
+---------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                          | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | DOWN   | <<<--
| mycluster-2w5xs-master-0                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-master-2                                                              | ACTIVE |
| mycluster-2w5xs-master-1                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-api-port                                                              | DOWN   |
+---------------------------------------------------------------------------------------+--------+

Actual results:
The OSP port created during machine provisioning is not cleaned up after OSP instance creation fails and the OCP machine is deleted.

Expected results:
OpenStack ports created for a machine are deleted if provisioning the instance fails.

Additional info:
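A leftover port like the one in step 7 could be spotted automatically by comparing the port list against the machines that still exist. The following is only a rough sketch, not part of any OpenShift or OpenStack tooling. It assumes, based solely on the output above, that machine ports are named `<machine-name>-<uuid>`; the helper names `parse_port_table` and `find_orphaned_ports` are hypothetical.

```python
import re

# Assumption: machine ports end in a canonical UUID, as seen in the output above,
# e.g. "-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508".
_UUID_SUFFIX = re.compile(
    r"-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def parse_port_table(text):
    """Return (name, status) pairs from `openstack port list` table output."""
    ports = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        # Only the first two cells matter; trailing markers such as "<<<--" are dropped.
        cells = [cell.strip() for cell in line.split("|")[1:3]]
        if len(cells) != 2 or cells == ["Name", "Status"]:
            continue  # skip malformed rows and the header row
        ports.append((cells[0], cells[1]))
    return ports


def find_orphaned_ports(port_table, machine_names):
    """List ports named <machine>-<uuid> whose machine is not in machine_names."""
    live = set(machine_names)
    orphans = []
    for name, _status in parse_port_table(port_table):
        match = _UUID_SUFFIX.search(name)
        if not match:
            continue  # unnamed ports and ports without a UUID suffix are ignored
        machine = name[: match.start()]
        if machine not in live:
            orphans.append(name)
    return orphans
```

Fed the table from step 7 and the machine names from step 6, this would be expected to report only the `bogus-servergroup` port. Ports without a UUID suffix (API, ingress, control plane) are skipped on purpose, so the check says nothing about those.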
Note that this is a legacy CAPO bug, not a MAPO bug, because it's reported against 4.9. I would hope that this bug isn't present in MAPO, which uses upstream CAPO for server creation. Upstream CAPO has unit tests covering this exact scenario: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/6ba04de45920c102886bdeeeb21bf1a1119c5967/pkg/cloud/services/compute/instance_test.go#L743-L783
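For illustration, the behaviour those tests are meant to guarantee can be sketched as "create the port, try to create the server, delete the port again if that fails". This is not CAPO's or MAPO's actual code: `FakeCloud` is a hypothetical in-memory stand-in for the OpenStack APIs, and `create_instance` is an invented name. The UUID check merely imitates the "Group must be a UUID" failure from the report.

```python
import uuid


class FakeCloud:
    """Hypothetical in-memory stand-in for the OpenStack network/compute APIs."""

    def __init__(self):
        self.ports = set()

    def create_port(self, name):
        self.ports.add(name)
        return name

    def delete_port(self, name):
        self.ports.discard(name)

    def create_server(self, name, port, server_group_id):
        # Imitate the failure from the report: reject a non-UUID server group.
        try:
            uuid.UUID(server_group_id)
        except ValueError:
            raise RuntimeError("Group must be a UUID") from None
        return {"name": name, "port": port}


def create_instance(cloud, machine_name, network_id, server_group_id):
    """Create a port and a server; remove the port if server creation fails."""
    port = cloud.create_port(f"{machine_name}-{network_id}")
    try:
        return cloud.create_server(machine_name, port, server_group_id)
    except Exception:
        # Roll back the port so a failed instance creation leaves nothing behind.
        cloud.delete_port(port)
        raise
```

With a server group ID such as `abcd-1234`, `create_instance` re-raises the error and `cloud.ports` ends up empty, which is the outcome described under "Expected results".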
This is a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1943378 for Cinder volumes. I confirm that this bug won't be present in 4.11. We'll work on fixing it for 4.10 and 4.9 as requested.
Verified based on the reproduction steps.

OCP 4.11.0-0.nightly-2022-04-25-220649
OSP RHOS-16.2-RHEL-8-20220311.n.1

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-ngctf-master-0         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-master-1         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-master-2         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-worker-0-2w89r   Failed                                   14m

$ oc describe machine/ostest-ngctf-worker-0-2w89r -n openshift-machine-api
...
Error when looking up server group with ID foobar: Resource not found: [GET https://10.0.0.101:13774/v2.1/os-server-groups/foobar], error message: {"itemNotFound": {"code": 404, "message": "Instance group foobar could not be found."}}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069