+++ This bug was initially created as a clone of Bug #2073398 +++

Description of problem:
When scaling out a machineset fails in the provisioning phase (e.g. because of a misconfiguration in the machineset), the OSP port created as part of provisioning is not cleaned up.

Version-Release number of selected component (if applicable):
4.9.27

How reproducible:
100%

Steps to Reproduce:

1. verify existing ports on new cluster

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------+--------+
| Name                                                                | Status |
+---------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                        | DOWN   |
| mycluster-2w5xs-master-0                                            | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
|                                                                     | DOWN   |
| mycluster-2w5xs-master-2                                            | ACTIVE |
| mycluster-2w5xs-master-1                                            | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | ACTIVE |
|                                                                     | DOWN   |
| mycluster-2w5xs-api-port                                            | DOWN   |
+---------------------------------------------------------------------+--------+

2. create machineset with bogus serverGroupID

$ oc get machineset mycluster-2w5xs-worker-0 -o yaml > /tmp/machineset.yaml
# rename mycluster-2w5xs-worker-0 to mycluster-2w5xs-worker-0-bogus-servergroup
# remove status and version fields
# decrease replicas 3 to 1
# introduce bogus serverGroupID
$ vi /tmp/machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < /tmp/machineset.yaml
abcd-1234
$ oc apply -f /tmp/machineset.yaml
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup created

3. confirm new machine is being provisioned

$ oc get machines
NAME                                               PHASE          TYPE          REGION      ZONE   AGE
mycluster-2w5xs-master-0                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-1                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-2                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-worker-0-8tslk                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q   Provisioning                                    40s   <<<--
mycluster-2w5xs-worker-0-hskzz                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-wxbvl                     Running        ocp4.master   regionOne   nova   51m

4. confirm provisioning fails due to invalid serverGroupID

$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller
<...>
E0408 08:21:21.263391 1 actuator.go:574] Machine error mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: error creating Openstack instance: Group must be a UUID
W0408 08:21:21.263504 1 controller.go:366] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: failed to create machine: error creating Openstack instance: Group must be a UUID
E0408 08:21:21.263596 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q" "namespace"="openshift-machine-api"
<...>

5. confirm new port is created

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------------------------+--------+
| Name                                                                                  | Status |
+---------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                          | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | DOWN   | <<<--
| mycluster-2w5xs-master-0                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-master-2                                                              | ACTIVE |
| mycluster-2w5xs-master-1                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-api-port                                                              | DOWN   |
+---------------------------------------------------------------------------------------+--------+

6. scale down machineset

$ oc scale machineset mycluster-2w5xs-worker-0-bogus-servergroup --replicas=0
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup scaled
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller|tail -1
I0408 08:25:20.241238 1 controller.go:270] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: machine deletion successful
$ oc get machines
NAME                             PHASE     TYPE          REGION      ZONE   AGE
mycluster-2w5xs-master-0         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-1         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-2         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-worker-0-8tslk   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-hskzz   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-wxbvl   Running   ocp4.master   regionOne   nova   57m

7. confirm port is still present

$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+---------------------------------------------------------------------------------------+--------+
| Name                                                                                  | Status |
+---------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                          | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 | DOWN   | <<<--
| mycluster-2w5xs-master-0                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-master-2                                                              | ACTIVE |
| mycluster-2w5xs-master-1                                                              | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                   | ACTIVE |
|                                                                                       | DOWN   |
| mycluster-2w5xs-api-port                                                              | DOWN   |
+---------------------------------------------------------------------------------------+--------+

Actual results:
The OSP port created during machine provisioning is not cleaned up after OSP instance creation fails and the OCP machine is deleted.

Expected results:
OpenStack ports bound to a failed machine are deleted if provisioning the instance fails.

Additional info:

--- Additional comment from mbooth on 2022-04-08 13:38:49 UTC ---

Note that this is a legacy CAPO bug, not a MAPO bug, because it's reported against 4.9. I would hope that this bug isn't present in MAPO, which uses upstream CAPO for server creation. Upstream CAPO has unit tests covering this exact scenario:
https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/6ba04de45920c102886bdeeeb21bf1a1119c5967/pkg/cloud/services/compute/instance_test.go#L743-L783

--- Additional comment from emacchi on 2022-04-08 14:06:05 UTC ---

This is a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1943378 for Cinder volumes. I confirm that this bug won't be present in 4.11. We'll work on fixing it for 4.10 and 4.9 as requested.
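
For illustration only, the expected behaviour from the description (a port created for a machine is deleted again when instance creation fails) can be sketched as below. This is a minimal Python sketch, not the legacy CAPO or MAPO code, which is written in Go. `FakeCloud`, `ProvisioningError`, `provision` and all of their methods are hypothetical names invented for the example; only the "Group must be a UUID" message is taken from the logs in this report.

```python
import uuid


class ProvisioningError(Exception):
    """Raised when the (hypothetical) cloud rejects a request."""


class FakeCloud:
    """In-memory stand-in for an OpenStack client; every method is hypothetical."""

    def __init__(self):
        self.ports = {}  # port ID -> port name
        self._next_id = 0

    def create_port(self, name):
        self._next_id += 1
        port_id = f"port-{self._next_id}"
        self.ports[port_id] = name
        return port_id

    def delete_port(self, port_id):
        self.ports.pop(port_id, None)

    def create_server(self, name, port_id, server_group_id):
        # Mimic the failure from the report: a server group ID that is not a UUID.
        try:
            uuid.UUID(server_group_id)
        except ValueError:
            raise ProvisioningError("Group must be a UUID") from None
        return f"server-for-{name}"


def provision(cloud, machine_name, server_group_id):
    """Create a port, then a server; delete the port again if the server cannot be created."""
    port_id = cloud.create_port(f"{machine_name}-port")
    try:
        return cloud.create_server(machine_name, port_id, server_group_id)
    except ProvisioningError:
        cloud.delete_port(port_id)  # do not leave the port behind
        raise


if __name__ == "__main__":
    cloud = FakeCloud()
    try:
        provision(cloud, "worker-0-bogus-servergroup", "abcd-1234")
    except ProvisioningError as err:
        print(f"provisioning failed: {err}; ports left: {len(cloud.ports)}")
```

The sketch cleans up in the same call that failed. A controller that retries could instead remove leftover ports at the start of the next attempt; which approach the actual fix takes is not stated in this part of the report.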
Removing the Triaged keyword because: * the QE automation assessment (flag qe_test_coverage) is missing
Verified on 4.10.13 on top of RHOS-16.2-RHEL-8-20220311.n.1.

On a running cluster with a single worker:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.13   True        False         140m    Cluster version is 4.10.13

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status
+------------------------------------------------------------------+--------+
| Name                                                             | Status |
+------------------------------------------------------------------+--------+
| ostest-tq67q-master-2                                            | ACTIVE |
| ostest-tq67q-api-port                                            | DOWN   |
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | ACTIVE |
| ostest-tq67q-master-0                                            | ACTIVE |
| ostest-tq67q-master-1                                            | ACTIVE |
|                                                                  | DOWN   |
| ostest-tq67q-ingress-port                                        | DOWN   |
|                                                                  | ACTIVE |
+------------------------------------------------------------------+--------+

Creating a new machineset with a bogus serverGroupID:

$ oc get machineset -n openshift-machine-api ostest-tq67q-worker-0 -o yaml > new_machineset.yaml
$ vi new_machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < new_machineset.yaml
"abcd-1234"

Applying the change:

$ oc apply -f new_machineset.yaml
machineset.machine.openshift.io/ostest-tq67q-worker-0-bogus-servergroup created

$ oc get machineset -n openshift-machine-api
NAME                                      DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-tq67q-worker-0                     1         1         1       1           169m
ostest-tq67q-worker-0-bogus-servergroup   1         1                             2m49s

$ oc get machine -n openshift-machine-api
NAME                                            PHASE          TYPE        REGION      ZONE   AGE
ostest-tq67q-master-0                           Running                                       176m
ostest-tq67q-master-1                           Running                                       176m
ostest-tq67q-master-2                           Running                                       176m
ostest-tq67q-worker-0-bogus-servergroup-vng8p   Provisioning                                  76s
ostest-tq67q-worker-0-zzs25                     Running        m4.xlarge   regionOne   nova   171m

$ oc logs -n openshift-machine-api machine-api-controllers-55b5559cdb-zffn4 -c machine-controller
[...]
E0505 12:23:20.641945 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="ostest-tq67q-worker-0-bogus-servergroup-qsbtl" "namespace"="openshift-machine-api"
I0505 12:24:42.563174 1 controller.go:175] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: reconciling Machine
I0505 12:24:43.081466 1 controller.go:386] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: reconciling machine triggers idempotent create
>>> I0505 12:24:45.976603 1 machineservice.go:700] Deleted stale trunk "e1afa4ff-a0f6-487c-a36d-46257d405ea6"
>>> I0505 12:24:46.731079 1 machineservice.go:674] Deleted stale port "0f36f937-6ca6-42b1-8023-201a4b9854e2"
I0505 12:24:46.731644 1 logr.go:252] events "msg"="Warning" "message"="CreateError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-tq67q-worker-0-bogus-servergroup-qsbtl","uid":"3af985e4-51c9-4eff-8444-bd5afdc6aae8","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"120542"} "reason"="FailedCreate"
E0505 12:24:46.763590 1 actuator.go:415] Machine error ostest-tq67q-worker-0-bogus-servergroup-qsbtl: error creating Openstack instance: Group must be a UUID
W0505 12:24:46.763653 1 controller.go:388] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: failed to create machine: error creating Openstack instance: Group must be a UUID
E0505 12:24:46.763781 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="ostest-tq67q-worker-0-bogus-servergroup-qsbtl" "namespace"="openshift-machine-api"

The port for the bogus instance appears for a moment, but is then removed after the instance creation fails:

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status
+------------------------------------------------------------------------------------+--------+
| Name                                                                               | Status |
+------------------------------------------------------------------------------------+--------+
| ostest-tq67q-master-2                                                              | ACTIVE |
| ostest-tq67q-api-port                                                              | DOWN   |
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5                   | ACTIVE |
| ostest-tq67q-worker-0-bogus-servergroup-qsbtl-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | DOWN   |
| ostest-tq67q-master-0                                                              | ACTIVE |
| ostest-tq67q-master-1                                                              | ACTIVE |
|                                                                                    | DOWN   |
| ostest-tq67q-ingress-port                                                          | DOWN   |
|                                                                                    | ACTIVE |
+------------------------------------------------------------------------------------+--------+

...after a few seconds:

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status
+------------------------------------------------------------------+--------+
| Name                                                             | Status |
+------------------------------------------------------------------+--------+
| ostest-tq67q-master-2                                            | ACTIVE |
| ostest-tq67q-api-port                                            | DOWN   |
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | ACTIVE |
| ostest-tq67q-master-0                                            | ACTIVE |
| ostest-tq67q-master-1                                            | ACTIVE |
|                                                                  | DOWN   |
| ostest-tq67q-ingress-port                                        | DOWN   |
|                                                                  | ACTIVE |
+------------------------------------------------------------------+--------+
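
The before/after comparison of `openstack port list` output used in this verification could be automated along the following lines. This is a minimal Python sketch that assumes the two-column table layout (Name, Status) shown in this report; `parse_port_table` and `new_named_ports` are hypothetical helper names, and the sample tables are shortened excerpts of the output above.

```python
import re

# One table row: "| <name, possibly empty> | <status> |", optionally followed by notes.
ROW = re.compile(r"^\|\s*(.*?)\s*\|\s*(\S+)\s*\|")


def parse_port_table(text):
    """Return (name, status) tuples from `openstack port list -c Name -c Status` output."""
    ports = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # border or blank line
        name, status = match.groups()
        if (name, status) == ("Name", "Status"):
            continue  # header row
        ports.append((name, status))
    return ports


def new_named_ports(before, after):
    """Return the sorted names of ports present in `after` but not in `before`."""
    known = {name for name, _ in parse_port_table(before) if name}
    return sorted({name for name, _ in parse_port_table(after) if name} - known)


BEFORE = """
+------+--------+
| Name | Status |
+------+--------+
| ostest-tq67q-master-2 | ACTIVE |
| ostest-tq67q-api-port | DOWN |
| | DOWN |
+------+--------+
"""

AFTER = """
+------+--------+
| Name | Status |
+------+--------+
| ostest-tq67q-master-2 | ACTIVE |
| ostest-tq67q-api-port | DOWN |
| ostest-tq67q-worker-0-bogus-servergroup-qsbtl-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | DOWN |
| | DOWN |
+------+--------+
"""

if __name__ == "__main__":
    for name in new_named_ports(BEFORE, AFTER):
        print(name)
```

Unnamed ports are ignored here because they cannot be told apart by name. Comparing port IDs instead of names would probably be more robust, at the cost of less readable output.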
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1690