2077380 – machine-api-provider-openstack does not clean up OSP ports after failed server provisioning

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Emilien Macchi
QA Contact:	rlobillo
Docs Contact:
URL:
Whiteboard:
Depends On:	2073398
Blocks:	2077381

Reported:	2022-04-21 08:57 UTC by OpenShift BugZilla Robot
Modified:	2022-05-12 00:34 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-11 10:31:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-openstack pull 233	0	None	open	[release-4.10] Bug 2077380: Fix InstanceCreate port & trunk cleanup	2022-04-27 16:29:43 UTC
Red Hat Product Errata	RHBA-2022:1690	0	None	None	None	2022-05-11 10:32:12 UTC

Description OpenShift BugZilla Robot 2022-04-21 08:57:55 UTC

+++ This bug was initially created as a clone of Bug #2073398 +++

Description of problem:
When scaling out a machineset fails in the provisioning phase (e.g. because of a misconfiguration in the machineset), the OSP port created as part of provisioning is not cleaned up.

Version-Release number of selected component (if applicable):
4.9.27

How reproducible:
100%

Steps to Reproduce:
1. verify existing ports on new cluster
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status                                                                                   
+-------------------------------------------------------------------------+--------+
| Name                                                                    | Status |
+-------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                            | DOWN   |
| mycluster-2w5xs-master-0                                                | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
|                                                                         | DOWN   |
| mycluster-2w5xs-master-2                                                | ACTIVE |
| mycluster-2w5xs-master-1                                                | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
|                                                                         | DOWN   |
| mycluster-2w5xs-api-port                                                | DOWN   |
+-------------------------------------------------------------------------+--------+
2. create machineset with bogus serverGroupID
$ oc get machineset mycluster-2w5xs-worker-0 -o yaml > /tmp/machineset.yaml                                                                                                        
# rename mycluster-2w5xs-worker-0 to mycluster-2w5xs-worker-0-bogus-servergroup
# remove status and version fields
# decrease replicas 3 to 1
# introduce bogus serverGroupID
$ vi /tmp/machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < /tmp/machineset.yaml 
abcd-1234
$ oc apply -f /tmp/machineset.yaml 
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup created
3. confirm the new machine is being provisioned
$ oc get machines                                              
NAME                                                   PHASE          TYPE                   REGION      ZONE   AGE
mycluster-2w5xs-master-0                           Running        ocp4.master   regionOne   nova   59m          
mycluster-2w5xs-master-1                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-2                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-worker-0-8tslk                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q   Provisioning                                    40s     <<<--
mycluster-2w5xs-worker-0-hskzz                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-wxbvl                     Running        ocp4.master   regionOne   nova   51m    
4. confirm provisioning fails due to invalid serverGroupID
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller
<...>
E0408 08:21:21.263391       1 actuator.go:574] Machine error mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: error creating Openstack instance: Group must be a UUID
W0408 08:21:21.263504       1 controller.go:366] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: failed to create machine: error creating Openstack instance: Group must be a UUID
E0408 08:21:21.263596       1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q" "namespace"="openshift-machine-api" 
<...>
5. confirm new port is created
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+-------------------------------------------------------------------------------------------+--------+
| Name                                                                                      | Status |
+-------------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                              | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | DOWN   |    <<<--
| mycluster-2w5xs-master-0                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-master-2                                                                  | ACTIVE |
| mycluster-2w5xs-master-1                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-api-port                                                                  | DOWN   |
+-------------------------------------------------------------------------------------------+--------+
6. scale down machineset
$ oc scale machineset mycluster-2w5xs-worker-0-bogus-servergroup --replicas=0                                                                                 
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup scaled
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller|tail -1
I0408 08:25:20.241238       1 controller.go:270] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: machine deletion successful
$ oc get machines
NAME                                 PHASE     TYPE                   REGION      ZONE   AGE
mycluster-2w5xs-master-0         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-1         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-2         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-worker-0-8tslk   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-hskzz   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-wxbvl   Running   ocp4.master   regionOne   nova   57m

7. confirm port is still present
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+-------------------------------------------------------------------------------------------+--------+
| Name                                                                                      | Status |
+-------------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                              | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | DOWN   |    <<<--
| mycluster-2w5xs-master-0                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-master-2                                                                  | ACTIVE |
| mycluster-2w5xs-master-1                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-api-port                                                                  | DOWN   |
+-------------------------------------------------------------------------------------------+--------+

Actual results:
The OSP port created during machine provisioning is not cleaned up after the OSP instance creation fails and the OCP machine is deleted.
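
The comparison done by eye in steps 5 to 7 (port listing versus machine listing) can be sketched as a small script. This is only an illustrative helper, not part of the provider: the function names are hypothetical, and it assumes that machine ports are named `<machine-name>-<suffix>` with a shared UUID suffix, as the listings above suggest.

```python
def parse_port_names(table):
    """Return the non-empty Name cells from `openstack port list` table output."""
    names = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] in ("", "Name"):
            continue  # skip the header row and unnamed ports
        names.append(cells[0])
    return names


def orphaned_machine_ports(port_names, machine_names, suffix):
    """Return ports named `<machine>-<suffix>` whose machine no longer exists.

    Assumption: `suffix` is the UUID shared by all machine ports in the
    listing (e.g. 6298aa5d-db2c-4e99-ad6e-5dc1a0de8508 in this report).
    Ports without that suffix (masters, API and ingress ports) are ignored.
    """
    tail = "-" + suffix
    orphans = []
    for name in port_names:
        if not name.endswith(tail):
            continue
        owner = name[: -len(tail)]
        if owner not in machine_names:
            orphans.append(name)
    return orphans
```

Fed with the step 7 port listing and the machine names from step 6, this would report only the `...-bogus-servergroup-7fs7q-...` port.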

Expected results:
OpenStack ports bound to a failed machine are deleted if provisioning the instance fails.
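
As a rough illustration of the expected behavior, the sketch below creates a port (and optionally a trunk) before the server and removes both again if server creation fails. It is written in Python for brevity and is not the provider's actual Go code; `InMemoryNetwork`, `create_instance` and the `create_server` callback are hypothetical stand-ins for the real OpenStack clients. The fake refuses to delete a port that is still a trunk parent, so the trunk is removed first, which matches the trunk-then-port order in the verification log in comment 5.

```python
class InMemoryNetwork:
    """Hypothetical stand-in for a networking client; tracks ports and trunks."""

    def __init__(self):
        self.ports = {}   # port id -> port name
        self.trunks = {}  # trunk id -> parent port id
        self.log = []
        self._next = 0

    def _new_id(self, kind):
        self._next += 1
        return f"{kind}-{self._next}"

    def create_port(self, name):
        port_id = self._new_id("port")
        self.ports[port_id] = name
        return port_id

    def create_trunk(self, port_id):
        trunk_id = self._new_id("trunk")
        self.trunks[trunk_id] = port_id
        return trunk_id

    def delete_trunk(self, trunk_id):
        del self.trunks[trunk_id]
        self.log.append(f"deleted trunk {trunk_id}")

    def delete_port(self, port_id):
        # Modeled constraint: a trunk parent port cannot be deleted.
        if port_id in self.trunks.values():
            raise RuntimeError("port is still a trunk parent")
        del self.ports[port_id]
        self.log.append(f"deleted port {port_id}")


def create_instance(net, create_server, name, with_trunk=False):
    """Create a port (and trunk), then the server; undo the network
    resources if anything after the port creation fails."""
    port_id = net.create_port(name)
    trunk_id = None
    try:
        if with_trunk:
            trunk_id = net.create_trunk(port_id)
        return create_server(name, port_id)
    except Exception:
        # Delete the trunk before its parent port, then re-raise the
        # original error so the caller still sees why creation failed.
        if trunk_id is not None:
            net.delete_trunk(trunk_id)
        net.delete_port(port_id)
        raise
```

With a `create_server` callback that raises (for example on an invalid server group), the caller still gets the original error, and `net.ports` and `net.trunks` end up empty.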

Additional info:

--- Additional comment from mbooth on 2022-04-08 13:38:49 UTC ---

Note that this is a legacy CAPO bug, not a MAPO bug, because it's reported against 4.9.

I would hope that this bug isn't present in MAPO, which uses upstream CAPO for server creation. Upstream CAPO has unit tests covering this exact scenario: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/6ba04de45920c102886bdeeeb21bf1a1119c5967/pkg/cloud/services/compute/instance_test.go#L743-L783

--- Additional comment from emacchi on 2022-04-08 14:06:05 UTC ---

This is a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1943378 for Cinder volumes.
I confirm that this bug won't be present in 4.11.

We'll work on fixing it for 4.10 and 4.9 as requested.

Comment 1 ShiftStack Bugwatcher 2022-04-22 07:03:26 UTC

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 5 rlobillo 2022-05-05 12:26:31 UTC

Verified on 4.10.13 on top of RHOS-16.2-RHEL-8-20220311.n.1.

On a running cluster with a single worker:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.13   True        False         140m    Cluster version is 4.10.13

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status
+------------------------------------------------------------------+--------+                                                                                                                
| Name                                                             | Status |                                                                                                                
+------------------------------------------------------------------+--------+                                                                                                                
| ostest-tq67q-master-2                                            | ACTIVE |                                                                                                                
| ostest-tq67q-api-port                                            | DOWN   |                                                                                                                
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | ACTIVE |                                                                                                                
| ostest-tq67q-master-0                                            | ACTIVE |                                                                                                                
| ostest-tq67q-master-1                                            | ACTIVE |                                                                                                                
|                                                                  | DOWN   |                                                                                                                
| ostest-tq67q-ingress-port                                        | DOWN   |                                                                                                                
|                                                                  | ACTIVE |                                                                                                                
+------------------------------------------------------------------+--------+

Creating new machine set setting a bogus serverGroupID:

$ oc get machineset -n openshift-machine-api ostest-tq67q-worker-0 -o yaml > new_machineset.yaml
$ vi new_machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < new_machineset.yaml 
"abcd-1234"

Applying the change:

$ oc apply -f new_machineset.yaml         
machineset.machine.openshift.io/ostest-tq67q-worker-0-bogus-servergroup created
$ oc get machineset -n openshift-machine-api
NAME                                      DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-tq67q-worker-0                     1         1         1       1           169m
ostest-tq67q-worker-0-bogus-servergroup   1         1                             2m49s
$ oc get machine -n openshift-machine-api
NAME                                            PHASE          TYPE        REGION      ZONE   AGE
ostest-tq67q-master-0                           Running                                       176m
ostest-tq67q-master-1                           Running                                       176m
ostest-tq67q-master-2                           Running                                       176m
ostest-tq67q-worker-0-bogus-servergroup-vng8p   Provisioning                                  76s
ostest-tq67q-worker-0-zzs25                     Running        m4.xlarge   regionOne   nova   171m

$ oc logs -n openshift-machine-api machine-api-controllers-55b5559cdb-zffn4 -c machine-controller
[...]
E0505 12:23:20.641945       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="ostest-tq67q-worker-0-bogus-servergroup-qsbtl" "namespace"="openshift-machine-api" 
I0505 12:24:42.563174       1 controller.go:175] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: reconciling Machine
I0505 12:24:43.081466       1 controller.go:386] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: reconciling machine triggers idempotent create
>>> I0505 12:24:45.976603       1 machineservice.go:700] Deleted stale trunk "e1afa4ff-a0f6-487c-a36d-46257d405ea6"
>>> I0505 12:24:46.731079       1 machineservice.go:674] Deleted stale port "0f36f937-6ca6-42b1-8023-201a4b9854e2"
I0505 12:24:46.731644       1 logr.go:252] events "msg"="Warning"  "message"="CreateError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-tq67q-worker-0-bogus-servergroup-qsbtl","uid":"3af985e4-51c9-4eff-8444-bd5afdc6aae8","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"120542"} "reason"="FailedCreate"
E0505 12:24:46.763590       1 actuator.go:415] Machine error ostest-tq67q-worker-0-bogus-servergroup-qsbtl: error creating Openstack instance: Group must be a UUID
W0505 12:24:46.763653       1 controller.go:388] ostest-tq67q-worker-0-bogus-servergroup-qsbtl: failed to create machine: error creating Openstack instance: Group must be a UUID
E0505 12:24:46.763781       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="ostest-tq67q-worker-0-bogus-servergroup-qsbtl" "namespace"="openshift-machine-api" 

The port for the bogus instance appears for a moment, but it is removed after the instance creation fails:

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status                                                                                  
+------------------------------------------------------------------------------------+--------+
| Name                                                                               | Status |
+------------------------------------------------------------------------------------+--------+
| ostest-tq67q-master-2                                                              | ACTIVE |
| ostest-tq67q-api-port                                                              | DOWN   |
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5                   | ACTIVE |
| ostest-tq67q-worker-0-bogus-servergroup-qsbtl-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | DOWN   |                                                                                              
| ostest-tq67q-master-0                                                              | ACTIVE |
| ostest-tq67q-master-1                                                              | ACTIVE |
|                                                                                    | DOWN   |
| ostest-tq67q-ingress-port                                                          | DOWN   |
|                                                                                    | ACTIVE |
+------------------------------------------------------------------------------------+--------+

...after a few seconds:

$ openstack port list --network ostest-tq67q-openshift -c Name -c Status
+------------------------------------------------------------------+--------+                                                                                                                
| Name                                                             | Status |                                                                                                                
+------------------------------------------------------------------+--------+                                                                                                                
| ostest-tq67q-master-2                                            | ACTIVE |                                                                                                                
| ostest-tq67q-api-port                                            | DOWN   |                                                                                                                
| ostest-tq67q-worker-0-zzs25-51fbcc18-5e0d-423f-8d62-8a1b2c561da5 | ACTIVE |                                                                                                                
| ostest-tq67q-master-0                                            | ACTIVE |                                                                                                                
| ostest-tq67q-master-1                                            | ACTIVE |                                                                                                                
|                                                                  | DOWN   |                                                                                                                
| ostest-tq67q-ingress-port                                        | DOWN   |                                                                                                                
|                                                                  | ACTIVE |                                                                                                                
+------------------------------------------------------------------+--------+

Comment 7 errata-xmlrpc 2022-05-11 10:31:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.13 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1690
