Bug 2073398 - machine-api-provider-openstack does not clean up OSP ports after failed server provisioning
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Emilien Macchi
QA Contact: Itzik Brown
URL:
Whiteboard:
Depends On:
Blocks: 2077380
 
Reported: 2022-04-08 12:00 UTC by Bram Verschueren
Modified: 2022-08-10 11:05 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:18 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-openstack pull 232 0 None open Bug 2073398: Fix InstanceCreate port & trunk cleanup 2022-04-08 14:26:50 UTC
Red Hat Knowledge Base (Solution) 6957287 0 None None None 2022-06-03 09:34:17 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:05:41 UTC

Description Bram Verschueren 2022-04-08 12:00:38 UTC
Description of problem:
When scaling out a machineset fails in the provisioning phase (e.g. due to a misconfiguration in the machineset), the OSP port created as part of provisioning is not cleaned up.

Version-Release number of selected component (if applicable):
4.9.27

How reproducible:
100%

Steps to Reproduce:
1. verify existing ports on new cluster
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status                                                                                   
+-------------------------------------------------------------------------+--------+
| Name                                                                    | Status |
+-------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                            | DOWN   |
| mycluster-2w5xs-master-0                                                | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
|                                                                         | DOWN   |
| mycluster-2w5xs-master-2                                                | ACTIVE |
| mycluster-2w5xs-master-1                                                | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | ACTIVE |
|                                                                         | DOWN   |
| mycluster-2w5xs-api-port                                                | DOWN   |
+-------------------------------------------------------------------------+--------+
2. create machineset with bogus serverGroupID
$ oc get machineset mycluster-2w5xs-worker-0 -o yaml > /tmp/machineset.yaml                                                                                                        
# rename mycluster-2w5xs-worker-0 to mycluster-2w5xs-worker-0-bogus-servergroup
# remove status and version fields
# decrease replicas 3 to 1
# introduce bogus serverGroupID
$ vi /tmp/machineset.yaml
$ yq '.spec.template.spec.providerSpec.value.serverGroupID' < /tmp/machineset.yaml 
abcd-1234
$ oc apply -f /tmp/machineset.yaml 
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup created
3. confirm a new machine is being provisioned
$ oc get machines                                              
NAME                                                   PHASE          TYPE                   REGION      ZONE   AGE
mycluster-2w5xs-master-0                           Running        ocp4.master   regionOne   nova   59m          
mycluster-2w5xs-master-1                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-master-2                           Running        ocp4.master   regionOne   nova   59m
mycluster-2w5xs-worker-0-8tslk                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q   Provisioning                                    40s     <<<--
mycluster-2w5xs-worker-0-hskzz                     Running        ocp4.master   regionOne   nova   51m
mycluster-2w5xs-worker-0-wxbvl                     Running        ocp4.master   regionOne   nova   51m    
4. confirm provisioning fails due to invalid serverGroupID
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller
<...>
E0408 08:21:21.263391       1 actuator.go:574] Machine error mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: error creating Openstack instance: Group must be a UUID
W0408 08:21:21.263504       1 controller.go:366] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: failed to create machine: error creating Openstack instance: Group must be a UUID
E0408 08:21:21.263596       1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Group must be a UUID" "name"="mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q" "namespace"="openshift-machine-api" 
<...>
5. confirm new port is created
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+-------------------------------------------------------------------------------------------+--------+
| Name                                                                                      | Status |
+-------------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                              | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | DOWN   |    <<<--
| mycluster-2w5xs-master-0                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-master-2                                                                  | ACTIVE |
| mycluster-2w5xs-master-1                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-api-port                                                                  | DOWN   |
+-------------------------------------------------------------------------------------------+--------+
6. scale down machineset
$ oc scale machineset mycluster-2w5xs-worker-0-bogus-servergroup --replicas=0                                                                                 
machineset.machine.openshift.io/mycluster-2w5xs-worker-0-bogus-servergroup scaled
$ oc logs machine-api-controllers-55597bc8dd-ffgmd -c machine-controller|tail -1
I0408 08:25:20.241238       1 controller.go:270] mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q: machine deletion successful
$ oc get machines
NAME                                 PHASE     TYPE                   REGION      ZONE   AGE
mycluster-2w5xs-master-0         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-1         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-master-2         Running   ocp4.master   regionOne   nova   66m
mycluster-2w5xs-worker-0-8tslk   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-hskzz   Running   ocp4.master   regionOne   nova   57m
mycluster-2w5xs-worker-0-wxbvl   Running   ocp4.master   regionOne   nova   57m

7. confirm port is still present
$ openstack port list --network mycluster-2w5xs-openshift -c Name -c Status
+-------------------------------------------------------------------------------------------+--------+
| Name                                                                                      | Status |
+-------------------------------------------------------------------------------------------+--------+
| mycluster-2w5xs-ingress-port                                                              | DOWN   |
| mycluster-2w5xs-worker-0-bogus-servergroup-7fs7q-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508     | DOWN   |    <<<--
| mycluster-2w5xs-master-0                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-8tslk-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
| mycluster-2w5xs-worker-0-hskzz-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-master-2                                                                  | ACTIVE |
| mycluster-2w5xs-master-1                                                                  | ACTIVE |
| mycluster-2w5xs-worker-0-wxbvl-6298aa5d-db2c-4e99-ad6e-5dc1a0de8508                       | ACTIVE |
|                                                                                           | DOWN   |
| mycluster-2w5xs-api-port                                                                  | DOWN   |
+-------------------------------------------------------------------------------------------+--------+

Actual results:
The OSP port created during machine provisioning is not cleaned up after OSP instance creation fails and the OCP machine is deleted.

Expected results:
OpenStack ports bound to a failed machine are deleted if provisioning the instance fails.

Additional info:
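The fix linked above ("Fix InstanceCreate port & trunk cleanup") amounts to deleting the ports created for a machine when the subsequent server create fails, instead of returning early and leaking them. The real provider code talks to Neutron and Nova via gophercloud; the following is only a minimal, self-contained Go sketch of the pattern, with a hypothetical in-memory portClient standing in for Neutron and a stubbed createServer that rejects a non-UUID server group, as in the reproducer above.

```go
package main

import (
	"errors"
	"fmt"
)

// portClient is a hypothetical stand-in for the Neutron ports API.
type portClient struct {
	ports map[string]string // id -> name
	next  int
}

func (c *portClient) CreatePort(name string) string {
	c.next++
	id := fmt.Sprintf("port-%d", c.next)
	c.ports[id] = name
	return id
}

func (c *portClient) DeletePort(id string) { delete(c.ports, id) }

// createServer stands in for the Nova server create; it fails for a
// bogus serverGroupID, mimicking the "Group must be a UUID" error.
func createServer(serverGroupID string) error {
	if serverGroupID == "abcd-1234" {
		return errors.New("Group must be a UUID")
	}
	return nil
}

// provisionMachine creates the machine's port first, then the server.
// The cleanup step on the error path is what this bug was missing:
// without it, the port outlives the failed machine.
func provisionMachine(c *portClient, name, serverGroupID string) error {
	portID := c.CreatePort(name)
	if err := createServer(serverGroupID); err != nil {
		c.DeletePort(portID) // clean up the port on server-create failure
		return err
	}
	return nil
}

func main() {
	c := &portClient{ports: map[string]string{}}
	err := provisionMachine(c, "worker-0-bogus", "abcd-1234")
	fmt.Println("err:", err, "remaining ports:", len(c.ports))
}
```

With the cleanup line removed, the failed provision would leave one DOWN port behind, which matches the leaked `mycluster-2w5xs-worker-0-bogus-servergroup-...` port in the transcripts above.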

Comment 1 Matthew Booth 2022-04-08 13:38:49 UTC
Note that this is a legacy CAPO bug, not a MAPO bug, because it's reported against 4.9.

I would hope that this bug isn't present in MAPO, which uses upstream CAPO for server creation. Upstream CAPO has unit tests covering this exact scenario: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/6ba04de45920c102886bdeeeb21bf1a1119c5967/pkg/cloud/services/compute/instance_test.go#L743-L783

Comment 2 Emilien Macchi 2022-04-08 14:06:05 UTC
This is a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1943378 for Cinder volumes.
I confirm that this bug won't be present in 4.11.

We'll work on fixing it for 4.10 and 4.9 as requested.

Comment 5 Itzik Brown 2022-04-27 13:15:38 UTC
Verified based on the reproduction steps

OCP 4.11.0-0.nightly-2022-04-25-220649
OSP RHOS-16.2-RHEL-8-20220311.n.1

$ oc get machines -A 
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-ngctf-master-0         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-master-1         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-master-2         Running   m4.xlarge   regionOne   nova   25h
openshift-machine-api   ostest-ngctf-worker-0-2w89r   Failed                                   14m

$ oc describe machine/ostest-ngctf-worker-0-2w89r -n openshift-machine-api
...
Error when looking up server group with ID foobar: Resource not found: [GET https://10.0.0.101:13774/v2.1/os-server-groups/foobar], error message: {"itemNotFound": {"code": 404, "message": "Instance group foobar could not be found."}}

Comment 7 errata-xmlrpc 2022-08-10 11:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

