
Bug 1983612

Summary: When using boot-from-volume "image", InstanceCreate leaks volumes in case machine-controller is rebooted
Product: OpenShift Container Platform
Reporter: Pierre Prinetti <pprinett>
Component: Cloud Compute
Assignee: Pierre Prinetti <pprinett>
Sub component: OpenStack Provider
QA Contact: Itzik Brown <itbrown>
Status: CLOSED ERRATA
Severity: low
Priority: medium
CC: adduarte, egarcia, itbrown, m.andre, mfedosin, pprinett
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: cluster-api-provider-openstack created instances in a non-idempotent manner with respect to volumes.
Consequence: If cluster-api-provider-openstack crashed after creating the boot volume but before creating the corresponding instance, the volume was left unused in OpenStack.
Fix: When creating a new instance, cluster-api-provider-openstack now deletes volumes with the target name before creating a new one.
Result: cluster-api-provider-openstack instance creation is idempotent with respect to root volumes.
Story Points: ---
Last Closed: 2021-10-18 17:39:54 UTC
Type: Bug

Description Pierre Prinetti 2021-07-19 08:32:56 UTC
Description of problem:

When creating a new instance that is set to boot from a new volume, CAPO creates a volume from the given image before creating the server. That volume is then attached to the server, and its lifespan is bound to that server.

If machine-controller is interrupted after volume creation has been initiated, but before the server is created, the volume leaks: it is never cleaned up or reused.

This case is not implausible, because volume creation takes noticeable time, proportional to the size of the source image.
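
For illustration only, the sketch below mirrors the order of operations described above; it is not the actual cluster-api-provider-openstack code. It assumes a gophercloud block-storage v3 client, and the function name, parameters, and the createServer callback are hypothetical stand-ins for the real server-creation path.

```
package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// createBootVolumeThenServer illustrates the two-step, non-atomic flow:
// the root volume is created from the image first, and only afterwards is
// the server created and the volume's lifespan bound to it.
func createBootVolumeThenServer(
	volumeClient *gophercloud.ServiceClient,
	machineName, imageID string,
	sizeGB int,
	createServer func(bootVolumeID string) error, // hypothetical stand-in for the Nova call
) error {
	// Creating a volume from an image takes noticeable time, proportional to
	// the image size; this is the window in which the controller can be killed.
	vol, err := volumes.Create(volumeClient, volumes.CreateOpts{
		Name:    machineName, // the boot volume is named after the machine
		Size:    sizeGB,
		ImageID: imageID,
	}).Extract()
	if err != nil {
		return err
	}

	// If machine-controller is interrupted here, nothing references the volume:
	// the next reconciliation creates a fresh one and this one leaks.
	return createServer(vol.ID)
}
```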


How reproducible: With good enough timing, this is reliably reproducible.


Steps to Reproduce:
Scale out a machineset with a `spec.template.spec.providerSpec.value.rootVolume` that has `sourceType: image`, wait for the new volume to appear in OpenStack, then kill machine-controller:

```
oc -n openshift-machine-api delete pod machine-api-controllers-<cluster_id> --now
```

When machine-controller is booted again, it will create a new volume before associating it with the server. The old volume with the same name is left untouched.

Actual results:
The volume created by the previous (interrupted) iteration of InstanceCreate is never cleaned up.


Expected results:
The volume created by the previous (interrupted) iteration of InstanceCreate is either pruned or reused.

Comment 1 Pierre Prinetti 2021-07-19 08:47:01 UTC
One solution could be to look for an existing volume by name in InstanceCreate before creating a new one. Since any match is likely the result of an interrupted operation, we could delete and recreate it.

One alternative would be to match the image checksum with the volume checksum, but this might be a terribly slow operation with a low chance of being useful.

The question to be solved is whether it is fine to delete all volumes named after the machine upon machine creation.
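
A minimal sketch of the first option (find any volume carrying the machine's name, prune it, then recreate), assuming a gophercloud block-storage v3 client; the function name and parameters are illustrative, not the patch that was actually merged.

```
package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// ensureBootVolume makes boot-volume creation idempotent: any volume already
// carrying the machine's name is assumed to be the leftover of an interrupted
// InstanceCreate, deleted, and then recreated from the image.
func ensureBootVolume(client *gophercloud.ServiceClient, machineName, imageID string, sizeGB int) (*volumes.Volume, error) {
	// Look for leftovers from a previous, interrupted run by name.
	pages, err := volumes.List(client, volumes.ListOpts{Name: machineName}).AllPages()
	if err != nil {
		return nil, fmt.Errorf("listing volumes: %w", err)
	}
	stale, err := volumes.ExtractVolumes(pages)
	if err != nil {
		return nil, err
	}
	// Deleting every match mirrors the proposal above; whether that is always
	// safe is exactly the open question raised in this comment.
	for _, v := range stale {
		if err := volumes.Delete(client, v.ID, volumes.DeleteOpts{}).ExtractErr(); err != nil {
			return nil, fmt.Errorf("deleting stale volume %s: %w", v.ID, err)
		}
	}

	// Recreate the root volume from the source image.
	return volumes.Create(client, volumes.CreateOpts{
		Name:    machineName,
		Size:    sizeGB,
		ImageID: imageID,
	}).Extract()
}
```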

Comment 3 Itzik Brown 2021-07-29 13:30:20 UTC
Verified on:
OCP 4.9.0-0.nightly-2021-07-27-181211
OSP HOS-16.1-RHEL-8-20210604.n.0

Scaled worker-0 machineset
$ oc scale --replicas=2 machineset ostest-bhssb-worker-0 -n openshift-machine-api

Verified that a new volume was created
(shiftstack) [stack@undercloud-0 ~]$ openstack volume list |grep worker-0                                             
| b1197586-047d-46d7-b702-7125d305b8ac | ostest-bhssb-worker-0-749v9              | in-use |   25 | Attached to ostest-bhssb-worker-0-749v9 on /dev/vda  |
| c16d25e9-4cbd-476b-9963-c4f30b58ae50 | pvc-f110027a-f981-4e95-9964-a0e5d93b0f93 | in-use |  100 | Attached to ostest-bhssb-worker-0-zvtw4 on /dev/vdb  |
| 330546f5-661d-4b8a-8349-7febedc82f79 | ostest-bhssb-worker-0-zvtw4              | in-use |   25 | Attached to ostest-bhssb-worker-0-zvtw4 on /dev/vda  |

Deleted the machine-api-controller
$ oc delete pod  -n openshift-machine-api machine-api-controllers-f85979566-wkjlf

Checked the new instance was created
$ openstack server list |grep worker-0                                             
| c75e5125-1960-49cf-98ad-8e94e167e266 | ostest-bhssb-worker-0-749v9 | ACTIVE | StorageNFS=172.17.5.199; ostest-bhssb-openshift=10.196.1.242 |       |        |
| 4e5b73e1-760e-4c26-bc0c-80a205ff653f | ostest-bhssb-worker-0-zvtw4 | ACTIVE | StorageNFS=172.17.5.176; ostest-bhssb-openshift=10.196.0.104 |       |        |

Checked that a new worker was created and no new volume got created

$ oc get nodes
NAME                          STATUS   ROLES    AGE   VERSION                                                         
ostest-bhssb-master-0         Ready    master   24h   v1.21.1+8268f88                                                 
ostest-bhssb-master-1         Ready    master   24h   v1.21.1+8268f88                                                 
ostest-bhssb-master-2         Ready    master   24h   v1.21.1+8268f88                                                 
ostest-bhssb-worker-0-749v9   Ready    worker   71s   v1.21.1+8268f88                                                 
ostest-bhssb-worker-0-zvtw4   Ready    worker   22h   v1.21.1+8268f88                                                 
ostest-bhssb-worker-1-6hgcn   Ready    worker   24h   v1.21.1+8268f88                                                 
ostest-bhssb-worker-2-7plmn   Ready    worker   24h   v1.21.1+8268f88

Comment 6 errata-xmlrpc 2021-10-18 17:39:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759