Description of problem:

When creating a new instance that is set to boot from a new volume, CAPO creates a volume from the given image before creating the server. That volume is then attached to the server, and its lifespan is bound to that server. If machine-controller is interrupted after volume creation is initiated but before the server is created, the volume leaks and is never cleaned up nor reused. This case is not implausible, because volume creation takes noticeable time, proportional to the size of the source image.

How reproducible:

With good enough timing, this is reliably reproducible.

Steps to Reproduce:

Scale out a machineset with a `spec.template.spec.providerSpec.value.rootVolume` that has `sourceType: image`, wait for the new volume to appear in OpenStack, then kill machine-controller:

```
oc -n openshift-machine-api delete pod machine-api-controllers-<cluster_id> --now
```

When machine-controller starts again, it creates a new volume before associating it with the server. The old volume with the same name is left untouched.

Actual results:

The volume created by the previous (interrupted) iteration of InstanceCreate is never cleaned up.

Expected results:

The volume created by the previous (interrupted) iteration of InstanceCreate is either pruned or reused.
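To make the leak window concrete, here is a minimal sketch of the flow described above, assuming gophercloud's block storage v3 and compute v2 bindings; `bootFromNewVolume` and its signature are hypothetical, not CAPO's actual code:

```go
package main

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/bootfromvolume"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// bootFromNewVolume mirrors the described flow: the volume is created first,
// and only the later server-create request binds its lifespan to the server.
// A crash between the two calls leaves the volume orphaned.
func bootFromNewVolume(volumeClient, computeClient *gophercloud.ServiceClient, name, imageID, flavorID string, sizeGiB int) error {
	vol, err := volumes.Create(volumeClient, volumes.CreateOpts{
		Name:    name,    // named after the machine
		Size:    sizeGiB,
		ImageID: imageID, // populating from the image is what makes this slow
	}).Extract()
	if err != nil {
		return err
	}

	// <-- If machine-controller dies here, nothing references vol.ID:
	//     the next reconciliation creates a fresh volume and vol leaks.

	_, err = bootfromvolume.Create(computeClient, bootfromvolume.CreateOptsExt{
		CreateOptsBuilder: servers.CreateOpts{Name: name, FlavorRef: flavorID},
		BlockDevice: []bootfromvolume.BlockDevice{{
			SourceType:          bootfromvolume.SourceVolume,
			UUID:                vol.ID,
			DestinationType:     bootfromvolume.DestinationVolume,
			DeleteOnTermination: true, // binds the volume's lifespan to the server
			BootIndex:           0,
		}},
	}).Extract()
	return err
}
```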
One solution could be to look for a volume by name in InstanceCreate before creating it. Since an existing volume with that name is most likely the leftover of an interrupted operation, we could delete it and recreate. An alternative would be to compare the image checksum with the volume checksum and reuse the volume on a match, but that is likely to be a terribly slow operation with little chance of being useful. The question to be settled is whether it is safe to delete all volumes named after the machine upon machine creation.
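For illustration, a minimal sketch of the look-up-then-delete approach, again assuming gophercloud's block storage v3 bindings; `ensureFreshRootVolume` is a hypothetical helper, not CAPO's actual code:

```go
package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// ensureFreshRootVolume deletes any volume left over from an interrupted
// InstanceCreate (identified by the machine name) and then creates a new one.
func ensureFreshRootVolume(client *gophercloud.ServiceClient, machineName string, sizeGiB int, imageID string) (*volumes.Volume, error) {
	// Look for volumes named after the machine: any match is presumed to be
	// the orphan of a previous, interrupted reconciliation.
	pages, err := volumes.List(client, volumes.ListOpts{Name: machineName}).AllPages()
	if err != nil {
		return nil, fmt.Errorf("listing volumes: %w", err)
	}
	existing, err := volumes.ExtractVolumes(pages)
	if err != nil {
		return nil, fmt.Errorf("extracting volumes: %w", err)
	}
	for _, v := range existing {
		// Delete rather than reuse: the leftover volume may never have
		// finished populating from the source image.
		if err := volumes.Delete(client, v.ID, volumes.DeleteOpts{}).ExtractErr(); err != nil {
			return nil, fmt.Errorf("deleting stale volume %s: %w", v.ID, err)
		}
	}
	return volumes.Create(client, volumes.CreateOpts{
		Name:    machineName,
		Size:    sizeGiB,
		ImageID: imageID,
	}).Extract()
}
```

Deleting rather than reusing sidesteps the question of whether the leftover volume finished populating from the image; the open question above, whether any same-named volume is safe to delete, still applies.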
Verified on:
OCP 4.9.0-0.nightly-2021-07-27-181211
OSP HOS-16.1-RHEL-8-20210604.n.0

Scaled the worker-0 machineset:

```
$ oc scale --replicas=2 machineset ostest-bhssb-worker-0 -n openshift-machine-api
```

Verified a new volume was created:

```
(shiftstack) [stack@undercloud-0 ~]$ openstack volume list | grep worker-0
| b1197586-047d-46d7-b702-7125d305b8ac | ostest-bhssb-worker-0-749v9              | in-use |  25 | Attached to ostest-bhssb-worker-0-749v9 on /dev/vda |
| c16d25e9-4cbd-476b-9963-c4f30b58ae50 | pvc-f110027a-f981-4e95-9964-a0e5d93b0f93 | in-use | 100 | Attached to ostest-bhssb-worker-0-zvtw4 on /dev/vdb |
| 330546f5-661d-4b8a-8349-7febedc82f79 | ostest-bhssb-worker-0-zvtw4              | in-use |  25 | Attached to ostest-bhssb-worker-0-zvtw4 on /dev/vda |
```

Deleted the machine-api-controllers pod:

```
$ oc delete pod -n openshift-machine-api machine-api-controllers-f85979566-wkjlf
```

Checked that the new instance was created:

```
$ openstack server list | grep worker-0
| c75e5125-1960-49cf-98ad-8e94e167e266 | ostest-bhssb-worker-0-749v9 | ACTIVE | StorageNFS=172.17.5.199; ostest-bhssb-openshift=10.196.1.242 |  |  |
| 4e5b73e1-760e-4c26-bc0c-80a205ff653f | ostest-bhssb-worker-0-zvtw4 | ACTIVE | StorageNFS=172.17.5.176; ostest-bhssb-openshift=10.196.0.104 |  |  |
```

Checked that the new worker joined and no new volume was created:

```
$ oc get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ostest-bhssb-master-0         Ready    master   24h   v1.21.1+8268f88
ostest-bhssb-master-1         Ready    master   24h   v1.21.1+8268f88
ostest-bhssb-master-2         Ready    master   24h   v1.21.1+8268f88
ostest-bhssb-worker-0-749v9   Ready    worker   71s   v1.21.1+8268f88
ostest-bhssb-worker-0-zvtw4   Ready    worker   22h   v1.21.1+8268f88
ostest-bhssb-worker-1-6hgcn   Ready    worker   24h   v1.21.1+8268f88
ostest-bhssb-worker-2-7plmn   Ready    worker   24h   v1.21.1+8268f88
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759