Bug 1943378 - OpenStack machine_controller does not remove boot volumes when reconciler errors
Summary: OpenStack machine_controller does not remove boot volumes when reconciler errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Pierre Prinetti
QA Contact: rlobillo
URL:
Whiteboard:
Duplicates: 1916596 1952364 (view as bug list)
Depends On:
Blocks: 1985015
 
Reported: 2021-03-25 23:03 UTC by Andrew Collins
Modified: 2021-10-18 17:30 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a server is set to boot from volume and an image ID is passed, CAPO creates the root volume prior to creating the server. The root volume is set to be destroyed when the associated server is destroyed. However, if the server fails to create (notably because of quota limits), the volume is never associated with a server and the automatic deletion is never triggered.
Consequence: When the server quota limit is hit, each round of server-creation retry creates a new volume, possibly until the volume quota is reached (or server creation succeeds). This results in a leak of unused volumes.
Fix: With this fix, a newly created root volume is explicitly deleted as soon as the server creation call fails (a sketch of this pattern follows the field list below).
Result: When server creation fails (notably because of quota limits), the pre-created root volume is deleted before the next retry.
Clone Of:
Is cloned by: 1985015 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:29:50 UTC
Target Upstream Version:
Embargoed:
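
The Doc Text above describes the fix: delete the pre-created root volume as soon as the server-creation call fails, so retries do not leak volumes. Below is a minimal Go sketch of that pattern; all types and helper names (machineSpec, createRootVolume, createServer, deleteVolume) are illustrative, not CAPO's actual code.

```
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for CAPO's machine service types and OpenStack
// calls. Everything here is a sketch, not the provider's real API.
type machineSpec struct{ name string }
type volume struct{ ID string }
type server struct{ ID string }

func createRootVolume(spec machineSpec) (*volume, error) { return &volume{ID: "vol-1"}, nil }

func createServer(spec machineSpec, volumeID string) (*server, error) {
	// Simulate the failure from the bug report: Nova rejects the request.
	return nil, errors.New("403: Quota exceeded for ram")
}

func deleteVolume(id string) error { return nil }

// createInstance shows the fix pattern: because the server never existed,
// delete-on-termination will never fire, so the orphaned root volume is
// deleted explicitly before the next reconcile retry.
func createInstance(spec machineSpec) (*server, error) {
	vol, err := createRootVolume(spec)
	if err != nil {
		return nil, fmt.Errorf("create bootable volume: %w", err)
	}
	srv, err := createServer(spec, vol.ID)
	if err != nil {
		if delErr := deleteVolume(vol.ID); delErr != nil {
			return nil, fmt.Errorf("create server: %v (volume %s cleanup also failed: %v)", err, vol.ID, delErr)
		}
		return nil, fmt.Errorf("create server: %w", err)
	}
	return srv, nil
}

func main() {
	if _, err := createInstance(machineSpec{name: "worker-new"}); err != nil {
		fmt.Println("creation failed, volume cleaned up:", err)
	}
}
```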




Links:
- GitHub openshift/cluster-api-provider-openstack pull 188, Merged: "Bug 1943378: Eliminate instanceCreate volume leak" (last updated 2021-09-16 08:03:02 UTC)
- GitHub openshift/cluster-api-provider-openstack pull 191, Merged: "Bug 1943378: Fix InstanceCreate volume cleanup" (last updated 2021-09-16 08:03:02 UTC)
- Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:30:13 UTC)

Description Andrew Collins 2021-03-25 23:03:14 UTC
Created attachment 1766444 [details]
machine-controller log

Description of problem:
OpenShift machines provisioning OpenStack instances fail during provisioning, leading to an endless loop of new OpenStack instances and corresponding boot volumes being created.

The boot volumes are never cleaned up, so the tenant's volume quota is eventually exhausted and volume-quota-exceeded errors appear.


Version-Release number of selected component (if applicable):
OCP 4.7.0
OSP 13


How reproducible:
Reliably. Two clusters show the same behavior.

Steps to Reproduce:
1. Set RAM and volume quotas on the OpenStack project.
2. Create OpenShift machines totaling more RAM than the quota allows.

Actual results:
1. Volumes accumulate.

2. The machine-controller begins reporting volume-quota-exceeded errors:
```
W0325 05:21:05.306175       1 controller.go:317] dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: failed to create machine: error creating Openstack instance: Create bootable volume err: Expected HTTP response code [202] when accessing [POST https://cloud-rtp-1.custy.com:8776/v3/bfc169221f6c44a1a1161fdb00302113/volumes], but got 413 instead
{"overLimit": {"message": "VolumeLimitExceeded: Maximum number of volumes allowed (100) exceeded for quota 'volumes_ceph-standard'.", "code": 413, "retryAfter": "0"}}
```

Expected results:
Boot volumes created during a reconcile loop that errors should be cleaned up when provisioning is not successful; no boot volumes are left over.


Additional info:

Comment 1 Andrew Collins 2021-03-25 23:07:15 UTC
Created attachment 1766445 [details]
openstack volume list of dangling volumes.

Added output from openstack to show the volumes for nodes that failed to provision successfully.

```
# openstack volume list --long -f value -c Name -c Status  |awk '{print $2}' | sort | uniq -c
     88 available
     12 in-use
```

```
I0324 16:08:48.442116       1 machineservice.go:602] Creating a bootable volume from image dev-rtp-101-5zn4j-rhcos.
I0324 16:09:54.369526       1 machineservice.go:636] Bootable volume ef999343-08ea-4b1f-8e3a-fede1daff4af was created successfully.
E0324 16:09:56.899941       1 actuator.go:574] Machine error dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
W0324 16:09:56.899970       1 controller.go:317] dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
E0324 16:09:56.900033       1 controller.go:237] controller "msg"="Reconciler error" "error"="error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {\"forbidden\": {\"message\": \"Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram\", \"code\": 403}}" "controller"="machine_controller" "name"="dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t" "namespace"="openshift-machine-api" 
```

Comment 2 Andrew Collins 2021-03-26 18:07:47 UTC
I see the symptom of many 'available' volumes; possibly the same machine-controller condition as in https://bugzilla.redhat.com/show_bug.cgi?id=1916596#c3

Comment 4 Yu Qi Zhang 2021-03-29 16:30:31 UTC
Machine-controller is machine-api, not machine-config. Reassigning.

Comment 5 Matthew Booth 2021-03-30 10:43:53 UTC
This bug is valid. There are 2 issues here:

1. InstanceCreate does not check for an existing available volume with the same name before creating a new one.
2. InstanceDelete does not check that the volume has been deleted before returning success.

In the latter case, the volume will normally be deleted with the instance, but if the instance was never created then this will not happen.

This bug also exists upstream.
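
For issue 1, a sketch of what the idempotency check could look like using gophercloud's Block Storage v3 API; the surrounding function and its integration into CAPO's machine service are assumptions for illustration:

```
package machineservice

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// getOrCreateBootVolume makes volume creation idempotent across reconcile
// retries: it reuses an existing volume with the machine's name before
// creating a new one, instead of leaking one volume per retry.
func getOrCreateBootVolume(client *gophercloud.ServiceClient, name string, createOpts volumes.CreateOpts) (*volumes.Volume, error) {
	pages, err := volumes.List(client, volumes.ListOpts{Name: name}).AllPages()
	if err != nil {
		return nil, fmt.Errorf("listing volumes: %w", err)
	}
	existing, err := volumes.ExtractVolumes(pages)
	if err != nil {
		return nil, fmt.Errorf("extracting volumes: %w", err)
	}
	if len(existing) > 0 {
		// A volume from a previous failed attempt is still around; reuse it.
		return &existing[0], nil
	}
	vol, err := volumes.Create(client, createOpts).Extract()
	if err != nil {
		return nil, fmt.Errorf("creating volume: %w", err)
	}
	return vol, nil
}
```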

Comment 6 Matthew Booth 2021-03-30 12:26:07 UTC
*** Bug 1916596 has been marked as a duplicate of this bug. ***

Comment 8 Martin André 2021-04-28 12:36:55 UTC
*** Bug 1952364 has been marked as a duplicate of this bug. ***

Comment 9 Martin André 2021-04-28 12:41:56 UTC
The bug is present in 4.6 as well.

Comment 15 Pierre Prinetti 2021-07-16 09:01:26 UTC
Here I answer a private comment asking whether 4.8.0 will contain the fix.

The answer is:
4.8 Code freeze was 2021-06-11. There is no chance that the fix will be in 4.8.0 (GA 2021-07-27).
This bug will be fixed in the 4.9 development branch and backported to earlier versions' z-streams as far as we realistically can.

Comment 16 Matthew Booth 2021-07-21 12:22:57 UTC
Unfortunately this isn't quite fixed yet: https://github.com/openshift/cluster-api-provider-openstack/pull/188#discussion_r673919014

Comment 18 Pierre Prinetti 2021-07-28 08:14:09 UTC
> this isn't quite fixed yet

It should be now.

Comment 19 Andrew Collins 2021-07-28 16:02:29 UTC
I know this BZ was opened specifically for the volumes that are leaking, but we have also seen ports leak.
Hitting the instance quota leads to volume leaking.
Repeated volume leaking eventually leads to port leaking.

I think that hints at an overall atomic commit/rollback that must happen when the OpenStack actuator creates the instance, port, volume, and anything else (see the sketch below).

I understand if the port needs a separate BZ, but want to make sure it is tracked nonetheless.
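
As an illustration of the atomic commit/rollback idea (a sketch, not CAPO's actual design): each successfully created resource registers an undo function, and a failure at any later step unwinds everything created so far in reverse order. All create/delete helpers below are hypothetical stand-ins for the actuator's port/volume/server calls.

```
package main

import "fmt"

// rollbackStack records undo actions for resources created so far; run()
// unwinds them in reverse creation order when a later step fails.
type rollbackStack struct{ undos []func() error }

func (r *rollbackStack) add(undo func() error) { r.undos = append(r.undos, undo) }

func (r *rollbackStack) run() {
	for i := len(r.undos) - 1; i >= 0; i-- {
		if err := r.undos[i](); err != nil {
			// Log and keep unwinding; a failed undo must not strand the rest.
			fmt.Println("rollback step failed:", err)
		}
	}
}

// Hypothetical stand-ins for the OpenStack actuator's resource calls.
func createPort() (string, error)   { return "port-1", nil }
func deletePort(id string) error    { fmt.Println("deleted port", id); return nil }
func createVolume() (string, error) { return "vol-1", nil }
func deleteVolume(id string) error  { fmt.Println("deleted volume", id); return nil }
func createServer(portID, volumeID string) (string, error) {
	return "", fmt.Errorf("quota exceeded for ram") // simulate the failure
}

func main() {
	var rb rollbackStack

	portID, err := createPort()
	if err != nil {
		return
	}
	rb.add(func() error { return deletePort(portID) })

	volumeID, err := createVolume()
	if err != nil {
		rb.run()
		return
	}
	rb.add(func() error { return deleteVolume(volumeID) })

	// Server creation fails (quota): unwind the volume and the port so
	// neither leaks across reconcile retries.
	if _, err := createServer(portID, volumeID); err != nil {
		rb.run()
		fmt.Println("creation failed:", err)
	}
}
```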

Comment 20 rlobillo 2021-07-29 12:29:59 UTC
Verified on 4.9.0-0.nightly-2021-07-27-181211 on OSP16.1 (RHOS-16.1-RHEL-8-20210604.n.0).

Creating a new worker that reaches the project's RAM quota and includes a rootVolume causes the volume to be created on OpenStack. However, once the Machine creation fails, the volume is removed on OSP.

$ oc apply -f new_worker.yaml # content of the file below.

$ oc logs -n openshift-machine-api machine-api-controllers-f85979566-wkjlf machine-controller -f
[...]
I0728 16:18:45.389205       1 controller.go:174] ostest-bhssb-worker-new: reconciling Machine
I0728 16:18:46.641397       1 controller.go:364] ostest-bhssb-worker-new: reconciling machine triggers idempotent create                                                                                                                     
I0728 16:18:51.959617       1 machineservice.go:726] Creating bootable volume with name "ostest-bhssb-worker-new" from image "rhcos".                                                                                                        
I0728 16:19:51.554436       1 machineservice.go:782] Bootable volume 25b8ae77-16d6-400e-b71a-7af1ccbd6f94 was created successfully.                                                                                                          
I0728 16:19:53.980340       1 machineservice.go:773] Deleted stale volume "25b8ae77-16d6-400e-b71a-7af1ccbd6f94"
E0728 16:19:53.997504       1 actuator.go:574] Machine error ostest-bhssb-worker-new: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
W0728 16:19:53.997534       1 controller.go:366] ostest-bhssb-worker-new: failed to create machine: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
E0728 16:19:53.997648       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {\"forbidden\": {\"code\": 403, \"message\": \"Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram\"}}" "name"="ostest-bhssb-worker-new" "namespace"="openshift-machine-api"         

The above process repeats indefinitely, all resources are removed, and the cluster remains stable:

$ openstack volume list  -f value | grep available | wc -l
1
$ openstack volume list  -f value | grep available | wc -l
0

By contrast, on 4.6 (4.6.0-0.nightly-2021-07-09-014429) the same procedure leaves an excess volume behind on each reconciliation iteration:

I0728 16:14:59.914885       1 machineservice.go:624] Creating a bootable volume from image cirros.                                                                                           
I0728 16:15:06.495430       1 machineservice.go:658] Bootable volume c873c047-82f4-4fb9-8e0c-56da98cbff52 was created successfully.                                                          
E0728 16:15:08.691085       1 actuator.go:550] Machine error ostest-jg7gx-worker-0-new: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}                                  
W0728 16:15:08.692644       1 controller.go:316] ostest-jg7gx-worker-0-new: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}                    
I0728 16:15:09.693229       1 controller.go:170] ostest-jg7gx-worker-0-new: reconciling Machine                                                                                              
I0728 16:15:12.034713       1 controller.go:314] ostest-jg7gx-worker-0-new: reconciling machine triggers idempotent create                                               

$ openstack volume list  -f value | grep available | wc -l
37


####

$ cat new_worker.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
  labels:
    machine.openshift.io/cluster-api-cluster: ostest-bhssb
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: AZhci-0
  name: ostest-bhssb-worker-new
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      availabilityZone: AZhci-0
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-bhssb-nodes
            tags: openshiftClusterID=ostest-bhssb
      - filter: {}
        noAllowedAddressPairs: true
        uuid: 8be7adde-dbf0-49d7-b0f1-11e583278172
      rootVolume:
        availabilityZone: cinderAZ1
        deviceType: ""
        diskSize: 25
        sourceType: image
        sourceUUID: rhcos
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: ostest-bhssb-worker
      serverMetadata:
        Name: ostest-bhssb-worker
        openshiftClusterID: ostest-bhssb
      tags:
      - openshiftClusterID=ostest-bhssb
      trunk: true
      userDataSecret:
        name: worker-user-data

Comment 24 errata-xmlrpc 2021-10-18 17:29:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

