Created attachment 1766444 [details]
machine-controller log

Description of problem:
While provisioning OpenStack instances, OpenShift hits an error and enters an endless loop of provisioning new OpenStack instances and their corresponding boot volumes. The boot volumes are never cleaned up, so the volume quota on the tenant is eventually reached and we see volume quota errors.

Version-Release number of selected component (if applicable):
OCP 4.7.0
OSP 13

How reproducible:
Reliably. Two clusters show the same behavior.

Steps to Reproduce:
1. Set RAM and volume quotas on the OpenStack project.
2. Create OpenShift machines totaling more RAM than the quota allows.

Actual results:
1. Volumes accumulate with each failed reconcile.
2. The machine-controller begins reporting volume quota errors:

```
W0325 05:21:05.306175 1 controller.go:317] dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: failed to create machine: error creating Openstack instance: Create bootable volume err: Expected HTTP response code [202] when accessing [POST https://cloud-rtp-1.custy.com:8776/v3/bfc169221f6c44a1a1161fdb00302113/volumes], but got 413 instead
{"overLimit": {"message": "VolumeLimitExceeded: Maximum number of volumes allowed (100) exceeded for quota 'volumes_ceph-standard'.", "code": 413, "retryAfter": "0"}}
```

Expected results:
Boot volumes created during a reconcile loop that errors should be cleaned up when the provision is not successful. No boot volumes are left over.

Additional info:
Created attachment 1766445 [details]
openstack volume list of dangling volumes

Added output from openstack to show the volumes for nodes that failed to provision successfully.

```
# openstack volume list --long -f value -c Name -c Status | awk '{print $2}' | sort | uniq -c
     88 available
     12 in-use
```

```
I0324 16:08:48.442116 1 machineservice.go:602] Creating a bootable volume from image dev-rtp-101-5zn4j-rhcos.
I0324 16:09:54.369526 1 machineservice.go:636] Bootable volume ef999343-08ea-4b1f-8e3a-fede1daff4af was created successfully.
E0324 16:09:56.899941 1 actuator.go:574] Machine error dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
W0324 16:09:56.899970 1 controller.go:317] dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
E0324 16:09:56.900033 1 controller.go:237] controller "msg"="Reconciler error" "error"="error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {\"forbidden\": {\"message\": \"Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram\", \"code\": 403}}" "controller"="machine_controller" "name"="dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t" "namespace"="openshift-machine-api"
```
I see the symptom of many 'available' volumes; possibly the same condition in the machine-controller as described in https://bugzilla.redhat.com/show_bug.cgi?id=1916596#c3
The machine-controller belongs to machine-api, not machine-config. Reassigning.
This bug is valid. There are 2 issues here:

1. InstanceCreate does not check for an existing available volume with the same name before creating a new one.
2. InstanceDelete does not check that the volume has been deleted before returning success.

In the latter case, the volume will normally be deleted with the instance, but if the instance was never created then this will not happen.

This bug also exists upstream.
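To illustrate fix (1), here is a minimal Go sketch of a get-or-create check. This is not the actual provider code: the real machine-controller talks to Cinder via gophercloud, while `fakeCinder`, `Volume`, and `getOrCreateBootVolume` here are hypothetical stand-ins that only demonstrate the reuse-before-create logic.

```go
package main

import (
	"errors"
	"fmt"
)

// Volume is a minimal stand-in for a Cinder volume.
type Volume struct {
	ID     string
	Name   string
	Status string
}

// fakeCinder is a hypothetical in-memory stand-in for the volume service.
type fakeCinder struct {
	volumes map[string]*Volume
	nextID  int
}

func (c *fakeCinder) listByName(name string) []*Volume {
	var out []*Volume
	for _, v := range c.volumes {
		if v.Name == name {
			out = append(out, v)
		}
	}
	return out
}

func (c *fakeCinder) create(name string) *Volume {
	c.nextID++
	v := &Volume{ID: fmt.Sprintf("vol-%d", c.nextID), Name: name, Status: "available"}
	c.volumes[v.ID] = v
	return v
}

// getOrCreateBootVolume sketches fix (1): reuse an existing volume with the
// machine's name instead of unconditionally creating a new one on every
// reconcile, which is what leaked volumes in this bug.
func getOrCreateBootVolume(c *fakeCinder, name string) (*Volume, error) {
	existing := c.listByName(name)
	if len(existing) > 1 {
		return nil, errors.New("ambiguous: multiple volumes named " + name)
	}
	if len(existing) == 1 {
		return existing[0], nil
	}
	return c.create(name), nil
}

func main() {
	cinder := &fakeCinder{volumes: map[string]*Volume{}}
	// First reconcile creates the volume...
	v1, _ := getOrCreateBootVolume(cinder, "worker-0")
	// ...a retry after a failed server create reuses it instead of leaking one.
	v2, _ := getOrCreateBootVolume(cinder, "worker-0")
	fmt.Println(v1.ID == v2.ID, len(cinder.volumes)) // true 1
}
```

With the original behavior (create without checking), every failed reconcile would add one more "available" volume, which matches the accumulation seen in comment 1.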
*** Bug 1916596 has been marked as a duplicate of this bug. ***
*** Bug 1952364 has been marked as a duplicate of this bug. ***
The bug is present in 4.6 as well.
Here I answer a private comment asking whether 4.8.0 will contain the fix. The answer is: 4.8 Code freeze was 2021-06-11. There is no chance that the fix will be in 4.8.0 (GA 2021-07-27). This bug will be fixed in the 4.9 development branch and backported to earlier versions' z-streams as far as we realistically can.
Unfortunately this isn't quite fixed yet: https://github.com/openshift/cluster-api-provider-openstack/pull/188#discussion_r673919014
> this isn't quite fixed yet It should be now.
I know this BZ was opened specifically for the volumes that are leaking, but we have seen ports leak as well. Hitting instance quota leads to volume leaking, and repeated volume leaking eventually leads to port leaking. I think that hints at the need for an overall atomic commit/rollback when the OpenStack actuator creates the instance, port, volume, and anything else. I understand if the port needs a separate BZ, but I want to make sure it is tracked nonetheless.
Verified on 4.9.0-0.nightly-2021-07-27-181211 on OSP 16.1 (RHOS-16.1-RHEL-8-20210604.n.0).

Creating a new worker which reaches the RAM quota for the project and includes a rootVolume provokes the volume to be created on OpenStack. However, once the Machine creation fails, the volume is removed on OSP.

$ oc apply -f new_worker.yaml   # content of the file below
$ oc logs -n openshift-machine-api machine-api-controllers-f85979566-wkjlf machine-controller -f
[...]
I0728 16:18:45.389205 1 controller.go:174] ostest-bhssb-worker-new: reconciling Machine
I0728 16:18:46.641397 1 controller.go:364] ostest-bhssb-worker-new: reconciling machine triggers idempotent create
I0728 16:18:51.959617 1 machineservice.go:726] Creating bootable volume with name "ostest-bhssb-worker-new" from image "rhcos".
I0728 16:19:51.554436 1 machineservice.go:782] Bootable volume 25b8ae77-16d6-400e-b71a-7af1ccbd6f94 was created successfully.
I0728 16:19:53.980340 1 machineservice.go:773] Deleted stale volume "25b8ae77-16d6-400e-b71a-7af1ccbd6f94"
E0728 16:19:53.997504 1 actuator.go:574] Machine error ostest-bhssb-worker-new: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
W0728 16:19:53.997534 1 controller.go:366] ostest-bhssb-worker-new: failed to create machine: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
E0728 16:19:53.997648 1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {\"forbidden\": {\"code\": 403, \"message\": \"Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram\"}}" "name"="ostest-bhssb-worker-new" "namespace"="openshift-machine-api"

The above process repeats indefinitely, all resources are removed, and the cluster remains stable:

$ openstack volume list -f value | grep available | wc -l
1
$ openstack volume list -f value | grep available | wc -l
0

For comparison, on 4.6 (4.6.0-0.nightly-2021-07-09-014429) the same procedure leaves excess volumes behind on each iteration of the reconcile loop:

I0728 16:14:59.914885 1 machineservice.go:624] Creating a bootable volume from image cirros.
I0728 16:15:06.495430 1 machineservice.go:658] Bootable volume c873c047-82f4-4fb9-8e0c-56da98cbff52 was created successfully.
E0728 16:15:08.691085 1 actuator.go:550] Machine error ostest-jg7gx-worker-0-new: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}
W0728 16:15:08.692644 1 controller.go:316] ostest-jg7gx-worker-0-new: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}
I0728 16:15:09.693229 1 controller.go:170] ostest-jg7gx-worker-0-new: reconciling Machine
I0728 16:15:12.034713 1 controller.go:314] ostest-jg7gx-worker-0-new: reconciling machine triggers idempotent create

$ openstack volume list -f value | grep available | wc -l
37

####

$ cat new_worker.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
  labels:
    machine.openshift.io/cluster-api-cluster: ostest-bhssb
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: AZhci-0
  name: ostest-bhssb-worker-new
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      availabilityZone: AZhci-0
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-bhssb-nodes
            tags: openshiftClusterID=ostest-bhssb
      - filter: {}
        noAllowedAddressPairs: true
        uuid: 8be7adde-dbf0-49d7-b0f1-11e583278172
      rootVolume:
        availabilityZone: cinderAZ1
        deviceType: ""
        diskSize: 25
        sourceType: image
        sourceUUID: rhcos
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: ostest-bhssb-worker
      serverMetadata:
        Name: ostest-bhssb-worker
        openshiftClusterID: ostest-bhssb
      tags:
      - openshiftClusterID=ostest-bhssb
      trunk: true
      userDataSecret:
        name: worker-user-data
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759