Bug 1943378

Summary: OpenStack machine_controller does not remove boot volumes when reconciler errors

| Field | Value |
| --- | --- |
| Product | OpenShift Container Platform |
| Component | Cloud Compute |
| Sub component | OpenStack Provider |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.6.z |
| Target Release | 4.9.0 |
| Hardware | x86_64 |
| OS | Linux |
| Keywords | Triaged |
| Reporter | Andrew Collins <ancollin> |
| Assignee | Pierre Prinetti <pprinett> |
| QA Contact | rlobillo |
| CC | adduarte, egarcia, emacchi, hfukumot, itbrown, m.andre, mbooth, mfedosin, mimccune, pprinett |
| Doc Type | Bug Fix |
Doc Text:

Cause: When a server is set to boot from volume and an image ID is passed, CAPO creates the root volume before creating the server. The root volume is set to be destroyed when the associated server is destroyed. However, if server creation fails (notably because of quota limits), the volume is never associated with a server and the automatic deletion is never triggered.

Consequence: When the server quota limit is hit, every round of server-creation retry creates a new volume, possibly until the volume quota is reached (or server creation succeeds). This results in a leak of unused volumes.

Fix: With this fix, a newly created root volume is explicitly deleted as soon as the server-creation call fails.

Result: When server creation fails (notably because of quota limits), the pre-created root volume is deleted before the next retry.
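The fix described above amounts to an explicit rollback step in the instance-create path. A minimal sketch of that pattern, assuming gophercloud v1's compute (Nova) and block-storage v3 (Cinder) APIs; the function createServerWithRootVolume and its wiring are hypothetical illustrations, not CAPO's actual code:

```
package sketch

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// createServerWithRootVolume pre-creates the root volume and deletes it again
// if the server-create call fails (for example on a quota error), so that
// retries do not leak one volume per reconcile loop.
func createServerWithRootVolume(compute, storage *gophercloud.ServiceClient,
	volOpts volumes.CreateOpts, srvOpts servers.CreateOptsBuilder) (*servers.Server, error) {

	vol, err := volumes.Create(storage, volOpts).Extract()
	if err != nil {
		return nil, fmt.Errorf("creating root volume: %w", err)
	}

	srv, err := servers.Create(compute, srvOpts).Extract()
	if err != nil {
		// The volume was never attached to a server, so Nova will not clean
		// it up for us; delete it explicitly before the next retry.
		if delErr := volumes.Delete(storage, vol.ID, volumes.DeleteOpts{}).ExtractErr(); delErr != nil {
			return nil, fmt.Errorf("server create failed (%v) and deleting stale volume %s failed: %w", err, vol.ID, delErr)
		}
		return nil, fmt.Errorf("creating server: %w", err)
	}
	return srv, nil
}
```

The key point is that the delete happens in the same reconcile pass as the failed create, so the next retry starts from a clean slate.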
| Field | Value |
| --- | --- |
| Clones | 1985015 (view as bug list) |
| Bug Blocks | 1985015 |
| Last Closed | 2021-10-18 17:29:50 UTC |
| Type | Bug |
Description
Andrew Collins
2021-03-25 23:03:14 UTC
Created attachment 1766445 [details]: openstack volume list of dangling volumes

Added output from openstack to show the volumes for nodes that failed to provision successfully.

```
# openstack volume list --long -f value -c Name -c Status | awk '{print $2}' | sort | uniq -c
     88 available
     12 in-use
```

```
I0324 16:08:48.442116 1 machineservice.go:602] Creating a bootable volume from image dev-rtp-101-5zn4j-rhcos.
I0324 16:09:54.369526 1 machineservice.go:636] Bootable volume ef999343-08ea-4b1f-8e3a-fede1daff4af was created successfully.
E0324 16:09:56.899941 1 actuator.go:574] Machine error dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
W0324 16:09:56.899970 1 controller.go:317] dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {"forbidden": {"message": "Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram", "code": 403}}
E0324 16:09:56.900033 1 controller.go:237] controller "msg"="Reconciler error" "error"="error creating Openstack instance: Create new server err: Request forbidden: [POST https://cloud-rtp-1.custy.com:8774/v2.1/servers], error message: {\"forbidden\": {\"message\": \"Quota exceeded for ram: Requested 32768, but already used 501760 of 532480 ram\", \"code\": 403}}" "controller"="machine_controller" "name"="dev-rtp-101-5zn4j-ext-kube-rp-1c-6792t" "namespace"="openshift-machine-api"
```

I see the symptom of many 'available' volumes, possibly the same condition in machine-controller as in https://bugzilla.redhat.com/show_bug.cgi?id=1916596#c3

---

Machine-controller is machine-api, not machine-config. Reassigning.

---

This bug is valid. There are 2 issues here:

1. InstanceCreate does not check for an existing available volume with the same name before creating a new one.
2. InstanceDelete does not check that the volume has been deleted before returning success.

In the latter case, the volume will normally be deleted with the instance, but if the instance was never created then this will not happen. (A sketch of both checks follows the comment thread below.)

This bug also exists upstream.

---

*** Bug 1916596 has been marked as a duplicate of this bug. ***

*** Bug 1952364 has been marked as a duplicate of this bug. ***

---

The bug is present in 4.6 as well.

---

Here I answer a private comment asking whether 4.8.0 will contain the fix. The answer is: the 4.8 code freeze was 2021-06-11, so there is no chance that the fix will be in 4.8.0 (GA 2021-07-27). This bug will be fixed in the 4.9 development branch and backported to earlier versions' z-streams as far as we realistically can.

---

Unfortunately this isn't quite fixed yet: https://github.com/openshift/cluster-api-provider-openstack/pull/188#discussion_r673919014

---

> this isn't quite fixed yet
It should be now.
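---

For reference, a rough sketch of the two checks described in the comment above (idempotent create, verified delete), again assuming gophercloud v1's block-storage v3 API; the helper names findVolumeByName and volumeGone are hypothetical, not CAPO's actual code:

```
package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v3/volumes"
)

// findVolumeByName makes volume creation idempotent: before creating a new
// root volume, look for a leftover one from a previous failed reconcile and
// reuse it instead of creating a duplicate.
func findVolumeByName(storage *gophercloud.ServiceClient, name string) (*volumes.Volume, error) {
	pages, err := volumes.List(storage, volumes.ListOpts{Name: name}).AllPages()
	if err != nil {
		return nil, err
	}
	vols, err := volumes.ExtractVolumes(pages)
	if err != nil {
		return nil, err
	}
	for i := range vols {
		if vols[i].Status == "available" {
			return &vols[i], nil
		}
	}
	return nil, nil // no reusable volume found
}

// volumeGone lets the delete path confirm that the root volume is actually
// gone before reporting success, instead of assuming Nova removed it.
func volumeGone(storage *gophercloud.ServiceClient, id string) (bool, error) {
	_, err := volumes.Get(storage, id).Extract()
	if err == nil {
		return false, nil // still exists
	}
	if _, ok := err.(gophercloud.ErrDefault404); ok {
		return true, nil // confirmed deleted
	}
	return false, err
}
```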
---

I know this BZ was opened specifically for the volumes that are leaking, but we have also seen ports leak as well. Instance quota exhaustion leads to volume leaking, and repeated volume leaking eventually leads to port leaking. I think that hints at an overall atomic commit/rollback that must happen when the openstack actuator creates the instance, port, volume, and anything else (see the sketch below). I understand if the port needs a separate BZ, but I want to make sure it is tracked nonetheless.
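The atomic commit/rollback idea could be structured as a small cleanup stack; this is a hypothetical Go illustration of the pattern, not CAPO's actual implementation:

```
package sketch

// cleanupStack records an undo action for each OpenStack resource as it is
// created (volume, port, ...). If a later creation step fails, runAll deletes
// everything created so far, newest first, so a failed reconcile leaves no
// orphaned resources behind.
type cleanupStack struct {
	undos []func() error
}

// push registers the delete call that undoes the creation just performed.
func (c *cleanupStack) push(undo func() error) {
	c.undos = append(c.undos, undo)
}

// runAll executes the undo actions in reverse order, returning the first
// error encountered while still attempting the remaining deletions.
func (c *cleanupStack) runAll() error {
	var first error
	for i := len(c.undos) - 1; i >= 0; i-- {
		if err := c.undos[i](); err != nil && first == nil {
			first = err
		}
	}
	return first
}
```

In the create path, each successful API call would push its matching delete (for example volumes.Delete or ports.Delete), and the stack would run only when a later call fails.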
---

Verified on 4.9.0-0.nightly-2021-07-27-181211 on OSP 16.1 (RHOS-16.1-RHEL-8-20210604.n.0).

Creating a new worker that hits the project's RAM quota and includes a rootVolume causes the volume to be created on OpenStack; however, once the Machine creation fails, the volume is removed on OSP.

```
$ oc apply -f new_worker.yaml # content of the file below

$ oc logs -n openshift-machine-api machine-api-controllers-f85979566-wkjlf machine-controller -f
[...]
I0728 16:18:45.389205 1 controller.go:174] ostest-bhssb-worker-new: reconciling Machine
I0728 16:18:46.641397 1 controller.go:364] ostest-bhssb-worker-new: reconciling machine triggers idempotent create
I0728 16:18:51.959617 1 machineservice.go:726] Creating bootable volume with name "ostest-bhssb-worker-new" from image "rhcos".
I0728 16:19:51.554436 1 machineservice.go:782] Bootable volume 25b8ae77-16d6-400e-b71a-7af1ccbd6f94 was created successfully.
I0728 16:19:53.980340 1 machineservice.go:773] Deleted stale volume "25b8ae77-16d6-400e-b71a-7af1ccbd6f94"
E0728 16:19:53.997504 1 actuator.go:574] Machine error ostest-bhssb-worker-new: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
W0728 16:19:53.997534 1 controller.go:366] ostest-bhssb-worker-new: failed to create machine: error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram"}}
E0728 16:19:53.997648 1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Request forbidden: [POST https://overcloud.redhat.local:13774/v2.1/servers], error message: {\"forbidden\": {\"code\": 403, \"message\": \"Quota exceeded for ram: Requested 16384, but already used 114688 of 114688 ram\"}}" "name"="ostest-bhssb-worker-new" "namespace"="openshift-machine-api"
```

The above process repeats indefinitely, all resources are removed, and the cluster remains stable:

```
$ openstack volume list -f value | grep available | wc -l
1
$ openstack volume list -f value | grep available | wc -l
0
```

For comparison, it is observed on 4.6 (4.6.0-0.nightly-2021-07-09-014429) that the same procedure leaves an extra volume behind on each iteration of reconciliation:

```
I0728 16:14:59.914885 1 machineservice.go:624] Creating a bootable volume from image cirros.
I0728 16:15:06.495430 1 machineservice.go:658] Bootable volume c873c047-82f4-4fb9-8e0c-56da98cbff52 was created successfully.
E0728 16:15:08.691085 1 actuator.go:550] Machine error ostest-jg7gx-worker-0-new: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}
W0728 16:15:08.692644 1 controller.go:316] ostest-jg7gx-worker-0-new: failed to create machine: error creating Openstack instance: Create new server err: Request forbidden: [POST https://10.0.0.101:13774/v2.1/servers], error message: {"forbidden": {"code": 403, "message": "Quota exceeded for ram: Requested 16384, but already used 98304 of 98304 ram"}}
I0728 16:15:09.693229 1 controller.go:170] ostest-jg7gx-worker-0-new: reconciling Machine
I0728 16:15:12.034713 1 controller.go:314] ostest-jg7gx-worker-0-new: reconciling machine triggers idempotent create
```

```
$ openstack volume list -f value | grep available | wc -l
37
```

new_worker.yaml:

```
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
  labels:
    machine.openshift.io/cluster-api-cluster: ostest-bhssb
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: AZhci-0
  name: ostest-bhssb-worker-new
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      availabilityZone: AZhci-0
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-bhssb-nodes
            tags: openshiftClusterID=ostest-bhssb
      - filter: {}
        noAllowedAddressPairs: true
        uuid: 8be7adde-dbf0-49d7-b0f1-11e583278172
      rootVolume:
        availabilityZone: cinderAZ1
        deviceType: ""
        diskSize: 25
        sourceType: image
        sourceUUID: rhcos
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: ostest-bhssb-worker
      serverMetadata:
        Name: ostest-bhssb-worker
        openshiftClusterID: ostest-bhssb
      tags:
      - openshiftClusterID=ostest-bhssb
      trunk: true
      userDataSecret:
        name: worker-user-data
```

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759