Bug 1870282 - RBD - postpone snapshot removal until final volume deleted
Summary: RBD - postpone snapshot removal until final volume deleted
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: z16
Assignee: Rajat Dhasmana
QA Contact: Tzach Shefi
Docs Contact: RHOS Documentation Team
URL:
Whiteboard:
Duplicates: 1575652 1791829
Depends On: 1575652
Blocks: 1437392 1795959
 
Reported: 2020-08-19 16:17 UTC by Gregory Charot
Modified: 2023-12-15 18:56 UTC
CC List: 20 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of: 1575652
Environment:
Last Closed: 2021-04-01 08:46:21 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
System ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit 754397 | 0 | None | NEW | RBD: Delete snapshots with volume dependency | 2021-02-16 18:57:38 UTC
Red Hat Issue Tracker OSP-1648 | 0 | None | None | None | 2022-08-30 12:13:29 UTC

Description Gregory Charot 2020-08-19 16:17:24 UTC
Cloning this bug to assess a workaround for OSP13.

For OSP 16.x we are looking at backporting the optimal solution (clone v2), tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1764324

For OSP13 we want to find a compromise that avoids the ImageBusy behavior.

+++ This bug was initially created as a clone of Bug #1575652 +++

Problem statement:

You can create a snapshot and then create a volume from that snapshot.
Until this patch, if you then tried to delete the snapshot, the RBD
driver would raise ImageBusy because the snapshot still had dependent
children. Since delete_snapshot() is asynchronous, the client is not
notified of the situation, and confusion ensues.

This feature changes the delete_snapshot() behaviour by hiding snapshots
that have dependent children until the final child volume is deleted.
Once that final child volume is deleted, the 'pending' snapshot is removed.
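To make the intended behaviour concrete, here is a minimal sketch of the approach at the RBD level, written against the rbd Python bindings. This is illustrative only, not the patch under review in gerrit 754397; the function names and structure are assumptions.

# Illustrative sketch only (not the actual Cinder patch). Assumes the rbd
# Python bindings and an already-opened ioctx for the Cinder volumes pool.
import rbd


def delete_snapshot(ioctx, volume_name, snap_name):
    """Remove a snapshot, or hide it while clones still depend on it."""
    # Open with the snapshot context so list_children() reports its clones.
    with rbd.Image(ioctx, volume_name, snapshot=snap_name, read_only=True) as image:
        has_children = bool(image.list_children())

    with rbd.Image(ioctx, volume_name) as image:
        if has_children:
            # Dependent volumes exist: hide the snapshot instead of failing
            # with ImageBusy; it is really removed once the last child goes.
            image.rename_snap(snap_name, '%s.deleted' % snap_name)
            return
        if image.is_protected_snap(snap_name):
            image.unprotect_snap(snap_name)
        image.remove_snap(snap_name)


def delete_volume(ioctx, volume_name):
    """Delete a clone; if its parent snapshot was hidden, finish removing it."""
    with rbd.Image(ioctx, volume_name) as image:
        try:
            _pool, parent_name, parent_snap = image.parent_info()
        except rbd.ImageNotFound:
            parent_name = parent_snap = None

    rbd.RBD().remove(ioctx, volume_name)

    if parent_snap and parent_snap.endswith('.deleted'):
        with rbd.Image(ioctx, parent_name, snapshot=parent_snap,
                       read_only=True) as parent:
            still_used = bool(parent.list_children())
        if not still_used:
            with rbd.Image(ioctx, parent_name) as parent:
                if parent.is_protected_snap(parent_snap):
                    parent.unprotect_snap(parent_snap)
                parent.remove_snap(parent_snap)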



+++ This bug was initially created as a clone of Bug #1254470 +++

Description of problem:
The deletion of a snapshot fails with the error 

2015-08-18 11:11:39.198 4418 ERROR cinder.volume.manager [req-e721e329-ba76-4088-bce6-dbfb0757fe05 b7bc6cb74f5d4935b8a994d4c4582de2 e577bce6013b4fde98e5e44bc82e7ca1 - - -] Cannot delete snapshot 0b33c534-7f52-4aa5-9fbb-725eb929b771: snapshot is busy

while a volume created from that snapshot still exists.

Version-Release number of selected component (if applicable):
python-cinder-2015.1.0-3.el7ost.noarch
ceph-common-0.80.8-7.el7cp.x86_64
openstack-cinder-2015.1.0-3.el7ost.noarch
python-cinderclient-1.2.1-1.el7ost.noarch


How reproducible:
100% 

Steps to Reproduce:
1. Create a volume 
2. Take a snapshot of the volume
3. Create a new volume from the snapshot
4. Delete the snapshot 
5. Check the snapshot list

Actual results:
The snapshot stays in the 'available' state

Expected results:
The Cinder client should either report why the action failed, or the snapshot should be tagged as deleted in the Ceph pool and its record removed from Cinder.
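For reference, the steps above can be scripted with python-cinderclient roughly as follows. This is a loose sketch: the keystone credentials, endpoint, and resource names are placeholders, and the sleeps stand in for proper status polling.

# Loose reproduction sketch; credentials and endpoint are placeholders.
import time

from cinderclient import client
from keystoneauth1 import loading, session

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='admin', password='secret',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
cinder = client.Client('3', session=session.Session(auth=auth))

vol = cinder.volumes.create(size=1, name='vol1')                         # step 1
time.sleep(15)
snap = cinder.volume_snapshots.create(vol.id, name='snap1')              # step 2
time.sleep(15)
clone = cinder.volumes.create(size=1, snapshot_id=snap.id, name='vol2')  # step 3
time.sleep(15)
cinder.volume_snapshots.delete(snap)                                     # step 4
time.sleep(15)
# Step 5: the snapshot is still listed and goes back to 'available',
# because the RBD backend hit 'snapshot is busy' in the asynchronous delete.
print([(s.name, s.status) for s in cinder.volume_snapshots.list()])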

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-08-18 04:22:16 EDT ---

Since this issue was entered in bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-08-18 04:22:16 EDT ---

The Target Release has been set to match the release flag.

--- Additional comment from Sergey Gotliv on 2015-09-21 08:58:00 EDT ---



--- Additional comment from Jon Bernard on 2016-02-17 17:23:06 EST ---

I believe the current patch in review most correctly addresses the problem.

--- Additional comment from Jon Bernard on 2016-08-18 10:31:17 EDT ---

Just to clarify, the patch is complete and awaiting reviews upstream.

--- Additional comment from Elise Gafford on 2016-11-01 12:53:09 EDT ---

No recent progress on this issue (no review attention on upstream patch). Low priority. Moving to RHOS 11 for triage.

--- Additional comment from Scott Lewis on 2017-01-13 11:49:38 EST ---

Updating Target Milestone; since it's already ON-DEV, putting it into M3.

--- Additional comment from Paul Grist on 2017-03-27 19:36:42 EDT ---

Will move this to OSP12 soon; the patch doesn't look like it's moving, and we need to start reducing OSP11 to must-fix only.

--- Additional comment from Paul Grist on 2017-07-13 19:30:52 EDT ---

Triage to OSP12 and see if we can get this patch merged again

--- Additional comment from Scott Lewis on 2017-11-22 14:05:11 EST ---

Removing TM for non-blockers in On-DEV

--- Additional comment from Scott Lewis on 2017-12-14 09:11:58 EST ---

Bulk post GA move to zstream

--- Additional comment from MD Sufiyan on 2018-02-11 20:22:13 EST ---

Hi Team,

Please provide an update on this; we have another customer hitting this issue.

Thanks in advance..
Rgds,
Sufiyan

--- Additional comment from Tzach Shefi on 2018-04-22 13:38:02 EDT ---

Hi Jon, 

Any news on this fix?
I just checked this on OSP13; it's still a bug.

--- Additional comment from Tzach Shefi on 2018-04-23 03:57:52 EDT ---

Added downstream automation coverage:

https://polarion.engineering.redhat.com/polarion/redirect/project/RHELOpenStackPlatform/workitem?id=RHELOSP-35341

--- Additional comment from Jon Bernard on 2018-05-02 12:20:46 EDT ---

I'll have to revisit this. The patch I proposed back then didn't merge; another version from a different author was considered, but I think that one has been abandoned.

--- Additional comment from PnT Account Manager on 2019-01-02 23:57:32 CET ---

Employee 'knylande' has left the company.

--- Additional comment from Jon Bernard on 2019-01-09 03:16:22 CET ---

Looks to be landing soon upstream

--- Additional comment from Eric Harney on 2019-04-18 15:31:42 CEST ---

Feature: deferred deletion for the RBD driver

https://review.openstack.org/#/c/608984/

--- Additional comment from Gorka Eguileor on 2019-05-16 16:35:12 CEST ---

@Eric, this is not about the deferred deletion of volumes; I believe this is mostly a bug fix, not an RFE.

According to this report we cannot create a volume from a snapshot and then delete the snapshot, so we need to apply the same renaming to snapshots that we already do to volumes when we delete them (rename them to end with .deleted).
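For context, the existing volume-side behaviour referred to here looks roughly like the following. This is a simplified sketch against the rbd Python bindings, not the actual driver code; the function name is illustrative.

# Simplified sketch of the volume-side "rename instead of fail" behaviour.
import rbd


def soft_delete_volume(ioctx, volume_name):
    with rbd.Image(ioctx, volume_name) as image:
        has_snaps = bool(list(image.list_snaps()))
    if has_snaps:
        # Snapshots still reference the image: hide it for later cleanup.
        rbd.RBD().rename(ioctx, volume_name, '%s.deleted' % volume_name)
    else:
        rbd.RBD().remove(ioctx, volume_name)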

--- Additional comment from Paul Grist on 2019-06-04 17:42:50 CEST ---

If it is indeed the latter in comment 4, can we update this one accordingly and remove FutureFeature?

--- Additional comment from Jon Bernard on 2019-06-04 21:26:37 CEST ---

After discussion with Gorka, we both agree that Arne's trash feature is the best way to handle this.  I'm working to verify those patches and make sure everything is working.  I think we can still remove the FutureFeature flag though.

--- Additional comment from Gregory Charot on 2019-06-05 12:30:36 CEST ---

(In reply to Jon Bernard from comment #7)
> After discussion with Gorka, we both agree that Arne's trash feature is the
> best way to handle this.  I'm working to verify those patches and make sure
> everything is working.  I think we can still remove the FutureFeature flag
> though.

Removing FutureFeature keyword as per Jon's comment.

Tzach, can you please ack this bug? It has been declared a bug rather than an RFE (no automatic test, CI, etc.). Thanks

--- Additional comment from RHEL Program Management on 2019-07-17 16:01:33 CEST ---

This item has been properly Triaged and planned for the release, and Target Release is now set to match the release flag. For details, see https://mojo.redhat.com/docs/DOC-1144661#jive_content_id_OSP_Release_Planning

--- Additional comment from Scott Lewis on 2019-07-17 16:03:08 CEST ---

This item has been properly Triaged and planned for the appropriate release, and is being tagged for tracking.

--- Additional comment from Jon Bernard on 2019-07-17 20:29:45 CEST ---

Arne's patch does not work the way I thought it did - it uses the trash feature to speed up independent removals leveraging asynchronicity, but it doesn't make use of deferring dependent snapshots and volumes.  It can without too much effort, and the good news is that RBD images in the trash namespace are not reported on pool usage (this is important), so some work remains to bridge this gap.
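For readers unfamiliar with it, the RBD trash feature mentioned here is exposed through the rbd Python bindings roughly as sketched below (Mimic or later); the pool and image names are placeholders, not values from this bug.

# Small sketch of the RBD trash API; pool/image names are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('volumes')
    try:
        # Move an image to the trash namespace instead of removing it outright;
        # 'delay' keeps it restorable for that many seconds.
        rbd.RBD().trash_move(ioctx, 'volume-1234', delay=3600)

        # Trashed images no longer appear in the normal listing...
        print(rbd.RBD().list(ioctx))
        # ...but can still be listed, restored, or purged from the trash.
        for entry in rbd.RBD().trash_list(ioctx):
            print(entry['id'], entry['name'])
            # rbd.RBD().trash_restore(ioctx, entry['id'], 'volume-1234')
            # rbd.RBD().trash_remove(ioctx, entry['id'])
    finally:
        ioctx.close()
finally:
    cluster.shutdown()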

--- Additional comment from Giulio Fidente on 2019-10-10 16:19:12 CEST ---

(In reply to Jon Bernard from comment #11)
> Arne's patch does not work the way I thought it did - it uses the trash
> feature to speed up independent removals leveraging asynchronicity, but it
> doesn't make use of deferring dependent snapshots and volumes.  It can
> without too much effort, and the good news is that RBD images in the trash
> namespace are not reported on pool usage (this is important), so some work
> remains to bridge this gap.

Will there be something that periodically tries to delete the images initially moved to the trash, too?

--- Additional comment from Tzach Shefi on 2019-11-05 08:09:36 CET ---

Greg,
As we are already QA-acked on this, removing needinfo.

--- Additional comment from RHEL Program Management on 2019-11-13 20:35:52 CET ---

This bugzilla has been removed from the release since it does not have an acked release flag. For details, see https://mojo.redhat.com/docs/DOC-1144661#jive_content_id_OSP_Release_Planning

--- Additional comment from Scott Lewis on 2019-11-18 17:27:44 CET ---

This item has had a change in release flag, and has been removed from tracking for the GA.

--- Additional comment from RHEL Program Management on 2020-02-27 06:29:38 CET ---

This item has been properly Triaged and planned for the release, and Target Release is now set to match the release flag. For details, see https://mojo.redhat.com/docs/DOC-1195410

--- Additional comment from Giulio Fidente on 2020-03-04 19:15:30 CET ---

Jon, should this be in POST state?

--- Additional comment from Jon Bernard on 2020-03-04 21:10:41 CET ---

I don’t think so; this feature should be included in the upcoming v2 clone patch, but that’s not yet posted upstream.

--- Additional comment from Gregory Charot on 2020-03-05 18:43:28 CET ---

I'm hesitant to push this one to 17. Will that be merged in Ussuri, and how likely is it that we can backport it to Train? Thanks.

--- Additional comment from Jon Bernard on 2020-03-05 19:32:01 CET ---

This should be resolved by the clone API driver update to support deferred deletion of both volumes and snapshots. I’m working on a multiattach testing tool right now, revert-to-snapshot is waiting on reviews, and this is next on my list after multiattach.

--- Additional comment from Gregory Charot on 2020-03-06 12:18:35 CET ---

Thanks Jon, I wanted to align on downstream integration and assess backport feasibility to OSP16 (train) or push this to OSP17 (Victoria).

Question: Will the clone API driver update land in Ussuri or Victoria and how likely is it backportable ? Does that modify the APIs and/or DB schema ?

--- Additional comment from Jon Bernard on 2020-03-20 21:15:03 CET ---

(In reply to Gregory Charot from comment #21)
> Thanks Jon, I wanted to align on downstream integration and assess backport
> feasibility to OSP16 (train) or push this to OSP17 (Victoria).
> 
> Question: Will the clone API driver update land in Ussuri or Victoria and
> how likely is it backportable ? Does that modify the APIs and/or DB schema ?

Given my current situation, I would say OSP17/Victoria.  The change itself will require at least Ceph Mimic (13.x) which shouldn't be a problem.  While it's technically possible to backport, I do expect it to be a somewhat large change.  If we can avoid backporting it, that would be my recommendation - I doubt upstream would be willing to accept it, so it would be something we'd have to maintain on our own.  That said, there are no API or DB schema changes, so maybe it won't be that bad, it really depends on the final size of the patch and the delta between osp16 and the release it lands in.

--- Additional comment from Gregory Charot on 2020-03-23 11:24:17 CET ---

Thanks Jon. I moved the clone v2 work to 17 (Victoria); once the patch is merged upstream we can evaluate backport feasibility based on (high) customer demand.

Now my second question, should we mark this BZ duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1764324 ?

Or should we explicitly verify this RFE? From what you told me this solves a very limited number of cases. Please work with Eric and Luigi in order to reach a consensus. Thanks!

--- Additional comment from Gregory Charot on 2020-06-25 14:29:12 CEST ---

Moving to z2 as z1 will be blockers only

Comment 2 Gregory Charot 2020-08-19 16:24:54 UTC
*** Bug 1575652 has been marked as a duplicate of this bug. ***

Comment 6 Gregory Charot 2021-03-31 09:18:47 UTC
We are too late in the OSP13 cycle (last zstream before ELS) to add new features, especially ones that deal with how we manage something as critical as snapshots.

Comment 7 Gregory Charot 2021-03-31 09:22:01 UTC
As an FYI for any customer interested in this feature, it is planned to be supported in 16.2 with native Ceph support. We are open to backporting the support to 16.1 if customers really want it.

16.2 RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1764324

Comment 10 Luigi Toscano 2021-04-01 09:51:47 UTC
*** Bug 1791829 has been marked as a duplicate of this bug. ***

Comment 11 Alan Bishop 2021-06-08 13:37:08 UTC
*** Bug 1969440 has been marked as a duplicate of this bug. ***

