2256751 – Automatic flattening of snapshots are not working

Summary: Automatic flattening of snapshots are not working

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	documentation
Sub Component:
Version:	4.15
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Kusuma
QA Contact:	Olive Lakra
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-01-04 11:05 UTC by Rakshith
Modified:	2024-06-12 18:57 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Release Note
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-06-12 18:57:37 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	2232163	0	unspecified	CLOSED	Automatic flattening of snapshots are not working	2024-04-24 11:45:16 UTC

Description Rakshith 2024-01-04 11:05:43 UTC

This bug was initially created as a copy of Bug #2232163

I am copying this bug because: 

We need to document it as a known issue.
```
Cause: Flattening of RBD images in the trash cannot be performed.

Consequence: With a single common parent RBD PVC, if the sequence "Snapshot-Restore-DeleteSnapshot" is performed more than 450 times, no further snapshot or clone of the common parent RBD PVC can be taken.

Workaround (if any): Manually flatten the final restored PVCs so that the parent PVC can be snapshotted or cloned again. Instead of using the Snapshot-Restore-DeleteSnapshot sequence, users should use PVC-to-PVC clone to avoid this issue completely.

Result: Using the PVC-to-PVC clone method does not hit this issue.
```

Description of problem:

The issue is observed in an OpenShift Virtualization environment, where the snapshot count hits the hard limit of 450 when the user creates more than 450 VMs from the golden image/template image. When a VM is created from the golden image/template image, the containerized-data-importer (CDI) performs smart cloning [1]:

=> Create a snapshot of the source PVC
=> Create a PVC from the created snapshot
=> Delete the snapshot

Even after the snapshot is deleted, it remains in the trash because it is still linked to the cloned images.

~~~
A cloned image:

# rbd info orion/csi-vol-fdc42829-e939-4cad-be29-d5f6046eefc2 |grep parent
	parent: orion/csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2@2762fc24-7919-4ea2-b096-d92636352751 (trash 30fdb8fd11e6fc)

# rbd trash ls --pool orion |grep 30fdb8fd11e6fc
30fdb8fd11e6fc csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2

Source image:

# rbd snap ls orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c --all |grep csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2
  2153  57a5cce4-8561-4093-812a-bf881cae177d  30 GiB             Tue Aug 15 10:49:42 2023  trash (csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2)
~~~
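To check many clones for this condition, the trash ID can be pulled out of the `parent:` line shown above. The following is a minimal sketch, assuming the text layout of `rbd info` matches the output in this report; `parent_trash_id` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: read `rbd info` text output on stdin and print the
# trash ID of the parent, if the parent line ends with "(trash <id>)".
# Prints nothing when the image has no parent or the parent is not in the trash.
parent_trash_id() {
  sed -n 's/^[[:space:]]*parent: .*(trash \([0-9a-f]\{1,\}\))[[:space:]]*$/\1/p'
}
```

Possible usage: `rbd info orion/csi-vol-fdc42829-e939-4cad-be29-d5f6046eefc2 | parent_trash_id`, whose result can then be looked up in `rbd trash ls --pool orion`.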

The CSI driver starts flattening the older snapshots when the snapshot count reaches the soft limit of 250. However, flattening fails with the errors below:

~~~
I0815 13:49:48.576387       1 controllerserver.go:557] ID: 2371 Req-ID: snapshot-e57264db-b2c8-4bc7-8126-13fd9df21038 snapshots count 254 on image: orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c reached configured soft limit 250

E0815 13:49:48.786336       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7 does not exist)"

E0815 13:49:48.797025       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08 does not exist)"

E0815 13:49:48.807744       1 rbd_util.go:771] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to flatten orion/csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b; err rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b does not exist)"

E0815 13:49:48.819930       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40 does not exist)"
~~~

The images that the CSI driver is trying to flatten are in the trash:

~~~
# rbd trash ls --pool orion |egrep "54fa655a|d4613468|5b39031a|b16fe0cc"
30fdb892b3c763 csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7
30fdb8adb44384 csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b
30fdb8b8b7c76f csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08
30fdb8e671371f csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40
~~~
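The image names used in the `egrep` above were picked from the log by hand. A rough sketch for extracting them automatically follows, assuming the two error message formats quoted in this report ("failed to add task flatten for ..." and "failed to flatten ...;"); `failed_flatten_images` is a hypothetical name.

```shell
# Hypothetical helper: read provisioner log lines on stdin and print the
# unique pool/image names for which a flatten attempt failed.
# Only the two message formats quoted in this report are recognized.
failed_flatten_images() {
  sed -n \
    -e 's/.*failed to add task flatten for \([^ ]*\) .*/\1/p' \
    -e 's/.*failed to flatten \([^ ;]*\);.*/\1/p' | sort -u
}
```

The part of each name after the pool prefix can then be matched against `rbd trash ls --pool orion` to confirm that the flatten targets are trashed images.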

Subsequently, after more VMs were created from the golden image, the snapshot count hit the hard limit:


~~~
I0815 15:11:45.211253       1 controllerserver.go:536] ID: 3701 Req-ID: snapshot-ecd44693-7d3c-4021-aa39-502857abce2a snapshots count 454 on image: orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c reached configured hard limit 450



# rbd info  csi-vol-577fa68a-4b37-479a-bf58-023a320af82c  --pool orion
rbd image 'csi-vol-577fa68a-4b37-479a-bf58-023a320af82c':
	size 30 GiB in 7680 objects
	order 22 (4 MiB objects)
	snapshot_count: 454                 <===
	id: 27cab14fb82d74
	block_name_prefix: rbd_data.27cab14fb82d74
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
	op_features: clone-parent, snap-trash
	flags:
	create_timestamp: Tue Aug  1 09:52:03 2023
	access_timestamp: Tue Aug 15 11:30:17 2023
	modify_timestamp: Tue Aug  1 09:52:03 2023

~~~

As a result, newer VolumeSnapshots wait indefinitely.
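To see how close a parent image is to the two thresholds reported in the logs (soft limit 250, hard limit 450), the `snapshot_count` field of `rbd info` can be compared against them. This is only an illustrative sketch: the helper names are hypothetical, and treating "reached" as "greater than or equal to" is an assumption, since the driver's exact comparison is not shown in this report.

```shell
# Thresholds as they appear in the log messages of this report.
SOFT_LIMIT=250
HARD_LIMIT=450

# Hypothetical helper: read `rbd info` text on stdin, print snapshot_count.
snapshot_count() {
  sed -n 's/^[[:space:]]*snapshot_count:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: print "hard", "soft", or "ok" for a given count.
# Assumption: a limit counts as reached when count >= limit.
limit_state() {
  if [ "$1" -ge "$HARD_LIMIT" ]; then
    echo hard
  elif [ "$1" -ge "$SOFT_LIMIT" ]; then
    echo soft
  else
    echo ok
  fi
}
```

Possible usage: `limit_state "$(rbd info csi-vol-577fa68a-4b37-479a-bf58-023a320af82c --pool orion | snapshot_count)"`.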

Shouldn't the driver flatten the cloned image rather than the temporary snapshot in the trash?

~~~
# rbd info orion/csi-vol-f6c4ae6d-de88-4840-b278-a226c8fc6942 |grep parent
	parent: orion/csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349@2fdad284-f361-44eb-bcc0-b5f7e2bf11f4 (trash 30fdb86b06e5f9)

# rbd flatten orion/csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349
rbd: error opening image csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349: (2) No such file or directory

# rbd flatten orion/csi-vol-f6c4ae6d-de88-4840-b278-a226c8fc6942
Image flatten: 100% complete...done.
~~~
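Since flattening the cloned image succeeds, the manual workaround amounts to finding every image whose parent is in the trash and flattening it. A hedged sketch of the discovery step follows, using only `rbd ls` and `rbd info` and assuming the `(trash <id>)` suffix shown above; `clones_with_trashed_parent` and the overridable `RBD` variable are hypothetical conveniences, not part of any product tooling.

```shell
# Command used to talk to the cluster; can be overridden with a stub for dry runs.
RBD="${RBD:-rbd}"

# Hypothetical helper: print pool/image for every image in the given pool
# whose `rbd info` output shows a parent that sits in the trash.
clones_with_trashed_parent() (
  pool="$1"
  "$RBD" ls "$pool" | while IFS= read -r image; do
    # stdin is redirected so the inner command cannot consume the image list.
    if "$RBD" info "$pool/$image" </dev/null | grep -q 'parent: .*(trash '; then
      printf '%s/%s\n' "$pool" "$image"
    fi
  done
)

# The flatten step is left commented out on purpose, because flattening
# copies data from the parent and may take a long time on large images:
# clones_with_trashed_parent orion | while IFS= read -r img; do "$RBD" flatten "$img" </dev/null; done
```

Listing first and flattening second keeps the destructive-looking step explicit, so the candidate images can be reviewed before any data is copied.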



Version of all relevant components (if applicable):

OpenShift Data Foundation     4.12.5-rhodf


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

The user is not able to create more than 450 VMs from a golden image.


Is there any workaround available to the best of your knowledge?

Users have to manually flatten the cloned image.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2.

Is this issue reproducible?

Yes, 100%.


Can this issue be reproduced from the UI?

Yes.


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

I reproduced it in OpenShift Virtualization, but I think it should also be easily reproducible by repeatedly creating a snapshot and then cloning an image from that snapshot, more than 450 times.

I created 450+ VMs using the below for loop:

# for i in {1..460};do virtctl create vm --volume-datasource=src:openshift-virtualization-os-images/rhel9 |oc create -f -;sleep 10;done 


Actual results:

The automatic flattening of snapshots is not working.

Expected results:

It should start flattening the snapshots automatically when the number of snapshots of the image reaches 250.

Additional info:

[1] https://github.com/kubevirt/containerized-data-importer/blob/main/doc/smart-clone.md#smart-cloning
