Bug 2232163 - Automatic flattening of snapshots is not working
Summary: Automatic flattening of snapshots is not working
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.12
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rakshith
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-15 16:26 UTC by nijin ashok
Modified: 2024-04-24 11:45 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Automatic flattening of snapshots does not work
When there is a single common parent RBD PVC, and volume snapshot, restore, and delete snapshot are performed in sequence more than 450 times, it is no longer possible to take a volume snapshot or clone of the common parent RBD PVC. To work around this issue, use PVC-to-PVC clone instead of the volume snapshot, restore, and delete snapshot sequence; this avoids the issue entirely. If you hit this issue, contact customer support to perform manual flattening of the final restore PVCs so that volume snapshots or clones of the common parent PVC can be taken again.
Clone Of:
Environment:
Last Closed: 2024-01-04 11:03:16 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-csi issues 4069 0 None open RBD: Snapshots >450 cannot be taken on a PVC in certain cases 2023-08-24 10:29:42 UTC
Red Hat Knowledge Base (Solution) 7030571 0 None None None 2023-08-25 13:53:19 UTC

Internal Links: 2256751

Description nijin ashok 2023-08-15 16:26:37 UTC
Description of problem:

The issue is observed in an OpenShift Virtualization environment, where the snapshot count hits the hard limit of 450 when the user creates 450+ VMs from the golden image/template image. When a VM is created from the golden image/template image, the containerized-data-importer (CDI) does smart cloning[1]:

=> Create a snapshot of the source PVC
=> Create a PVC from the created snapshot
=> Delete the snapshot

So even if we delete the snapshot, it remains in the trash because it is still linked to the cloned images.
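
For reference, the same three steps can be reproduced by hand through the Kubernetes snapshot API. A minimal sketch follows; the resource names and the storage/snapshot class names are assumptions, not values taken from this cluster:

~~~
# Minimal sketch of the smart-clone sequence, driven by hand with oc.
# All names below (golden PVC, snapshot/storage classes) are placeholders.

# 1. Snapshot the golden-image PVC.
cat <<'EOF' | oc create -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: golden-snap
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: rhel9-golden
EOF

# 2. Restore a new PVC from that snapshot.
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-1
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  dataSource:
    name: golden-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 30Gi
EOF

# 3. Delete the snapshot. The backing csi-snap RBD image moves to the trash
#    because the restored image still references it as its parent.
oc delete volumesnapshot golden-snap
~~~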

~~~
A cloned image:

# rbd info orion/csi-vol-fdc42829-e939-4cad-be29-d5f6046eefc2 |grep parent
	parent: orion/csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2@2762fc24-7919-4ea2-b096-d92636352751 (trash 30fdb8fd11e6fc)

# rbd trash ls --pool orion |grep 30fdb8fd11e6fc
30fdb8fd11e6fc csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2

Source image:

rbd snap ls  orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c --all |grep csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2
  2153  57a5cce4-8561-4093-812a-bf881cae177d  30 GiB             Tue Aug 15 10:49:42 2023  trash (csi-snap-5a84979d-7ab0-4be2-afd4-c10fdab3dde2)
~~~

The CSI driver starts flattening the older snapshots when the snapshot count reaches the soft limit of 250. However, it is failing with the errors below:

~~~
I0815 13:49:48.576387       1 controllerserver.go:557] ID: 2371 Req-ID: snapshot-e57264db-b2c8-4bc7-8126-13fd9df21038 snapshots count 254 on image: orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c reached configured soft limit 250

E0815 13:49:48.786336       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7 does not exist)"

E0815 13:49:48.797025       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08 does not exist)"

E0815 13:49:48.807744       1 rbd_util.go:771] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to flatten orion/csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b; err rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b does not exist)"

E0815 13:49:48.819930       1 rbd_util.go:823] ID: 2369 Req-ID: snapshot-e4e5635f-7f7c-499e-a287-c58d15c115f4 failed to add task flatten for orion/csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40 : rados: ret=-2, No such file or directory: "[errno 2] RBD image not found (Image orion/csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40 does not exist)"
~~~

The images that the CSI driver is trying to flatten are in the trash:

~~~
# rbd trash ls --pool orion |egrep "54fa655a|d4613468|5b39031a|b16fe0cc"
30fdb892b3c763 csi-snap-54fa655a-a088-463f-a2fc-7e747fe5d2b7
30fdb8adb44384 csi-snap-5b39031a-644c-4579-b58b-5c3357efd15b
30fdb8b8b7c76f csi-snap-d4613468-6990-4ffd-8494-a87d1fd6fa08
30fdb8e671371f csi-snap-b16fe0cc-07fd-4b75-a59d-541b96151f40
~~~
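
The "failed to add task flatten" errors appear to come from ceph-csi handing flatten jobs to the ceph-mgr rbd task queue. A quick way to see whether any flatten tasks were actually scheduled, and how far the parent image is from the soft (250) and hard (450) limits, is sketched below; running it from the rook-ceph toolbox (or any host with the admin keyring) is an assumption about the environment:

~~~
# List flatten/remove tasks currently queued in the ceph-mgr rbd_support module.
ceph rbd task list

# How many intermediate csi-snap images have accumulated in the trash.
rbd trash ls --pool orion | grep -c csi-snap

# Snapshot count on the common parent image, to compare against the
# soft limit (250) and hard limit (450) from the CSI logs above.
rbd info orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c | grep snapshot_count
~~~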

Subsequently, after creating more VMs from the golden image, the snapshot count hit the hard limit:


~~~
I0815 15:11:45.211253       1 controllerserver.go:536] ID: 3701 Req-ID: snapshot-ecd44693-7d3c-4021-aa39-502857abce2a snapshots count 454 on image: orion/csi-vol-577fa68a-4b37-479a-bf58-023a320af82c reached configured hard limit 450



# rbd info  csi-vol-577fa68a-4b37-479a-bf58-023a320af82c  --pool orion
rbd image 'csi-vol-577fa68a-4b37-479a-bf58-023a320af82c':
	size 30 GiB in 7680 objects
	order 22 (4 MiB objects)
	snapshot_count: 454                 <===
	id: 27cab14fb82d74
	block_name_prefix: rbd_data.27cab14fb82d74
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, operations
	op_features: clone-parent, snap-trash
	flags:
	create_timestamp: Tue Aug  1 09:52:03 2023
	access_timestamp: Tue Aug 15 11:30:17 2023
	modify_timestamp: Tue Aug  1 09:52:03 2023

~~~

So newer volume snapshots have to wait indefinitely.

Shouldn't it run flatten on the cloned image rather than the temporary snapshot in the trash?

~~~
# rbd info orion/csi-vol-f6c4ae6d-de88-4840-b278-a226c8fc6942 |grep parent
	parent: orion/csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349@2fdad284-f361-44eb-bcc0-b5f7e2bf11f4 (trash 30fdb86b06e5f9)

# rbd flatten orion/csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349
rbd: error opening image csi-snap-37801a60-460d-4ff1-a73d-0fb5f7d7a349: (2) No such file or directory

# rbd flatten orion/csi-vol-f6c4ae6d-de88-4840-b278-a226c8fc6942
Image flatten: 100% complete...done.
~~~



Version of all relevant components (if applicable):

OpenShift Data Foundation     4.12.5-rhodf


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

The user is not able to create more than 450 VMs from a golden image.


Is there any workaround available to the best of your knowledge?

Users have to manually flatten the cloned image.
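
A rough sketch of that manual step, using only the rbd commands already shown above: flatten every csi-vol image in the pool whose parent snapshot has been moved to the trash. Note that flattening copies the parent data into the clone, so it costs I/O and capacity:

~~~
POOL=orion   # pool name taken from the outputs above; adjust as needed

# Flatten every csi-vol image whose parent is an RBD snapshot in the trash.
for img in $(rbd ls --pool "$POOL" | grep '^csi-vol-'); do
  if rbd info "$POOL/$img" | grep -q 'parent:.*(trash'; then
    echo "flattening $POOL/$img"
    rbd flatten "$POOL/$img"
  fi
done
~~~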


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2.

Is this issue reproducible?

Yes, 100%.


Can this issue be reproduced from the UI?

Yes.


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

I reproduced it in OpenShift Virtualization, but I think it should also be easily reproducible by creating snapshots and then cloning images from the snapshots 450+ times; a rough loop for that is sketched after the command below.

I created 450+ VMs using the below for loop:

# for i in {1..460};do virtctl create vm --volume-datasource=src:openshift-virtualization-os-images/rhel9 |oc create -f -;sleep 10;done 
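
For completeness, a hedged sketch of a reproducer that does not need OpenShift Virtualization, looping the same snapshot, restore, and delete-snapshot sequence against a single source PVC (the PVC name "golden" and the storage/snapshot class names are placeholders):

~~~
for i in $(seq 1 460); do
  # Create the snapshot and the PVC restored from it in one shot; the
  # provisioner retries the restore until the snapshot is ready.
  cat <<EOF | oc create -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: golden-snap-$i
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: golden
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-$i
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  dataSource:
    name: golden-snap-$i
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 30Gi
EOF
  # Wait for the restore to bind, then delete the snapshot so the backing
  # csi-snap image lands in the trash, as in the CDI flow.
  oc wait pvc/restore-$i --for=jsonpath='{.status.phase}'=Bound --timeout=300s
  oc delete volumesnapshot golden-snap-$i
done
~~~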


Actual results:

The automatic flattening of snapshots is not working.

Expected results:

It should start flattening snapshots automatically when the number of snapshots on the image reaches the soft limit of 250.

Additional info:

[1] https://github.com/kubevirt/containerized-data-importer/blob/main/doc/smart-clone.md#smart-cloning

