2190161 – Support to reject CephFS clones if cloner threads are not available.

Summary: Support to reject CephFS clones if cloner threads are not available.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	ODF 4.17.0
Assignee:	Neeraj Pratap Singh
QA Contact:	Yuli Persky
Docs Contact:
URL:
Whiteboard:
Depends On:	2196829
Blocks:	2281703 2290711

Reported:	2023-04-27 10:50 UTC by Rakshith
Modified:	2024-10-30 14:25 UTC
CC List:	7 users
Fixed In Version:	4.17.0-100
Doc Type:	Enhancement
Doc Text:	.New restored or cloned CephFS PVC creation no longer slows down due to parallel clone limit Previously, upon reaching the limit of parallel CephFS clones, the rest of the clones would queue up, slowing down the cloning. With this enhancement, once the limit of parallel clones is reached, new clone creation requests are rejected. The default parallel clone creation limit is 4. To increase the limit, contact customer support.
Clone Of:
Clones:	2196829 2290711 (view as bug list)
Environment:
Last Closed:	2024-10-30 14:25:27 UTC
Embargoed:
Flags:	rar: needinfo- rar: needinfo- rar: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	2148365	1	None	None	None	2024-06-27 07:14:10 UTC
Red Hat Bugzilla	2160422	1	None	None	None	2024-06-27 07:14:10 UTC
Red Hat Bugzilla	2182168	0	unspecified	CLOSED	unable to remove cephfs clones	2023-11-11 00:45:15 UTC
Red Hat Product Errata	RHSA-2024:8676	0	None	None	None	2024-10-30 14:25:44 UTC

Description Rakshith 2023-04-27 10:50:05 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

1. CephFS clone creation has a limit of 4 parallel clones at a time, and the rest
of the clone create requests are queued. This makes CephFS cloning very slow when a large number of clones are being created.

2. CephCSI/Kubernetes storage does not have a mechanism to delete in-progress clones, so deleting the corresponding Kubernetes PVC object may leave stale resources behind.

For these reasons, we are seeing a lot of customer cases with stale
CephFS clones.
Examples:
- https://bugzilla.redhat.com/show_bug.cgi?id=2148365
- https://bugzilla.redhat.com/show_bug.cgi?id=2160422
- https://bugzilla.redhat.com/show_bug.cgi?id=2182168

This situation requires manual clean up.

Version of all relevant components (if applicable):
All supported ODF backing ceph versions

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue reproduce from the UI?
yes

Steps to Reproduce:
1. Create a PVC.
2. Create a snapshot and restore it to a new PVC.
3. Repeat step 2 many times.
4. Delete all PVCs and snapshots, then restart the CephFS provisioner pod.

Actual results:
Many in-progress CephFS clones remain, along with stale resources from completed clones.

Expected results:
Zero in-progress cephfs clones or stale resources.

Additional info:

We can avoid this situation if CephFS clone create requests are not queued, as
they are currently.

Preferred solutions:

1. Have the `ceph fs subvolume snapshot clone` command reject clones when the number of in-progress clones == max_concurrent_clones, and provide a flag to allow pending clones:
`--max_pending_clones=<int>` (default value 0).

2. Add an option to `ceph fs subvolume snapshot clone` to limit the number of pending clones: `--max_pending_clones=<int>` (default value infinity).
(Option 2 would require changes in CephCSI and maybe other components.)
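
The admission rule proposed in option 1 can be sketched as follows. This is only an illustrative model in Python, not the Ceph implementation: the class `CloneAdmission`, the exception `CloneRejected`, and the methods `submit` and `finish` are hypothetical names, and only `max_concurrent_clones` (4) and the proposed `max_pending_clones` (default 0) come from this report.

```python
from dataclasses import dataclass


class CloneRejected(Exception):
    """Hypothetical error: the clone request cannot be accepted right now."""


@dataclass
class CloneAdmission:
    """Toy model of the proposed accept/queue/reject decision (assumption)."""

    max_concurrent_clones: int = 4  # parallel clone limit quoted in this report
    max_pending_clones: int = 0     # proposed flag; 0 means "never queue"
    in_progress: int = 0
    pending: int = 0

    def submit(self) -> str:
        """Accept a clone request, queue it, or reject it."""
        if self.in_progress < self.max_concurrent_clones:
            self.in_progress += 1
            return "in-progress"
        if self.pending < self.max_pending_clones:
            self.pending += 1
            return "pending"
        raise CloneRejected("all cloner threads are busy; retry later")

    def finish(self) -> None:
        """Mark one clone as done and promote a pending clone, if any."""
        if self.in_progress == 0:
            raise ValueError("no clone is in progress")
        self.in_progress -= 1
        if self.pending > 0:
            self.pending -= 1
            self.in_progress += 1


if __name__ == "__main__":
    admission = CloneAdmission()
    for _ in range(4):
        print(admission.submit())
    try:
        admission.submit()
    except CloneRejected as err:
        print("rejected:", err)
```

With the defaults, the fifth concurrent request is rejected immediately instead of being queued, so the caller learns why the clone did not start. Setting `max_pending_clones` to a large value in this model approximates the queueing behaviour described in option 2.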

CephCSI and Kubernetes storage inherently retry requests with exponential backoff,
so even if a few requests fail, they will be retried and the CephFS clone will eventually complete.

By not allowing clones to queue up in the pending state, we avoid stale resources and
inform users exactly why the clones are taking so much time.
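
A minimal sketch of the retry behaviour this relies on, assuming a caller that treats a rejected clone as a transient error. The function `retry_with_backoff`, the exception `CloneRejected`, and the delay values are illustrative assumptions; they are not the actual CephCSI or Kubernetes backoff parameters.

```python
import time


class CloneRejected(Exception):
    """Hypothetical error raised when the clone request is rejected."""


def retry_with_backoff(create_clone, initial_delay=1.0, factor=2.0,
                       max_delay=300.0, max_attempts=10, sleep=time.sleep):
    """Call create_clone() until it succeeds, waiting longer after each rejection."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return create_clone()
        except CloneRejected:
            if attempt == max_attempts:
                raise  # give up and surface the rejection to the caller
            sleep(delay)
            # Grow the wait geometrically, capped at max_delay.
            delay = min(delay * factor, max_delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_clone():
        # Simulate a backend that rejects the first three requests.
        state["calls"] += 1
        if state["calls"] <= 3:
            raise CloneRejected("cloner threads busy")
        return "clone-created"

    print(retry_with_backoff(flaky_clone, initial_delay=0.01))
```

Because each rejection is returned to the caller right away, the reason for the delay is visible on the requesting side rather than hidden in a backend queue.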

Comment 20 Sunil Kumar Acharya 2024-09-18 12:06:54 UTC

Please update the RDT flag/text appropriately.

Comment 24 errata-xmlrpc 2024-10-30 14:25:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676
