Bug 2196829 - Support to reject CephFS clones if cloner threads are not available.
Summary: Support to reject CephFS clones if cloner threads are not available.
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.2
Assignee: Neeraj Pratap Singh
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks: 2190161 2238406
 
Reported: 2023-05-10 11:20 UTC by Neeraj Pratap Singh
Modified: 2023-09-11 17:32 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2190161
Environment:
Last Closed:
Embargoed:
neesingh: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 59714 0 None None None 2023-05-11 05:49:14 UTC
Red Hat Bugzilla 2210388 0 unspecified CLOSED VolumeSnapshot Content cannot be Deleted 2024-02-26 18:25:13 UTC
Red Hat Issue Tracker RHCEPH-6637 0 None None None 2023-05-10 11:22:57 UTC

Internal Links: 2148365

Description Neeraj Pratap Singh 2023-05-10 11:20:05 UTC
+++ This bug was initially created as a clone of Bug #2190161 +++

Description of problem (please be as detailed as possible and provide log
snippets):

1. CephFS clone creation has a limit of 4 parallel clones at a time, and the rest
of the clone create requests are queued. This makes CephFS cloning very slow when a large number of clones are being created (the limit is a mgr config option; see the config sketch below).

2. CephCSI/Kubernetes storage does have a mechanism to delete in-progress clones, but deletion of the corresponding Kubernetes PVC object may leave a stale resource behind.

Due to the above reasons, we are seeing a lot of customer cases with stale cephfs clones.
Examples:
- https://bugzilla.redhat.com/show_bug.cgi?id=2148365
- https://bugzilla.redhat.com/show_bug.cgi?id=2160422
- https://bugzilla.redhat.com/show_bug.cgi?id=2182168

This situation requires manual cleanup.
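
For reference, a minimal sketch of the existing knobs involved (the filesystem and clone names are illustrative):

```
# The parallelism limit from point 1 is the mgr/volumes option
# max_concurrent_clones (default 4); it can be inspected and adjusted:
ceph config get mgr mgr/volumes/max_concurrent_clones
ceph config set mgr mgr/volumes/max_concurrent_clones 8

# The state of an individual clone (pending / in-progress / complete / failed)
# can be checked with:
ceph fs clone status cephfs clone0
```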

Version of all relevant components (if applicable):
All supported ODF backing ceph versions

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

Steps to Reproduce:
1. Create a PVC.
2. Create a snapshot of the PVC and restore it to a new PVC.
3. Repeat step 2 many times.
4. Delete all PVCs and snapshots, then restart the cephfs provisioner pod. (A Ceph-side equivalent is sketched below.)
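
For illustration, a rough Ceph-side equivalent of the steps above, bypassing Kubernetes (filesystem, subvolume, and snapshot names are hypothetical):

```
# Create a subvolume, snapshot it, and issue many clones of that snapshot.
ceph fs subvolume create cephfs subvol0
ceph fs subvolume snapshot create cephfs subvol0 snap0
for i in $(seq 1 20); do
    ceph fs subvolume snapshot clone cephfs subvol0 snap0 clone-$i
done

# With the default max_concurrent_clones of 4, only 4 clones run at a time;
# the remaining ones sit in the "pending" state:
ceph fs clone status cephfs clone-20
```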

Actual results:
A lot of in-progress cephfs clones and stale resources from completed clones.

Expected results:
Zero in-progress cephfs clones or stale resources.

Additional info:

We can avoid this situation if cephfs clone create requests are not queued, as they currently are.

Preferable solutions: 

1. Have the `ceph fs subvolume snapshot clone` command reject clones if the number of in-progress clones == max_concurrent_clones, and provide a flag to allow pending clones: `--max_pending_clones=<int>` (default value 0). (A hypothetical usage sketch follows this list.)

2. Add an option to `ceph fs subvolume snapshot clone` to limit the number of pending clones: `--max_pending_clones=<int>` (default value infinity).
(Option 2 would require changes in cephcsi and maybe other components.)
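
A hypothetical usage sketch of option 1 (the `--max_pending_clones` flag does not exist today; this only illustrates the proposed behaviour):

```
# Proposed behaviour: with no pending slots allowed, the request would be rejected
# immediately (e.g. with EAGAIN) once max_concurrent_clones clones are in progress,
# instead of being queued.
ceph fs subvolume snapshot clone cephfs subvol0 snap0 clone0 --max_pending_clones=0
```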

CephCSI and Kubernetes storage inherently retry requests with exponential backoff,
so even if a few requests fail, they will be retried and the cephfs clones will eventually complete.

By not allowing clones to queue up in the pending state, we avoid stale resources and
inform users exactly why their clones are taking so long.
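
As a rough sketch of that retry behaviour from a client's point of view (not actual cephcsi code; it assumes the ceph CLI maps a rejected clone to EAGAIN, i.e. exit code 11):

```
# Retry the clone with exponential backoff for as long as the cluster rejects it
# because all cloner threads are busy.
delay=1
until ceph fs subvolume snapshot clone cephfs subvol0 snap0 clone0; do
    rc=$?
    [ "$rc" -ne 11 ] && exit "$rc"   # give up on any error other than EAGAIN
    sleep "$delay"
    delay=$((delay * 2))             # exponential backoff, as cephcsi/Kubernetes do
done
```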

--- Additional comment from RHEL Program Management on 2023-04-27 10:50:16 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is being proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Madhu Rajanna on 2023-04-27 11:09:26 UTC ---



Yes, this is a kind of temporary fix until ceph fixes the clone problem.

>1. Have `ceph fs subvolume snapshot clone` command reject clones if number of in-progress clones == max_concurrent_clones and provide a flag to allow pending clones
> `--max_pending_clones=<int>` (default value 0)

IMO The decision on the count should be made at the Ceph config level, not at this point.

>2. Add an option to `ceph fs subvolume snapshot clone` to limit number of pending clones `--max_pending_clones=<int>`(default value infinity).
> (Option 2 would require changes at cephcsi and maybe other components).


IMO The decision on the count should be made at the Ceph config level, not at this point.


I would go for one more option, where ceph can provide details on whether the clone thread/queue is full or not:



```
3. ceph fs clone thread/queue status

* Empty
* In-progress
* Full
```
Based on the above state, the client (cephcsi) will retry the PVC clone request until the status is empty/in-progress. This needs extra work at cephcsi, where we need to
check the state of the ceph cluster before the clone operation.

or

```
4. ceph fs subvolume snapshot clone --reject-if-queue-is-full
```

With this one, cephcsi will still create/delete the rados objects for each request.

I would go for option 3; it is extra logic at cephcsi, but it avoids the extra create/delete operations.

--- Additional comment from Venky Shankar on 2023-04-27 12:23:27 UTC ---

(In reply to Madhu Rajanna from comment #2)
> 
> Yes, this is a kind of temporary fix until ceph fixes the clone problem.
> 
> >1. Have `ceph fs subvolume snapshot clone` command reject clones if number of in-progress clones == max_concurrent_clones and provide a flag to allow pending clones
> `--max_pending_clones=<int>`(default value 0)
> 
> IMO The decision on the count should be made at the Ceph config level, not
> at this point.
> 
> >2. Add an option to `ceph fs subvolume snapshot clone` to limit number of pending clones `--max_pending_clones=<int>`(default value infinity).
> (Option 2 would require changes at cephcsi and maybe other components).
> 
> 
> IMO The decision on the count should be made at the Ceph config level, not
> at this point.
> 
> 
> I would go for one more option where ceph can provide details like clone
> thread/queue is full or not
> 
> 
> 
> ```
> 3. ceph fs clone thread/queue status
> 
> * Empty
> * In-progress
> * Full
> ```

Is there a need for this interface? `ceph fs subvolume snapshot clone` can either accept the clone request or return EAGAIN (when no threads are free for cloning). Wouldn't that suffice?

> Based on the above state client(cephcsi) will request the PVC clone status
> until the status is empty/in progress. This needs extra work at cephcsi,
> where we need to
> check the state of the ceph cluster before the clone operation.
> 
> or
> 
> ```
> 4. ceph fs subvolume snapshot clone --reject-if-queue-is-full
> ```
> 
> With this one, cephcsi will still create/delete the rados objects for reach
> requests.
> 
> I will go for option 3 as its an extra logic at cephcsi, but it avoids extra
> create/delete operations

Is the object create/rm problematic? I would like it if we could avoid an extra interface for this...

--- Additional comment from Madhu Rajanna on 2023-04-27 12:47:15 UTC ---

(In reply to Venky Shankar from comment #3)
> (In reply to Madhu Rajanna from comment #2)
> > 
> > Yes, this is a kind of temporary fix until ceph fixes the clone problem.
> > 
> > >1. Have `ceph fs subvolume snapshot clone` command reject clones if number of in-progress clones == max_concurrent_clones and provide a flag to allow pending clones
> > `--max_pending_clones=<int>`(default value 0)
> > 
> > IMO The decision on the count should be made at the Ceph config level, not
> > at this point.
> > 
> > >2. Add an option to `ceph fs subvolume snapshot clone` to limit number of pending clones `--max_pending_clones=<int>`(default value infinity).
> > (Option 2 would require changes at cephcsi and maybe other components).
> > 
> > 
> > IMO The decision on the count should be made at the Ceph config level, not
> > at this point.
> > 
> > 
> > I would go for one more option where ceph can provide details like clone
> > thread/queue is full or not
> > 
> > 
> > 
> > ```
> > 3. ceph fs clone thread/queue status
> > 
> > * Empty
> > * In-progress
> > * Full
> > ```
> 
> Is there a need for this interface? `ceph fs subvolume snapshot clone` can
> either accept the clone request or return back EAGAIN (when no threads are
> free for cloning). Wouldn't that suffice?
> 

Yes, it should be sufficient for us if cephfs returns an error.


> > Based on the above state client(cephcsi) will request the PVC clone status
> > until the status is empty/in progress. This needs extra work at cephcsi,
> > where we need to
> > check the state of the ceph cluster before the clone operation.
> > 
> > or
> > 
> > ```
> > 4. ceph fs subvolume snapshot clone --reject-if-queue-is-full
> > ```
> > 
> > With this one, cephcsi will still create/delete the rados objects for reach
> > requests.
> > 
> > I will go for option 3 as its an extra logic at cephcsi, but it avoids extra
> > create/delete operations
> 
> Is the object create/rm problematic? I would like if we can avoid an extra
> interface for this...

It is more of a performance concern; since it is a trade-off, we are fine with cephfs returning an error when the clone threads are full.

--- Additional comment from Sunil Kumar Acharya on 2023-05-02 15:21:07 UTC ---

This BZ is neither marked as a blocker for ODF-4.13 nor has it received any acks from development, hence we are moving the BZ out of 4.13 as part of Dev Freeze. Feel free to propose it back to ODF-4.13, with a note, if this is a critical/blocker issue for the release.

--- Additional comment from Venky Shankar on 2023-05-03 07:48:43 UTC ---

Neeraj, please create a tracker ticket for this. Also, this requires a cloned BZ for the Ceph component.

