Bug 2022467 - [RFE] enable distributed ephemeral pins on "csi" subvolume group
Summary: [RFE] enable distributed ephemeral pins on "csi" subvolume group
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: unspecified
Hardware: All
OS: All
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Parth Arora
QA Contact: Yuli Persky
URL:
Whiteboard:
Depends On: 1759702
Blocks: 2246375
 
Reported: 2021-11-11 17:38 UTC by Patrick Donnelly
Modified: 2024-03-19 15:19 UTC
CC: 12 users

Fixed In Version: 4.15.0-103
Doc Type: Enhancement
Doc Text:
.Enhanced data distribution for CephFS storage class
This feature enables the default Container Storage Interface (CSI) subvolume group to be *automatically* pinned to MDS ranks according to the default pinning configuration. This is useful when there are multiple active CephFS metadata servers (MDSs) in the cluster, as it distributes the load across MDS ranks in a stable and predictable way.
Clone Of:
Environment:
Last Closed: 2024-03-19 15:19:40 UTC
Embargoed:




Links
Ceph Project Bug Tracker 53236 - last updated 2021-11-11 17:38:39 UTC
Github ceph ceph-csi issue 2637 (open): Support Subvolume{Group} pinning in ceph csi - last updated 2021-11-15 08:55:11 UTC
Github ceph go-ceph issue 611 (open): cephfs admin: wrap subvolume pinning API - last updated 2021-11-15 16:42:42 UTC
Github rook rook pull 12477 (merged): external: pin the default csi subvolume - last updated 2023-07-25 10:47:36 UTC
Red Hat Product Errata RHSA-2024:1383 - last updated 2024-03-19 15:19:47 UTC

Description Patrick Donnelly 2021-11-11 17:38:40 UTC
This feature has been available in RHCS 5 for a while but has not been documented or advised to be turned on at the CSI level; let's do that for the next major release. Documentation is in flight here:

https://github.com/ceph/ceph/pull/43896

Along with this change, we can increase max_mds on some or all file systems via Rook.
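
For illustration, a minimal sketch of the Ceph-level commands this amounts to, assuming a file system named cephfilesystem-a and the default CSI subvolume group named csi (example names, not taken from a specific cluster):

  # allow two active MDS ranks on the file system
  ceph fs set cephfilesystem-a max_mds 2

  # pin the csi subvolume group with the distributed ephemeral policy,
  # so its subvolumes are spread across the active ranks
  ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1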

Comment 3 Patrick Donnelly 2021-11-11 17:45:18 UTC
Also a little explanation for why we should want this:

Currently we're limiting our Ceph file systems to a single active MDS. For some customers, this presents a bottleneck for metadata I/O. Also, using a single rank increases the amount of metadata that must be cached when there are many clients. This can increase failover times. We've already seen one instance where the sole active MDS had so much metadata to load in cache that it would OOM [1]. (Note: this is fixed in RHCS 5 with a new configuration option but could have been avoided by using more MDS ranks.)

Increasing max_mds to 2 and using two ranks is not the only change required, however. It's known that the default automatic balancer is prone to poor behavior, which is why we have pinning policies [2] to control how metadata/subtrees are distributed. The "distributed" policy makes the most sense for CSI as it stripes the subvolumes (PVs) across multiple MDS ranks automatically. It will always result in a net improvement over a single-rank file system with minimal additional technical risk.

[1] bz2020767
[2] https://docs.ceph.com/en/pacific/cephfs/multimds/#setting-subtree-partitioning-policies
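
To make the mechanism in [2] concrete, a hedged sketch of the extended-attribute form of the distributed policy; the mount point is an assumption, and /volumes/csi is the conventional path for a subvolume group named csi:

  # assuming the CephFS file system is mounted at /mnt/cephfs
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi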

Comment 4 Niels de Vos 2021-11-12 16:36:16 UTC
Travis, `max_mds` seems like an option that Rook needs to set when creating the CephFilesystem. Is that something that is done already, or has that been requested as a feature before?

In case Rook does not have the `max_mds` option yet, this RFE should be split in at least two upstream issues:

1. rook: add the `max_mds` CephFS option
2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a new volume

ODF-4.10 planning is concluded, so this will be considered for ODF-4.11.

Comment 5 Travis Nielsen 2021-11-12 17:39:52 UTC
Yes, Rook currently sets max_mds to the desired number of active MDS daemons, which by default (and with OCS) should be 1.
See https://github.com/rook/rook/blob/990d92790b58d455cd28bf9773685b3540ff5bf0/pkg/daemon/ceph/client/filesystem.go#L179-L182
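
For context, a hedged sketch of how that surfaces to users: the CephFilesystem CR's metadataServer.activeCount is what the linked code translates into max_mds. The resource name and namespace below are assumptions for illustration:

  # bump the active MDS count on a Rook-managed file system;
  # Rook then applies the equivalent of "ceph fs set myfs max_mds 2"
  kubectl -n rook-ceph patch cephfilesystem myfs --type merge \
    -p '{"spec":{"metadataServer":{"activeCount":2}}}'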

Comment 6 Patrick Donnelly 2021-11-12 18:06:10 UTC
(In reply to Niels de Vos from comment #4)
> 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a
> new volume

correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`
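
A hedged sketch of how the result of that command can be checked afterwards; the mount point is an assumption, and getpath only prints where the group lives inside the file system:

  # print the subvolume group's path inside the file system (typically /volumes/csi)
  ceph fs subvolumegroup getpath cephfilesystem-a csi

  # with the file system mounted, read the pin policy back as a virtual xattr
  getfattr -n ceph.dir.pin.distributed /mnt/cephfs/volumes/csi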

Comment 7 Humble Chirammal 2021-11-15 04:22:47 UTC
(In reply to Patrick Donnelly from comment #6)
> (In reply to Niels de Vos from comment #4)
> > 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a
> > new volume
> 
> correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`

Thanks. This should be doable from ceph csi side.

Comment 8 Humble Chirammal 2021-11-26 06:18:46 UTC
https://github.com/ceph/ceph-csi/issues/2637 covers this in Ceph CSI.

Comment 11 Mudit Agarwal 2022-05-26 09:39:27 UTC
Created an epic for this. 
https://issues.redhat.com/browse/RHSTOR-3270

Comment 13 Humble Chirammal 2023-01-19 04:44:49 UTC
>hchiramm Could you please prioritize this?

The support has to be added in go-ceph and then consumed in Ceph CSI; we have trackers in the upstream repos for this.
For the downstream side of things, which this BZ covers, this is not part of ODF 4.13.

Comment 33 Yuli Persky 2024-01-16 16:18:49 UTC
The happy path of the feature (internal mode cluster) has passed in 4.15.0-99. Will update on external mode.

Comment 34 Yuli Persky 2024-02-28 22:37:28 UTC
Manual testing is completed (external and regular modes); an automated test has also been added: https://github.com/red-hat-storage/ocs-ci/pull/9369.

Comment 35 Yuli Persky 2024-02-28 22:56:02 UTC
tests/functional/storageclass/test_csi_subvolume_group_property.py is the new automated scenario.

Comment 37 errata-xmlrpc 2024-03-19 15:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

