This feature has been in RHCS 5 for a while but has not been documented or advised to be turned on at the CSI level. Let's do that for the next major release? Documentation is in-flight here: https://github.com/ceph/ceph/pull/43896 Along with this change, we can increase max_mds on some or all file systems via Rook.
Also a little explanation for why we should want this:

Currently we're limiting our Ceph file systems to a single active MDS. For some customers, this presents a bottleneck for metadata I/O. Using a single rank also increases the amount of metadata that must be cached when there are many clients, which can increase failover times. We've already seen one instance where the sole active MDS had so much metadata to load into cache that it would OOM [1]. (Note: this is fixed in RHCS 5 with a new configuration option, but it could have been avoided by using more MDS ranks.)

Setting max_mds=2 and using two ranks is not the only change required, however. The default automatic balancer is known to be prone to poor behavior, which is why we have pinning policies [2] to control how metadata/subtrees are distributed. The "distributed" policy makes the most sense for CSI, as it automatically stripes the subvolumes (PVs) across multiple MDS ranks. It will always result in a net improvement over a single-rank file system, with minimal additional technical risk.

[1] bz2020767
[2] https://docs.ceph.com/en/pacific/cephfs/multimds/#setting-subtree-partitioning-policies
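As a rough sketch of the two pieces described above (the file system name `myfs` and the mounted group path are placeholders, and these commands need a live cluster):

```shell
# Allow two active MDS ranks on the file system ("myfs" is a placeholder).
ceph fs set myfs max_mds 2

# Apply the "distributed" ephemeral pinning policy to the CSI subvolume
# group directory so subtrees under it are striped across active ranks
# (mount point path is an assumption).
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi
```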
Travis, `max_mds` seems like an option that Rook needs to set when creating the CephFilesystem. Is that something that is done already, or has that been requested as a feature before?

In case Rook does not have the `max_mds` option yet, this RFE should be split into at least two upstream issues:

1. rook: add the `max_mds` CephFS option
2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a new volume

ODF-4.10 planning has concluded, so this will be considered for ODF-4.11.
Yes, Rook currently sets max_mds to the desired number of active MDS daemons, which by default (and with OCS) should be 1. See https://github.com/rook/rook/blob/990d92790b58d455cd28bf9773685b3540ff5bf0/pkg/daemon/ceph/client/filesystem.go#L179-L182
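For reference, Rook derives max_mds from the active MDS count in the CephFilesystem CR, so raising it is a CR change rather than a direct `ceph fs set`. A hedged sketch (the resource name `myfs` and namespace `rook-ceph` are assumptions; requires a running Rook cluster):

```shell
# Patch the CephFilesystem CR so Rook reconciles max_mds up to 2.
# "myfs" and "rook-ceph" are placeholder names for this example.
kubectl -n rook-ceph patch cephfilesystem myfs --type merge \
  -p '{"spec":{"metadataServer":{"activeCount":2}}}'
```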
(In reply to Niels de Vos from comment #4)
> 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a new volume

Correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`
(In reply to Patrick Donnelly from comment #6)
> (In reply to Niels de Vos from comment #4)
> > 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a new volume
>
> correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`

Thanks. This should be doable from the Ceph CSI side.
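The corrected command from comment #6 can be checked after the fact by reading the pin attribute on the group directory. A sketch (the mount point path is an assumption; file system and group names are taken from the comment):

```shell
# Pin the "csi" subvolume group with the distributed policy (1 = enabled).
ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1

# Verify by reading the extended attribute on the group directory
# (mount point path is an assumption).
getfattr -n ceph.dir.pin.distributed /mnt/cephfs/volumes/csi
```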
https://github.com/ceph/ceph-csi/issues/2637 covers this in Ceph CSI.
Created an epic for this. https://issues.redhat.com/browse/RHSTOR-3270
> hchiramm Could you please prioritize this?

The support has to be added in go-ceph and then consumed in Ceph CSI; we have trackers in the upstream repos for the same. For the downstream side of things, which this BZ covers, this is not part of ODF 4.13.
The happy path of the feature (internal mode cluster) has passed in 4.15.0-99. Will update on external mode.
Manual testing is complete (external and regular modes), and an automated test has been added: https://github.com/red-hat-storage/ocs-ci/pull/9369.
tests/functional/storageclass/test_csi_subvolume_group_property.py - this is the new automated scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383