Bug 2022467

Summary: [RFE] enable distributed ephemeral pins on "csi" subvolume group
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Patrick Donnelly <pdonnell>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED ERRATA
QA Contact: Yuli Persky <ypersky>
Severity: low
Docs Contact:
Priority: medium
Version: unspecified
CC: ebenahar, etamir, kbg, mmuench, muagarwa, ndevos, odf-bz-bot, owasserm, paarora, rar, tnielsen, vshankar
Target Milestone: ---
Keywords: FutureFeature, Performance, Reopened
Target Release: ODF 4.15.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version: 4.15.0-103
Doc Type: Enhancement
Doc Text:
.Enhanced data distribution for CephFS storage class
This feature enables the default subvolume groups of Container Storage Interface (CSI) to be *automatically* pinned to the ranks according to the default pinning configuration. This is useful when you have multiple active CephFS metadata servers (MDSs) in the cluster. This helps to better distribute the load across MDS ranks in stable and predictable ways.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-03-19 15:19:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1759702
Bug Blocks: 2246375

Description Patrick Donnelly 2021-11-11 17:38:40 UTC
This feature has been in RHCS 5 for a while, but it has not been documented or recommended to be turned on at the CSI level. Let's do that for the next major release? Documentation is in flight here:

https://github.com/ceph/ceph/pull/43896

Along with this change, we can increase max_mds on some or all file systems via Rook.
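
For reference, a minimal sketch of what enabling this at the CSI level amounts to on the Ceph side (the file system name "myfs" and the client mount path are placeholders, not values taken from this bug):

    # Enable distributed ephemeral pinning on the CSI subvolume group
    # directory, so each immediate child (each subvolume/PV) is hashed
    # to one of the active MDS ranks.
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/myfs/volumes/csi

The same policy can also be applied through the subvolume group interface rather than a raw xattr; see comment #6 below.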

Comment 3 Patrick Donnelly 2021-11-11 17:45:18 UTC
Also a little explanation for why we should want this:

Currently we're limiting our Ceph file systems to a single active MDS. For some customers, this presents a bottleneck for metadata I/O. Also, using a single rank increases the amount of metadata that must be cached when there are many clients, which can increase failover times. We've already seen one instance where the sole active MDS had so much metadata to load into cache that it would OOM [1]. (Note: this is fixed in RHCS 5 with a new configuration option, but it could have been avoided by using more MDS ranks.)

Increasing max_mds to 2 and using two ranks is not the only change required, however. The default automatic balancer is known to be prone to poor behavior, which is why we have pinning policies [2] to control how metadata/subtrees are distributed. The "distributed" policy makes the most sense for CSI because it automatically stripes the subvolumes (PVs) across the active MDS ranks. It will always result in a net improvement over a single-rank file system, with minimal additional technical risk.

[1] bz2020767
[2] https://docs.ceph.com/en/pacific/cephfs/multimds/#setting-subtree-partitioning-policies
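
As a concrete illustration of the two steps above (the file system name "myfs" is a placeholder):

    # Allow two active MDS ranks on the file system.
    ceph fs set myfs max_mds 2

    # Confirm that the second rank becomes active.
    ceph fs status myfs

With the distributed pinning policy from [2] applied to the CSI subvolume group (see comment #6), the subvolumes are then spread across both ranks instead of relying on the dynamic balancer.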

Comment 4 Niels de Vos 2021-11-12 16:36:16 UTC
Travis, `max_mds` seems like an option that Rook needs to set when creating the CephFilesystem. Is that something that is done already, or has that been requested as a feature before?

In case Rook does not have the `max_mds` option yet, this RFE should be split into at least two upstream issues:

1. rook: add the `max_mds` CephFS option
2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a new volume

ODF-4.10 planning is concluded, so this will be considered for ODF-4.11.

Comment 5 Travis Nielsen 2021-11-12 17:39:52 UTC
Yes, Rook currently sets max_mds to the desired number of active MDS daemons, which by default (and with OCS) should be 1.
See https://github.com/rook/rook/blob/990d92790b58d455cd28bf9773685b3540ff5bf0/pkg/daemon/ceph/client/filesystem.go#L179-L182
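
If needed, the value Rook applied can be checked directly against the cluster (the file system name "myfs" is a placeholder):

    # max_mds should match the configured number of active MDS daemons.
    ceph fs get myfs | grep max_mds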

Comment 6 Patrick Donnelly 2021-11-12 18:06:10 UTC
(In reply to Niels de Vos from comment #4)
> 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a
> new volume

correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`
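
A hedged way to verify the pin afterwards (the client mount point below is a placeholder; the vxattr name follows the multi-MDS documentation in [2] above):

    # Resolve the subvolume group's path inside the file system ...
    ceph fs subvolumegroup getpath cephfilesystem-a csi
    # ... and read the distributed-pin vxattr on it from a client mount.
    getfattr -n ceph.dir.pin.distributed /mnt/cephfilesystem-a/volumes/csi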

Comment 7 Humble Chirammal 2021-11-15 04:22:47 UTC
(In reply to Patrick Donnelly from comment #6)
> (In reply to Niels de Vos from comment #4)
> > 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a
> > new volume
> 
> correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`

Thanks. This should be doable from ceph csi side.

Comment 8 Humble Chirammal 2021-11-26 06:18:46 UTC
https://github.com/ceph/ceph-csi/issues/2637 cover this in Ceph CSI.

Comment 11 Mudit Agarwal 2022-05-26 09:39:27 UTC
Created an epic for this. 
https://issues.redhat.com/browse/RHSTOR-3270

Comment 13 Humble Chirammal 2023-01-19 04:44:49 UTC
>hchiramm Could you please prioritize this?

The support has to be added in go-ceph first and then consumed in Ceph CSI; we have trackers in the upstream repos for this.
For the downstream side of things, which this BZ covers, it is not part of ODF 4.13.

Comment 33 Yuli Persky 2024-01-16 16:18:49 UTC
The happy path of the feature (internal mode cluster) has passed in 4.15.0-99. Will update on external mode.

Comment 34 Yuli Persky 2024-02-28 22:37:28 UTC
Manual testing is completed (external and regular modes), and an automated test has been added: https://github.com/red-hat-storage/ocs-ci/pull/9369.

Comment 35 Yuli Persky 2024-02-28 22:56:02 UTC
tests/functional/storageclass/test_csi_subvolume_group_property.py  - this is the new automated scenario.

Comment 37 errata-xmlrpc 2024-03-19 15:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383