This feature has been in RHCS 5 for a while but has not been documented or advised to be turned on at the CSI level. Let's do that for the next major release? Documentation is in-flight here: https://github.com/ceph/ceph/pull/43896 Along with this change, we can increase max_mds on some or all file systems via Rook.
Also a little explanation for why we should want this:

Currently we're limiting our Ceph file systems to a single active MDS. For some customers, this presents a bottleneck for metadata I/O. Using a single rank also increases the amount of metadata that must be cached when there are many clients, which can increase failover times. We've already seen one instance where the sole active MDS had so much metadata to load into cache that it would OOM [1]. (Note: this is fixed in RHCS 5 with a new configuration option, but it could have been avoided by using more MDS ranks.)

Setting max_mds=2 and using two ranks is not the only change required, however. The default automatic balancer is known to be prone to poor behavior, which is why we have pinning policies [2] to control how metadata/subtrees are distributed. The "distributed" policy makes the most sense for CSI, as it automatically stripes the subvolumes (PVs) across multiple MDS ranks. It will always result in a net improvement over a single-rank file system, with minimal additional technical risk.

[1] bz2020767
[2] https://docs.ceph.com/en/pacific/cephfs/multimds/#setting-subtree-partitioning-policies
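As a rough sketch of the two pieces described above (the file system name `myfs` and the mounted group path are placeholders, and these commands need a live cluster):

```shell
# Allow two active MDS ranks on the file system ("myfs" is a placeholder).
ceph fs set myfs max_mds 2

# Apply the "distributed" ephemeral pinning policy to the CSI subvolume
# group directory so subtrees under it are striped across active ranks
# (mount point path is an assumption).
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi
```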
Travis, `max_mds` seems like an option that Rook needs to set when creating the CephFilesystem. Is that something that is done already, or has that been requested as a feature before?

In case Rook does not have the `max_mds` option yet, this RFE should be split into at least two upstream issues:

1. rook: add the `max_mds` CephFS option
2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when creating a new volume

ODF-4.10 planning has concluded, so this will be considered for ODF-4.11.
Yes, Rook currently sets max_mds to the desired number of active MDS daemons, which by default (and with OCS) should be 1. See https://github.com/rook/rook/blob/990d92790b58d455cd28bf9773685b3540ff5bf0/pkg/daemon/ceph/client/filesystem.go#L179-L182
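For reference, Rook derives max_mds from the active MDS count in the CephFilesystem CR, so raising it is a CR change rather than a direct `ceph fs set`. A hedged sketch (the resource name `myfs` and namespace `rook-ceph` are assumptions; requires a running Rook cluster):

```shell
# Patch the CephFilesystem CR so Rook reconciles max_mds up to 2.
# "myfs" and "rook-ceph" are placeholder names for this example.
kubectl -n rook-ceph patch cephfilesystem myfs --type merge \
  -p '{"spec":{"metadataServer":{"activeCount":2}}}'
```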
(In reply to Niels de Vos from comment #4)
> 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a new volume

Correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`
(In reply to Patrick Donnelly from comment #6)
> (In reply to Niels de Vos from comment #4)
> > 2. ceph-csi: call `setfattr -n ceph.dir.pin -v 2` (I think) when create a new volume
>
> correction: `ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1`

Thanks. This should be doable from the Ceph CSI side.
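The corrected command from comment #6 can be checked after the fact by reading the pin attribute on the group directory. A sketch (the mount point path is an assumption; file system and group names are taken from the comment):

```shell
# Pin the "csi" subvolume group with the distributed policy (1 = enabled).
ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1

# Verify by reading the extended attribute on the group directory
# (mount point path is an assumption).
getfattr -n ceph.dir.pin.distributed /mnt/cephfs/volumes/csi
```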
https://github.com/ceph/ceph-csi/issues/2637 covers this in Ceph CSI.
Created an epic for this. https://issues.redhat.com/browse/RHSTOR-3270
> hchiramm Could you please prioritize this?

The support has to be added in go-ceph and then consumed in Ceph CSI; we have trackers in the upstream repos for the same. For the downstream side of things, which this BZ covers, this is not part of ODF 4.13.
The happy path of the feature (internal mode cluster) has passed in 4.15.0-99. Will update on external mode.
Manual testing is complete (external and regular modes), and an automated test has been added: https://github.com/red-hat-storage/ocs-ci/pull/9369.
tests/functional/storageclass/test_csi_subvolume_group_property.py - this is the new automated scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383