Description of problem (please be as detailed as possible and provide log snippets):
===================================================================================
Ceph reported a 'MDSs report oversized cache' warning on the cluster, but the expected high MDS cache usage alert was not observed.

sh-5.1$ ceph -s
  cluster:
    id:     5ce81388-45c5-4835-8d83-e1bf5cc310ba
    health: HEALTH_WARN
            1 MDSs report oversized cache

  services:
    mon: 3 daemons, quorum a,b,c (age 80m)
    mgr: a(active, since 4h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 21h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.50M objects, 12 GiB
    usage:   78 GiB used, 222 GiB / 300 GiB avail
    pgs:     169 active+clean

  io:
    client: 33 MiB/s rd, 331 KiB/s wr, 75 op/s rd, 24 op/s wr

Version of all relevant components (if applicable):
ODF: v4.15.0-102

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
None that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3 master, 3 worker OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with the RWX access mode.
3) Attach multiple pods to those PVCs and start continuous file creation and metadata operations (a minimal sketch is shown below).

Actual results:
===============
After some time of continuous metadata operations from the application pods, the Ceph status goes to a warning state reporting "1 MDSs report oversized cache".

Expected results:
=================
We should get a high MDS cache usage alert before Ceph raises the MDS oversized cache health warning.
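A minimal sketch of steps 2-3 and of how the cache pressure can be observed. The storage class name (ocs-storagecluster-cephfs), namespace, PVC name, and size are the usual ODF defaults or illustrative values, not taken from this report; the ceph commands assume access to the Ceph CLI (for example via the toolbox pod).

# Create a CephFS-backed RWX PVC (repeat with different names for multiple PVCs)
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-rwx-pvc-1
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
EOF

# While the workload runs, watch the MDS cache target and the resulting health warning
ceph config get mds mds_cache_memory_limit   # configured MDS cache target, in bytes
ceph fs status                               # per-MDS rank state and dentry/inode/caps counts
ceph health detail                           # shows "MDSs report oversized cache" once the limit is exceeded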
ceph-mds-mem-rss is not enabled by default. The change to enable this metric by default was delayed in Ceph due to a build issue (BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2256637). I have applied a workaround to enable the metric for now, and I am waiting for the QE tests to complete to see whether the alert gets triggered.
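One way to confirm the metric is actually reaching the cluster monitoring stack after such a workaround is to query it through the OpenShift monitoring API. This is a sketch only: the Prometheus metric name ceph_mds_mem_rss is assumed here, and it requires a logged-in user with permission to query the thanos-querier route.

TOKEN=$(oc whoami -t)
THANOS=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# A non-empty result set means the MDS RSS metric is being scraped
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=ceph_mds_mem_rss"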
Verified with the kit below (details in the following comment). We can see the alert when the MDS cache usage exceeds 95%. Please refer to the attached screenshot for more info.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-a has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-a pod.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.
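For context on the alert text above, the 95% threshold is evaluated against the memory request of the MDS pods. A hedged way to inspect those requests (the app=rook-ceph-mds label selector and the openshift-storage namespace are the usual Rook/ODF defaults, treated here as assumptions):

oc -n openshift-storage get pods -l app=rook-ceph-mds \
  -o custom-columns='NAME:.metadata.name,MEM_REQUEST:.spec.containers[*].resources.requests.memory'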
Adding to comment #11, used the kit below for verification:

ODF: 4.15.0-126
OCP: 4.15.0-0.nightly-2024-01-25-051548
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383