Bug 2256597

Summary: Ceph reports 'MDSs report oversized cache' warning, yet there is no observed alert for high MDS cache usage
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Prasad Desala <tdesala>
Component: odf-operator
Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA
QA Contact: Nagendra Reddy <nagreddy>
Severity: high
Priority: unspecified
Version: 4.15
CC: hnallurv, mcaldeir, muagarwa, odf-bz-bot, sapillai, tnielsen
Keywords: TestBlocker
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: verification-blocked
Fixed In Version: 4.15.0-110
Doc Type: No Doc Update
Last Closed: 2024-03-19 15:30:29 UTC
Type: Bug

Description Prasad Desala 2024-01-03 10:54:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
===================================================================================
Ceph reported a 'MDSs report oversized cache' warning on the cluster, but the expected high MDS cache usage alert was not observed.

sh-5.1$ ceph -s
  cluster:
    id:     5ce81388-45c5-4835-8d83-e1bf5cc310ba
    health: HEALTH_WARN
            1 MDSs report oversized cache
 
  services:
    mon: 3 daemons, quorum a,b,c (age 80m)
    mgr: a(active, since 4h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 21h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.50M objects, 12 GiB
    usage:   78 GiB used, 222 GiB / 300 GiB avail
    pgs:     169 active+clean
 
  io:
    client:   33 MiB/s rd, 331 KiB/s wr, 75 op/s rd, 24 op/s wr
 

Version of all relevant components (if applicable):
ODF: v4.15.0-102

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3-master, 3-worker OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with RWX access mode.
3) Attach multiple pods to those PVCs and start continuous file creation + metadata operations (see the illustrative manifest sketch below).
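
For illustration, a CephFS RWX PVC plus a metadata-heavy load pod could look roughly like the sketch below. The object names, size, image, and the ocs-storagecluster-cephfs storage class are assumptions based on a default ODF install, not the exact objects used in this reproduction.

# Illustrative reproduction sketch only; assumes the default ODF CephFS
# storage class 'ocs-storagecluster-cephfs'.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-rwx-pvc-1
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ocs-storagecluster-cephfs
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: mds-load-1
spec:
  containers:
  - name: load
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["/bin/sh", "-c"]
    # Continuous file creation + metadata operations (mkdir/touch/stat)
    # to keep growing the MDS cache.
    args:
    - |
      i=0
      while true; do
        mkdir -p /mnt/data/dir-$((i % 100))
        touch /mnt/data/dir-$((i % 100))/file-$i
        stat /mnt/data/dir-$((i % 100))/file-$i > /dev/null
        i=$((i + 1))
      done
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-rwx-pvc-1
EOF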

Actual results:
===============
After some time of continuous metadata operations from the application pods, the Ceph status goes into a warning state, reporting '1 MDSs report oversized cache'.

Expected results:
=================
We should get a high MDS cache usage alert before Ceph raises the 'MDSs report oversized cache' health warning.

Comment 4 Santosh Pillai 2024-01-03 15:00:48 UTC
The ceph-mds-mem-rss metric is not enabled by default. The change to enable this metric by default was delayed in Ceph due to a build issue (BZ https://bugzilla.redhat.com/show_bug.cgi?id=2256637).
I have applied a workaround to enable the metric for now, and am waiting for the QE tests to complete to see whether the alert gets triggered.
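
For anyone checking this on their own cluster, one way to confirm whether the metric is actually being scraped is to query it through the cluster monitoring stack. The commands below are an illustrative sketch, assuming the default openshift-monitoring Thanos querier route and a logged-in oc session; they are not the exact commands used here.

# Illustrative check, assuming the default openshift-monitoring stack.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# If the metric is enabled and scraped, this returns per-daemon MDS RSS samples.
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" --data-urlencode 'query=ceph_mds_mem_rss'
# The MDS cache usage alert compares this value against the MDS pod's memory
# request, conceptually along the lines of (hypothetical expression, not the
# actual ODF rule):
#   ceph_mds_mem_rss (in bytes) > 0.95 * <mds pod memory request in bytes>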

Comment 11 Nagendra Reddy 2024-01-29 14:36:27 UTC
Verified with the build noted in comment #12. We can see the alert when MDSCacheUsage exceeds 95%. Please refer to the attached screenshot for more info.


MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-a has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-a pod.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.
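
For reference, a firing alert like this can also be confirmed from the CLI. This is an illustrative check assuming the default openshift-monitoring Alertmanager route; the attached screenshot remains the actual verification artifact.

# Illustrative check only; not part of the recorded verification steps.
HOST=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://${HOST}/api/v2/alerts" \
  | jq '.[] | select(.labels.alertname | test("MDS")) | {alertname: .labels.alertname, state: .status.state}'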

Comment 12 Nagendra Reddy 2024-01-29 14:38:33 UTC
Adding to comment #11: the build below was used for verification.

ODF = 4.15.0-126
OCP = 4.15.0-0.nightly-2024-01-25-051548

Comment 17 errata-xmlrpc 2024-03-19 15:30:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383