Bug 2256597 - Ceph reports 'MDSs report oversized cache' warning, yet there is no observed alert for high MDS cache usage
Summary: Ceph reports 'MDSs report oversized cache' warning, yet there is no observed alert for high MDS cache usage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Santosh Pillai
QA Contact: Nagendra Reddy
URL:
Whiteboard: verification-blocked
Depends On:
Blocks:
 
Reported: 2024-01-03 10:54 UTC by Prasad Desala
Modified: 2024-03-19 15:30 UTC
6 users

Fixed In Version: 4.15.0-110
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:30:29 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2361 0 None open use correct units for MDSCacheUsageHigh alerts 2024-01-04 04:19:02 UTC
Github red-hat-storage ocs-operator pull 2365 0 None open Bug 2256597: [release-4.15] use correct units for MDSCacheUsageHigh alerts 2024-01-04 11:45:00 UTC
Red Hat Knowledge Base (Solution) 5920011 0 None None None 2024-01-24 20:35:44 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:30:32 UTC
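
The linked ocs-operator pull requests correct the units used in the MDSCacheUsageHigh alert expression. As a rough illustration of the arithmetic involved (a sketch only, not the exact rule from the pull requests), the Python snippet below assumes the MDS resident-set size is reported in kibibytes while the pod memory request is expressed in bytes:

# Illustrative only: the real alert is a PromQL rule shipped by ocs-operator.
# The units assumed here (RSS in KiB, request in bytes) are for this sketch.
def mds_cache_usage_high(rss_kib: float, memory_request_bytes: float,
                         threshold: float = 0.95) -> bool:
    """Return True when MDS resident memory exceeds `threshold` of the pod's
    memory request, after converting both values to the same unit (bytes)."""
    rss_bytes = rss_kib * 1024  # the unit conversion is what the fix is about
    return rss_bytes > threshold * memory_request_bytes

if __name__ == "__main__":
    # Example: ~3.9 GiB resident against a 4 GiB request -> alerts at the 95% threshold.
    print(mds_cache_usage_high(rss_kib=3.9 * 1024 * 1024,
                               memory_request_bytes=4 * 1024 ** 3))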

Description Prasad Desala 2024-01-03 10:54:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
===================================================================================
Ceph reported a 'MDSs report oversized cache' warning on the cluster, but the expected high MDS cache usage alert was not observed.

cephsh-5.1$ ceph -s
  cluster:
    id:     5ce81388-45c5-4835-8d83-e1bf5cc310ba
    health: HEALTH_WARN
            1 MDSs report oversized cache
 
  services:
    mon: 3 daemons, quorum a,b,c (age 80m)
    mgr: a(active, since 4h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 21h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.50M objects, 12 GiB
    usage:   78 GiB used, 222 GiB / 300 GiB avail
    pgs:     169 active+clean
 
  io:
    client:   33 MiB/s rd, 331 KiB/s wr, 75 op/s rd, 24 op/s wr
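
For reference, the oversized-cache condition can be cross-checked from the rook-ceph toolbox by comparing the configured MDS cache limit with the current health detail. A minimal Python sketch wrapping the standard ceph CLI (assumes the ceph binary and cluster access, e.g. inside the toolbox pod):

# Minimal sketch: run where the `ceph` CLI can reach the cluster (toolbox pod).
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI subcommand and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    # Configured MDS cache memory limit, in bytes.
    print("mds_cache_memory_limit:", ceph("config", "get", "mds", "mds_cache_memory_limit"))
    # Health detail lists the 'MDSs report oversized cache' warning while it is active.
    print(ceph("health", "detail"))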
 

Version of all relevant components (if applicable):
ODF: v4.15.0-102

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Not that I am aware of.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3-master, 3-worker OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with RWX access mode.
3) Attach multiple pods to those PVCs and start continuous file creation and metadata operations (see the sketch below).
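
A minimal sketch of the kind of load described in step 3, assuming it runs inside each application pod with the CephFS RWX volume mounted at an illustrative path such as /mnt/cephfs (the path and loop sizes are arbitrary):

# Continuous file creation plus metadata churn against a CephFS mount.
# MOUNT and the batch size are illustrative assumptions.
import itertools
import os

MOUNT = "/mnt/cephfs"

def metadata_churn(batch: int) -> None:
    """Create, stat, rename and remove many small files to drive up MDS cache usage."""
    workdir = os.path.join(MOUNT, f"load-{os.getpid()}-{batch}")
    os.makedirs(workdir, exist_ok=True)
    for i in range(10_000):
        path = os.path.join(workdir, f"file-{i}")
        with open(path, "w") as f:
            f.write("x")
        os.stat(path)                  # metadata read
        os.rename(path, path + ".r")   # metadata update
    for name in os.listdir(workdir):   # directory listing keeps dentries hot
        os.remove(os.path.join(workdir, name))
    os.rmdir(workdir)

if __name__ == "__main__":
    for batch in itertools.count():
        metadata_churn(batch)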

Actual results:
===============
After some time of continuous metadata operations from the application pods, the Ceph status goes into a warning state, reporting '1 MDSs report oversized cache'.

Expected results:
=================
We should get a high MDS cache usage alert before Ceph raises the 'MDSs report oversized cache' health warning.

Comment 4 Santosh Pillai 2024-01-03 15:00:48 UTC
The ceph_mds_mem_rss metric is not enabled by default. The change to enable this metric by default was delayed in Ceph due to a build issue (BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2256637).
I've used a workaround to enable the metric for now, and am waiting for the QE tests to complete to see if the alert gets triggered.
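
One quick way to check whether the metric is actually being scraped (and whether the alert has fired) is to query the Prometheus API; a minimal sketch, assuming a reachable Prometheus endpoint and bearer token supplied via environment variables (PROM_URL and PROM_TOKEN are hypothetical names):

# Minimal sketch: check for ceph_mds_mem_rss samples and for the firing alert.
# PROM_URL / PROM_TOKEN are hypothetical; on OpenShift they would come from the
# monitoring route and a service-account token.
import os
import requests

PROM_URL = os.environ["PROM_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['PROM_TOKEN']}"}

def query(expr: str) -> list:
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": expr}, headers=HEADERS, verify=False)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # An empty result here would explain why no alert can fire.
    print("ceph_mds_mem_rss samples:", len(query("ceph_mds_mem_rss")))
    # Check whether the MDS cache usage alert is currently firing.
    print(query('ALERTS{alertname="MDSCacheUsageHigh", alertstate="firing"}'))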

Comment 11 Nagendra Reddy 2024-01-29 14:36:27 UTC
Verified with the build listed in comment #12 below. We can see the alert when MDS cache usage exceeds 95% of the requested value. Please refer to the attached screenshot for more info.


MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-a has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-a pod.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.

Comment 12 Nagendra Reddy 2024-01-29 14:38:33 UTC
Adding to comment #11, the builds below were used for verification.

ODF: 4.15.0-126
OCP: 4.15.0-0.nightly-2024-01-25-051548

Comment 17 errata-xmlrpc 2024-03-19 15:30:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

