Description of problem (please be as detailed as possible and provide log snippets):
===================================================================================
Ceph reported a 'MDSs report oversized cache' warning on the cluster, but the expected high MDS cache usage alert was not observed.

sh-5.1$ ceph -s
  cluster:
    id:     5ce81388-45c5-4835-8d83-e1bf5cc310ba
    health: HEALTH_WARN
            1 MDSs report oversized cache

  services:
    mon: 3 daemons, quorum a,b,c (age 80m)
    mgr: a(active, since 4h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 21h), 3 in (since 21h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.50M objects, 12 GiB
    usage:   78 GiB used, 222 GiB / 300 GiB avail
    pgs:     169 active+clean

  io:
    client: 33 MiB/s rd, 331 KiB/s wr, 75 op/s rd, 24 op/s wr

Version of all relevant components (if applicable):
ODF: v4.15.0-102

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
None that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3 master, 3 worker OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with the RWX access mode.
3) Attach multiple pods to those PVCs and start continuous file creation and metadata operations (a minimal sketch is shown below).

Actual results:
===============
After some time of continuous metadata operations from the application pods, the Ceph status goes to a warning state reporting "1 MDSs report oversized cache".

Expected results:
=================
We should get a high MDS cache usage alert before Ceph raises the MDS oversized cache health warning.
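A minimal sketch of steps 2-3 and of how the cache pressure can be observed. The storage class name (ocs-storagecluster-cephfs), namespace, PVC name, and size are the usual ODF defaults or illustrative values, not taken from this report; the ceph commands assume access to the Ceph CLI (for example via the toolbox pod).

# Create a CephFS-backed RWX PVC (repeat with different names for multiple PVCs)
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-rwx-pvc-1
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
EOF

# While the workload runs, watch the MDS cache target and the resulting health warning
ceph config get mds mds_cache_memory_limit   # configured MDS cache target, in bytes
ceph fs status                               # per-MDS rank state and dentry/inode/caps counts
ceph health detail                           # shows "MDSs report oversized cache" once the limit is exceeded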
ceph-mds-mem-rss is not enabled by default. The change to enable this metric by default was delayed in Ceph due to a build issue (BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2256637). I have applied a workaround to enable the metric for now, and I am waiting for the QE tests to complete to see whether the alert gets triggered.
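One way to confirm the metric is actually reaching the cluster monitoring stack after such a workaround is to query it through the OpenShift monitoring API. This is a sketch only: the Prometheus metric name ceph_mds_mem_rss is assumed here, and it requires a logged-in user with permission to query the thanos-querier route.

TOKEN=$(oc whoami -t)
THANOS=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# A non-empty result set means the MDS RSS metric is being scraped
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=ceph_mds_mem_rss"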
Verified with the kit below (details in the following comment). We can see the alert when the MDS cache usage exceeds 95%. Please refer to the attached screenshot for more info.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-a has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-a pod.

MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.
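For context on the alert text above, the 95% threshold is evaluated against the memory request of the MDS pods. A hedged way to inspect those requests (the app=rook-ceph-mds label selector and the openshift-storage namespace are the usual Rook/ODF defaults, treated here as assumptions):

oc -n openshift-storage get pods -l app=rook-ceph-mds \
  -o custom-columns='NAME:.metadata.name,MEM_REQUEST:.spec.containers[*].resources.requests.memory'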
Adding to comment #11, used the kit below for verification:

ODF: 4.15.0-126
OCP: 4.15.0-0.nightly-2024-01-25-051548
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383