+++ This bug was initially created as a clone of Bug #2248169 +++

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Patrick Donnelly on 2023-11-06 17:01:57 UTC ---

For example:

sh-4.4$ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large (6GB/4GB); 0 inodes in use by clients, 0 stray files
sh-4.4$

The counters used are clearly not being updated anymore.

--- Additional comment from Matt See on 2023-11-06 17:21:31 UTC ---

Hey Patrick, thanks for opening up this BZ.

I'd like to eventually write a KCS for this issue. What information would you like me to gather from the customer in addition to what we've already collected? Should we fail over the MDS, delete the pods, or gather more information first?

Also, what all is wrong with the alert?

--- Additional comment from Patrick Donnelly on 2023-11-06 18:21:51 UTC ---

(In reply to Matt See from comment #2)
> Hey Patrick, thanks for opening up this BZ.
>
> I'd like to eventually write a KCS for this issue, what information would
> you like me to gather from the customer in addition to what we've already
> collected? Should we fail over the mds, delete the pods, or gather more
> information first?

I'd like to have this BZ stay focused on the issue described in the title. Let's move that discussion to a new BZ (please go ahead and create one) where documentation improvement requests can be made.

> Also, what all is wrong with the alert?

"0 inodes in use by clients, 0 stray files": these numbers are always 0.

--- Additional comment from Venky Shankar on 2023-11-07 05:52:04 UTC ---

(In reply to Patrick Donnelly from comment #1)
> For example:
>
> sh-4.4$ ceph health detail
> HEALTH_WARN 1 MDSs report oversized cache
> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> (6GB/4GB); 0 inodes in use by clients, 0 stray files
> sh-4.4$
>
> The counters used are clearly not being updated anymore.

AFAIK, this happens in the standby-replay daemon. A BZ that I worked on earlier had the standby-replay daemon consuming a high amount of memory and thereby emitting this warning (especially when the active MDS did not have any oversized cache warnings).

The fix for the standby-replay trimming changes

https://github.com/ceph/ceph/pull/48483

aims to resolve the memory consumption.

For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the standby-replay daemon?

--- Additional comment from Patrick Donnelly on 2023-11-09 13:52:47 UTC ---

(In reply to Venky Shankar from comment #4)
> (In reply to Patrick Donnelly from comment #1)
> > For example:
> >
> > sh-4.4$ ceph health detail
> > HEALTH_WARN 1 MDSs report oversized cache
> > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> > mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> > (6GB/4GB); 0 inodes in use by clients, 0 stray files
> > sh-4.4$
> >
> > The counters used are clearly not being updated anymore.
>
> AFAIK, this happens in the standby-replay daemon. A BZ that I worked on
> earlier had the standby-replay daemon consuming high amount of memory and
> thereby emitted this warning (esp. when the active MDS did not have any
> oversized cache warnings).
>
> The fix for the standby-replay trimming changes
>
> https://github.com/ceph/ceph/pull/48483
>
> aims to resolve the memory consumption.

Ah, I missed that this was the standby-replay daemon. **sigh**

> For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the
> standby-replay daemon.

I would suggest we not output these counters for the SR daemon and instead indicate in the warning that this is an SR daemon's cache. (I feel at this point this BZ is suitable for one of our newer engineers.)

--- Additional comment from Venky Shankar on 2023-11-13 16:00:50 UTC ---

Rishabh, please clone for the RHCS6 release too (z4).

--- Additional comment from Matt See on 2024-01-04 14:16:16 UTC ---

We had the customer change the standby MDS from standby-replay to just standby back at the end of November. That cleared the issue for a while, but they just updated saying the warning was back and that the standby MDS had reverted to standby-replay. Is there a way to make this change persistent? Or a better workaround?

Comments #42 and #48 in Salesforce.

--- Additional comment from Venky Shankar on 2024-04-04 17:18:25 UTC ---

Moving out of z2.

--- Additional comment from Manny on 2024-04-20 17:58:14 UTC ---

Hello Rishabh,

Serious question: if this has no clones, how do we ensure this is resolved in future RHCS releases also?

BR
Manny

--- Additional comment from Venky Shankar on 2024-04-29 08:05:06 UTC ---

(In reply to Manny from comment #9)
> Hello Rishabh,
>
> Serious question: If this has no clones, how do we ensure this is resolved
> in future RHCS releases also?

Tracker https://tracker.ceph.com/issues/63514 is linked to this BZ. When the changes are merged upstream, the CephFS team does a weekly backport scrub where the relevant backport trackers are discussed and the appropriate BZs are created or cloned. Sometimes the BZs are cloned by the developer, which I think can be done in this case. Rishabh, please do the needful.

--- Additional comment from Venky Shankar on 2024-05-27 05:36:47 UTC ---

Upstream changes are merged. This requires downstream backports to the 7.0 and 7.1 branches (which required cloning this BZ for 7.1).
MR has been posted - https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/641
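
Note on the persistence question in comment #7: on an ODF/Rook-managed cluster, a "ceph fs set ... allow_standby_replay false" issued from the toolbox only changes the Ceph-side flag, and an operator reconcile can flip it back to whatever the CephFilesystem CR declares, which would be consistent with the reversion the customer reported. A minimal sketch of both steps is below; the filesystem name is taken from the health output above, while the namespace, CR name, and whether the ODF StorageCluster operator in turn overwrites the CephFilesystem CR are assumptions that need to be verified in the customer's environment.

  # Ceph-side only: turns standby-replay off immediately, but may be reverted
  # by an operator reconcile (assumed cause of the reversion in comment #7).
  sh-4.4$ ceph fs set ocs-storagecluster-cephfilesystem allow_standby_replay false

  # Declarative: stop requesting a standby-replay daemon in the Rook
  # CephFilesystem CR (namespace and CR name are assumed here).
  $ oc -n openshift-storage patch cephfilesystem ocs-storagecluster-cephfilesystem \
      --type merge -p '{"spec":{"metadataServer":{"activeStandby":false}}}'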