Bug 2283575

Summary: [5.3.x clone] Cache oversize warning gives 0 values for inodes in use by clients and stray files
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manny <mcaldeir>
Component: CephFS
Assignee: Rishabh Dave <ridave>
Status: CLOSED WONTFIX
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.3
CC: aivaraslaimikis, ceph-eng-bugs, cephqe-warriors, gfarnum, hyelloji, mcaldeir, msee, ngangadh, pdonnell, ridave, rsachere, vshankar
Target Milestone: ---
Flags: mcaldeir: needinfo-
       mcaldeir: needinfo-
       vshankar: needinfo? (ridave)
Target Release: 5.3z9
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2248169
Environment:
Last Closed: 2025-03-05 16:16:07 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2248169, 2283576
Bug Blocks:

Description Manny 2024-05-28 01:53:12 UTC
+++ This bug was initially created as a clone of Bug #2248169 +++

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Patrick Donnelly on 2023-11-06 17:01:57 UTC ---

For example:

sh-4.4$ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large (6GB/4GB); 0 inodes in use by clients, 0 stray files
sh-4.4$

The counters used are clearly not being updated anymore.
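
For anyone cross-checking a report like this, the daemon's own view of its cache can be compared against the health message. The commands below are a generic sketch rather than output from this case: the daemon name is copied from the warning above and should be adjusted to the affected MDS, and the exact output fields vary by release.

    # Query the MDS admin socket on the node (or pod) running the daemon.
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b cache status
    # Dump all performance counters, including the cache/memory related sections.
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump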

--- Additional comment from Matt See on 2023-11-06 17:21:31 UTC ---

Hey Patrick, thanks for opening up this BZ.

I'd like to eventually write a KCS for this issue. What information would you like me to gather from the customer in addition to what we've already collected? Should we fail over the MDS, delete the pods, or gather more information first?

Also, what all is wrong with the alert?

--- Additional comment from Patrick Donnelly on 2023-11-06 18:21:51 UTC ---

(In reply to Matt See from comment #2)
> Hey Patrick, thanks for opening up this BZ.
> 
> I'd like to eventually write a KCS for this issue, what information would
> you like me to gather from the customer in addition to what we've already
> collected? Should we fail over the mds, delete the pods, or gather more
> information first?

I'd like to have this BZ stay focused on the issue described in the title.  Let's move that discussion to a new BZ (please go ahead and create one) where documentation improvement requests can be made.

> Also, what all is wrong with the alert?

"0 inodes in use by clients, 0 stray files" these numbers are always 0.

--- Additional comment from Venky Shankar on 2023-11-07 05:52:04 UTC ---

(In reply to Patrick Donnelly from comment #1)
> For example:
> 
> sh-4.4$ ceph health detail
> HEALTH_WARN 1 MDSs report oversized cache
> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
>     mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> (6GB/4GB); 0 inodes in use by clients, 0 stray files
> sh-4.4$
> 
> The counters used are clearly not being updated anymore.

AFAIK, this happens in the standby-replay daemon. A BZ that I worked on earlier had the standby-replay daemon consuming a high amount of memory and thereby emitting this warning (especially when the active MDS did not have any oversized cache warnings).

The fix for the standby-replay trimming changes 

        https://github.com/ceph/ceph/pull/48483

aims to resolve the memory consumption.

For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the standby-replay daemon?
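
As background on the standby-replay angle, which daemon actually holds the standby-replay role can be confirmed from the file system status; a minimal sketch, assuming a single CephFS file system in the cluster:

    # Standby-replay daemons are listed with the state "standby-replay".
    ceph fs status
    # The FSMap dump also shows which daemons are up:standby-replay.
    ceph fs dump | grep -i standby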

--- Additional comment from Patrick Donnelly on 2023-11-09 13:52:47 UTC ---

(In reply to Venky Shankar from comment #4)
> (In reply to Patrick Donnelly from comment #1)
> > For example:
> > 
> > sh-4.4$ ceph health detail
> > HEALTH_WARN 1 MDSs report oversized cache
> > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> >     mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> > (6GB/4GB); 0 inodes in use by clients, 0 stray files
> > sh-4.4$
> > 
> > The counters used are clearly not being updated anymore.
> 
> AFAIK, this happens in the standby-replay daemon. A BZ that I worked on
> earlier had the standby-replay daemon consuming high amount of memory and
> thereby emitted this warning (esp. when the active MDS did not have any
> oversized cache warnings).
> 
> The fix for the standby-replay trimming changes 
> 
>         https://github.com/ceph/ceph/pull/48483
> 
> aims to resolve the memory consumption.

Ah, I missed that this was the standby-replay daemon. **sigh**
 
> For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the
> standby-replay daemon.

I would suggest we not output these counters for the SR daemon and instead indicate in the warning that this is an SR daemon's cache.

(I feel at this point this BZ is suitable for one of our newer engineers.)

--- Additional comment from Venky Shankar on 2023-11-13 16:00:50 UTC ---

Rishabh, please clone this for the RHCS 6 release too (z4).

--- Additional comment from Matt See on 2024-01-04 14:16:16 UTC ---

We had the customer change the standby MDS from standby-replay to plain standby back at the end of November. That cleared the issue for a while, but they just updated the case saying the warning is back and that the standby MDS has reverted to standby-replay. Is there a way to make this change persistent? Or a better workaround?
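
For reference, standbys keep being promoted back into standby-replay as long as the file system's allow_standby_replay flag is set, so a persistent form of this workaround would presumably be to clear that flag rather than reconfigure the daemon itself. A minimal sketch, where <fs_name> is a placeholder for the file system name:

    # Stop standbys from being promoted into standby-replay.
    ceph fs set <fs_name> allow_standby_replay false
    # The flags line of the file system dump lists allow_standby_replay while it is enabled.
    ceph fs get <fs_name> | grep -i standby_replay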

Comments #42 and #48 in Salesforce.

--- Additional comment from Venky Shankar on 2024-04-04 17:18:25 UTC ---

Moving out of z2.

--- Additional comment from Manny on 2024-04-20 17:58:14 UTC ---

Hello Rishabh,

Serious question:  If this has no clones, how do we ensure this is resolved in future RHCS releases also?

BR
Manny

--- Additional comment from Venky Shankar on 2024-04-29 08:05:06 UTC ---

(In reply to Manny from comment #9)
> Hello Rishabh,
> 
> Serious question:  If this has no clones, how do we ensure this is resolved
> in future RHCS releases also?

Tracker https://tracker.ceph.com/issues/63514 is linked to this BZ. When the changes are merged upstream, the CephFS team does a weekly backport scrub where the relevant backport trackers are discussed and appropriate BZs are created or cloned. Sometimes the BZs are cloned by the developer, which I think can be done in this case.

Rishabh, please do the needful.

--- Additional comment from Venky Shankar on 2024-05-27 05:36:47 UTC ---

Upstream changes are merged. Downstream backports are required for the 7.0 and 7.1 branches (which required cloning this for 7.1).

Comment 2 Rishabh Dave 2024-06-10 11:27:46 UTC
MR has been posted - https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/641