+++ This bug was initially created as a clone of Bug #2248169 +++

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Patrick Donnelly on 2023-11-06 17:01:57 UTC ---

For example:

sh-4.4$ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large (6GB/4GB); 0 inodes in use by clients, 0 stray files
sh-4.4$

The counters used are clearly not being updated anymore.

--- Additional comment from Matt See on 2023-11-06 17:21:31 UTC ---

Hey Patrick, thanks for opening up this BZ.

I'd like to eventually write a KCS for this issue. What information would you like me to gather from the customer in addition to what we've already collected? Should we fail over the MDS, delete the pods, or gather more information first?

Also, what all is wrong with the alert?

--- Additional comment from Patrick Donnelly on 2023-11-06 18:21:51 UTC ---

(In reply to Matt See from comment #2)
> Hey Patrick, thanks for opening up this BZ.
>
> I'd like to eventually write a KCS for this issue, what information would
> you like me to gather from the customer in addition to what we've already
> collected? Should we fail over the mds, delete the pods, or gather more
> information first?

I'd like to have this BZ stay focused on the issue described in the title. Let's move that discussion to a new BZ (please go ahead and create one) where documentation improvement requests can be made.

> Also, what all is wrong with the alert?

"0 inodes in use by clients, 0 stray files": these numbers are always 0.

--- Additional comment from Venky Shankar on 2023-11-07 05:52:04 UTC ---

(In reply to Patrick Donnelly from comment #1)
> For example:
>
> sh-4.4$ ceph health detail
> HEALTH_WARN 1 MDSs report oversized cache
> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> (6GB/4GB); 0 inodes in use by clients, 0 stray files
> sh-4.4$
>
> The counters used are clearly not being updated anymore.

AFAIK, this happens in the standby-replay daemon. A BZ that I worked on earlier had the standby-replay daemon consuming a high amount of memory and thereby emitting this warning (especially when the active MDS did not have any oversized cache warnings).

The fix for the standby-replay trimming changes

https://github.com/ceph/ceph/pull/48483

aims to resolve the memory consumption.

For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the standby-replay daemon?

--- Additional comment from Patrick Donnelly on 2023-11-09 13:52:47 UTC ---

(In reply to Venky Shankar from comment #4)
> (In reply to Patrick Donnelly from comment #1)
> > For example:
> >
> > sh-4.4$ ceph health detail
> > HEALTH_WARN 1 MDSs report oversized cache
> > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> > mds.ocs-storagecluster-cephfilesystem-b(mds.0): MDS cache is too large
> > (6GB/4GB); 0 inodes in use by clients, 0 stray files
> > sh-4.4$
> >
> > The counters used are clearly not being updated anymore.
>
> AFAIK, this happens in the standby-replay daemon. A BZ that I worked on
> earlier had the standby-replay daemon consuming high amount of memory and
> thereby emitted this warning (esp. when the active MDS did not have any
> oversized cache warnings).
>
> The fix for the standby-replay trimming changes
>
> https://github.com/ceph/ceph/pull/48483
>
> aims to resolve the memory consumption.

Ah, I missed that this was the standby-replay daemon. **sigh**

> For this BZ, do you propose to fix the counters (0 inodes, 0 stray) in the
> standby-replay daemon.

I would suggest we not output these counters for the SR daemon and instead indicate in the warning that this is an SR daemon's cache. (I feel at this point this BZ is suitable for one of our newer engineers.)

--- Additional comment from Venky Shankar on 2023-11-13 16:00:50 UTC ---

Rishabh, please clone for the RHCS6 release too (z4).

--- Additional comment from Matt See on 2024-01-04 14:16:16 UTC ---

We had the customer change the standby MDS from standby-replay to just standby back at the end of November. That cleared the issue for a while, but they just updated saying the warning was back and that the standby MDS had reverted to standby-replay. Is there a way to make this change persistent? Or a better workaround?

Comments #42 and #48 in Salesforce.

--- Additional comment from Venky Shankar on 2024-04-04 17:18:25 UTC ---

Moving out of z2.

--- Additional comment from Manny on 2024-04-20 17:58:14 UTC ---

Hello Rishabh,

Serious question: if this has no clones, how do we ensure this is resolved in future RHCS releases also?

BR
Manny

--- Additional comment from Venky Shankar on 2024-04-29 08:05:06 UTC ---

(In reply to Manny from comment #9)
> Hello Rishabh,
>
> Serious question: If this has no clones, how do we ensure this is resolved
> in future RHCS releases also?

Tracker https://tracker.ceph.com/issues/63514 is linked to this BZ. When the changes are merged upstream, the CephFS team does a weekly backport scrub where the relevant backport trackers are discussed and the appropriate BZs are created or cloned. Sometimes the BZs are cloned by the developer, which I think can be done in this case. Rishabh, please do the needful.

--- Additional comment from Venky Shankar on 2024-05-27 05:36:47 UTC ---

Upstream changes are merged. This requires downstream backports to the 7.0 and 7.1 branches (which required cloning this BZ for 7.1).
MR has been posted - https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/641
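
Note on the persistence question in comment #7: on an ODF/Rook-managed cluster, a "ceph fs set ... allow_standby_replay false" issued from the toolbox only changes the Ceph-side flag, and an operator reconcile can flip it back to whatever the CephFilesystem CR declares, which would be consistent with the reversion the customer reported. A minimal sketch of both steps is below; the filesystem name is taken from the health output above, while the namespace, CR name, and whether the ODF StorageCluster operator in turn overwrites the CephFilesystem CR are assumptions that need to be verified in the customer's environment.

  # Ceph-side only: turns standby-replay off immediately, but may be reverted
  # by an operator reconcile (assumed cause of the reversion in comment #7).
  sh-4.4$ ceph fs set ocs-storagecluster-cephfilesystem allow_standby_replay false

  # Declarative: stop requesting a standby-replay daemon in the Rook
  # CephFilesystem CR (namespace and CR name are assumed here).
  $ oc -n openshift-storage patch cephfilesystem ocs-storagecluster-cephfilesystem \
      --type merge -p '{"spec":{"metadataServer":{"activeStandby":false}}}'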