Bug 2252126

Summary: 1 MDSs report oversized cache keeps reappearing
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Patrick Donnelly <pdonnell>
Component: CephFS Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA QA Contact: Hemanth Kumar <hyelloji>
Severity: high Docs Contact: Disha Walvekar <dwalveka>
Priority: unspecified    
Version: 4.2 CC: bkunal, ceph-eng-bugs, cephqe-warriors, dwalveka, gfarnum, kbg, mcaldeir, ngangadh, tserlin, vshankar
Target Milestone: ---   
Target Release: 6.1z4   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: ceph-17.2.6-192.el9cp Doc Type: Bug Fix
Doc Text:
Cause: Standby-replay MDS daemons would not trim their caches. Consequence: The MDS would run out of memory. Fix: The MDS now properly trims its cache when in standby-replay. Result: No OOM. (See the verification sketch after this header.)
Story Points: ---
Clone Of:
: 2257421 (view as bug list) Environment:
Last Closed: 2024-02-08 18:12:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1995906    
Bug Blocks: 2141422, 2257421, 2261930    
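
For reference, a minimal verification sketch for the fix described in the Doc Text above. It is not taken from the bug record; the daemon name is a placeholder, and it assumes the "cache status" admin socket command and the listed config option are available in the installed release:

    # identify the standby-replay daemon for the file system
    ceph fs status
    # cache memory actually in use by that daemon (run against its admin socket)
    ceph daemon mds.<standby-replay-id> cache status
    # the limit the cache should stay near once trimming works
    ceph daemon mds.<standby-replay-id> config get mds_cache_memory_limit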

Description Patrick Donnelly 2023-11-29 17:02:10 UTC
This bug was initially created as a copy of Bug #1995906

I am copying this bug because: 6.X series backport



+++ This bug was initially created as a clone of Bug #1986175 +++

Description of problem (please be as detailed as possible and provide log
snippets):
Customer is running into the following error:
 $ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

Even after applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}


[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}
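
For reference only (not taken from the case files): rather than setting the option per daemon, the KCS value is typically applied cluster-wide through the mon config store, assuming the commands are run from a pod with client.admin access such as the Rook toolbox:

    ceph config set mds mds_cache_trim_threshold 262144
    ceph config get mds mds_cache_trim_threshold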

Version of all relevant components (if applicable):
- ocs-operator.v4.6.6
- ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11

Additional info:
Exec'ing into the cephfs-b pod (the standby MDS) and running a cache dump fails with the following:
# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
  "error": "cache usage exceeds dump threshold"
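
For what it's worth, a lighter-weight check that usually still succeeds when "dump cache" is refused (assuming the "cache status" admin socket command exists in this Nautilus build):

    oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b cache status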

Files from the case are located on supportshell under '/cases/02979903' (this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning) and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).

--- Additional comment from RHEL Program Management on 2021-07-26 20:30:49 UTC ---

This bug previously had no release flag set; the release flag 'ocs-4.8.0' has now been set to '?', so the bug is proposed to be fixed in the OCS 4.8.0 release. If this bug should be proposed for a different release, please manually remove the current proposed release flag and set a new one.

Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Mudit Agarwal on 2021-07-29 16:43:21 UTC ---

Because the KCS was suggested as part of https://bugzilla.redhat.com/show_bug.cgi?id=1944148, moving this to rook for initial triaging.

--- Additional comment from Travis Nielsen on 2021-08-02 15:47:15 UTC ---

Patrick, can someone from cephfs take a look at this health warning?

--- Additional comment from Patrick Donnelly on 2021-08-02 18:08:58 UTC ---

Can you verify the MDS cache size?

    ceph config dump

And also the state of the file system:

    ceph fs dump

And which MDS is reporting the warning:

    ceph health detail
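
For context, and assuming the defaults have not been overridden here: the "oversized cache" warning is raised when MDS cache usage exceeds mds_cache_memory_limit by the mds_health_cache_threshold factor (1.5 by default), so those two values are also worth checking:

    ceph config get mds mds_cache_memory_limit
    ceph config get mds mds_health_cache_threshold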

--- Additional comment from  on 2021-08-02 21:04:21 UTC ---

Patrick,

ceph.txt (the requested output) has been attached. The cluster is reporting HEALTH_OK at the moment, so we can't see which MDS is reporting the warning.

--- Additional comment from  on 2021-08-02 21:04:53 UTC ---



--- Additional comment from Patrick Donnelly on 2021-08-02 21:08:00 UTC ---

If it reoccurs, please collect `ceph health detail`, `ceph fs dump`, and a perf dump of the mds `ceph daemon mds.<X> perf dump` (in a debug sidecar container).
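
A sketch of collecting those from the Rook pods, using the pod naming already shown in this bug (namespace and output paths are illustrative; the cluster-wide commands assume a pod with client.admin credentials, e.g. the rook-ceph toolbox):

    oc exec -n openshift-storage <rook-ceph-tools-pod> -- ceph health detail > /tmp/ceph_health_detail.txt
    oc exec -n openshift-storage <rook-ceph-tools-pod> -- ceph fs dump > /tmp/ceph_fs_dump.txt
    oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump > /tmp/mds.b.perf.dump.txt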

--- Additional comment from  on 2021-08-10 23:03:39 UTC ---

Patrick, the issue has come back. Logs have been yanked and are on supportshell:

-rw-rwxrw-+ 1 yank yank 14783 Aug 10 21:02 0170-mds.b.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank  1479 Aug 10 21:02 0160-Ceph_fs_dump_(new).txt
-rw-rwxrw-+ 1 yank yank 15462 Aug 10 21:02 0150-mds.a.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank   222 Aug 10 21:02 0140-ceph_health_detail.txt

Comment 2 Patrick Donnelly 2023-12-18 18:42:06 UTC
*** Bug 2248566 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2024-02-08 18:12:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0747