Bug 2257421 - [5.3 backport] 1 MDSs report oversized cache keeps reappearing
Summary: [5.3 backport] 1 MDSs report oversized cache keeps reappearing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 4.2
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z6
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On: 1995906 2252126
Blocks:
 
Reported: 2024-01-09 14:40 UTC by Bipin Kunal
Modified: 2024-04-18 08:27 UTC
CC List: 13 users

Fixed In Version: ceph-16.2.10-248.el8cp
Doc Type: Bug Fix
Doc Text:
.The standby-replay MDS daemons now trim their caches
Previously, the standby-replay MDS daemon would retain more metadata in its cache than required, causing it to report an oversized cache warning. As a result, a persistent “MDSs report oversized cache” warning appeared in the storage cluster whenever standby-replay MDS daemons were used. With this fix, the standby-replay MDS daemons trim their caches, keep cache usage below the configured limit, and no longer emit “MDSs report oversized cache” warnings.
Clone Of: 2252126
Environment:
Last Closed: 2024-02-08 16:49:26 UTC
Embargoed:


Attachments


Links
System                             ID              Last Updated
Ceph Project Bug Tracker           48673           2024-01-17 15:56:22 UTC
Red Hat Issue Tracker              RHCEPH-8148     2024-01-09 14:44:19 UTC
Red Hat Knowledge Base (Solution)  5920011         2024-01-24 20:36:05 UTC
Red Hat Product Errata             RHSA-2024:0745  2024-02-08 16:49:29 UTC

Description Bipin Kunal 2024-01-09 14:40:59 UTC
+++ This bug was initially created as a clone of Bug #2252126 +++

This bug was initially created as a copy of Bug #1995906

I am copying this bug because: 6.X series backport



+++ This bug was initially created as a clone of Bug #1986175 +++

Description of problem (please be as detailed as possible and provide log snippets):
Customer is running into the following error:
 $ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

After applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}


[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}
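
For reference, a minimal sketch of how that KCS setting can be applied and checked from the rook-ceph toolbox. This is a sketch under assumptions: the mon-config-database route shown here may differ from the exact steps in the KCS, and the pod name is a placeholder.

    # persist the KCS-suggested trim threshold (256K) for all MDS daemons
    ceph config set mds mds_cache_trim_threshold 262144

    # confirm the value stored in the mon config database
    ceph config get mds mds_cache_trim_threshold

    # per-daemon verification via the admin socket, as shown above
    oc rsh <rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-pod>
    ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold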

Version of all relevant components (if applicable):
- ocs-operator.v4.6.6
- ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11

Additional info:
Exec'ing into the cephfs-b pod (standby MDS) and running a cache dump fails with the following:
# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
    "error": "cache usage exceeds dump threshold"
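
That error is the MDS refusing to dump a cache larger than its dump threshold. A possible workaround sketch, assuming the threshold is governed by the mds_dump_cache_threshold option (confirm the exact option name against `config show` first; the 4 GiB value below is only an illustration, not taken from this case):

    # inside the mds-b pod: check that the option exists and see its current value
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b config show | grep dump_cache

    # temporarily raise the dump threshold (bytes) for this daemon only
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b config set mds_dump_cache_threshold 4294967296

    # retry the dump, then restore the original value afterwards
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache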

Files from the case are located on supportshell under '/cases/02979903' (this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning) and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).

--- Additional comment from RHEL Program Management on 2021-07-26 20:30:49 UTC ---

This bug previously had no release flag set; the release flag 'ocs-4.8.0' has now been set to '?', so the bug is being proposed for the OCS 4.8.0 release. If this bug should be proposed for a different release, please manually remove the current proposed release flag and set a new one.

Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Mudit Agarwal on 2021-07-29 16:43:21 UTC ---

Because the KCS was suggested as part of https://bugzilla.redhat.com/show_bug.cgi?id=1944148, moving this to rook for initial triaging.

--- Additional comment from Travis Nielsen on 2021-08-02 15:47:15 UTC ---

Patrick, can someone from cephfs take a look at this health warning?

--- Additional comment from Patrick Donnelly on 2021-08-02 18:08:58 UTC ---

Can you verify the MDS cache size?

    ceph config dump

And also the state of the file system:

    ceph fs dump

And which MDS is reporting the warning:

    ceph health detail
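
A minimal sketch of pulling just the cache-related pieces out of those outputs, run from the toolbox. The grep filter and the mds_cache_memory_limit check are only a convenience, not a substitute for the full dumps:

    # cache-related settings currently in the mon config database
    ceph config dump | grep -i mds_cache

    # effective cache memory limit for the MDS daemons
    ceph config get mds mds_cache_memory_limit

    # file system state and the daemon currently raising the warning
    ceph fs dump
    ceph health detail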

--- Additional comment from  on 2021-08-02 21:04:21 UTC ---

Patrick,

ceph.txt (the requested output) has been attached. The cluster is reporting HEALTH_OK at the moment, so we can't see which MDS is reporting the warning.

--- Additional comment from  on 2021-08-02 21:04:53 UTC ---



--- Additional comment from Patrick Donnelly on 2021-08-02 21:08:00 UTC ---

If it reoccurs, please collect `ceph health detail`, `ceph fs dump`, and a perf dump of the mds `ceph daemon mds.<X> perf dump` (in a debug sidecar container).
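
A small collection sketch along those lines, assuming the data is gathered by exec'ing into the MDS pods as earlier in this case (pod names are placeholders):

    # cluster-wide state, from the toolbox or any pod with an admin keyring
    ceph health detail > /tmp/ceph_health_detail.txt
    ceph fs dump       > /tmp/ceph_fs_dump.txt

    # per-daemon perf counters, one run per MDS pod
    oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-pod> -- \
        ceph daemon mds.ocs-storagecluster-cephfilesystem-a perf dump > /tmp/mds.a.perf.dump.txt
    oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- \
        ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump > /tmp/mds.b.perf.dump.txt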

--- Additional comment from  on 2021-08-10 23:03:39 UTC ---

Patrick, the issue has come back. Logs have been yanked and are on supportshell:

-rw-rwxrw-+ 1 yank yank 14783 Aug 10 21:02 0170-mds.b.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank  1479 Aug 10 21:02 0160-Ceph_fs_dump_(new).txt
-rw-rwxrw-+ 1 yank yank 15462 Aug 10 21:02 0150-mds.a.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank   222 Aug 10 21:02 0140-ceph_health_detail.txt

--- Additional comment from Patrick Donnelly on 2023-11-29 22:34:31 IST ---

https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/435

--- Additional comment from Patrick Donnelly on 2023-12-19 00:12:06 IST ---

Comment 17 errata-xmlrpc 2024-02-08 16:49:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745

