Bug 2252126 - 1 MDSs report oversized cache keeps reappearing
Summary: 1 MDSs report oversized cache keeps reappearing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 4.2
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z4
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
Docs Contact: Disha Walvekar
URL:
Whiteboard:
Duplicates: 2248566
Depends On: 1995906
Blocks: 2141422 2257421 2261930
 
Reported: 2023-11-29 17:02 UTC by Patrick Donnelly
Modified: 2024-03-13 10:17 UTC
CC List: 10 users

Fixed In Version: ceph-17.2.6-192.el9cp
Doc Type: Bug Fix
Doc Text:
Cause: Standby-replay MDS daemons would not trim their caches. Consequence: MDS would run out of memory. Fix: MDS properly trims its cache when in standby-replay. Result: no OOM.
Clone Of:
Clones: 2257421
Environment:
Last Closed: 2024-02-08 18:12:27 UTC
Embargoed:


Attachments


Links
System                               ID               Last Updated
Ceph Project Bug Tracker             63675            2023-11-29 17:02:10 UTC
Red Hat Issue Tracker                RHCEPH-7975      2023-11-29 17:03:24 UTC
Red Hat Knowledge Base (Solution)    5920011          2024-01-24 20:36:20 UTC
Red Hat Product Errata               RHBA-2024:0747   2024-02-08 18:12:40 UTC

Description Patrick Donnelly 2023-11-29 17:02:10 UTC
This bug was initially created as a copy of Bug #1995906

I am copying this bug because: 6.X series backport



+++ This bug was initially created as a clone of Bug #1986175 +++

Description of problem (please be as detailed as possible and provide log snippets):
Customer is running into the following error:
 $ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

After applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}


[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}
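
For reference, a minimal sketch of applying the same KCS value cluster-wide instead of per daemon (this assumes a rook-ceph-tools toolbox pod in openshift-storage; the deployment name below is illustrative, not taken from this case):

    oc rsh -n openshift-storage deploy/rook-ceph-tools    # enter the toolbox pod
    ceph config set mds mds_cache_trim_threshold 262144   # 256K, the value recommended by the KCS
    ceph config get mds mds_cache_trim_threshold          # confirm what the MDS daemons will see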

Version of all relevant components (if applicable):
- ocs-operator.v4.6.6
- ceph versions
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11

Additional info:
Trying to exec into the cephfs-b pod (the standby MDS) and run a cache dump fails with the following:

# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
"error": "cache usage exceeds dump threshold"
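
When the full dump is refused, cache usage can still be sampled from the admin socket; a sketch using the same daemon as above (the two commands below are standard MDS admin socket queries, not output from this case):

    ceph daemon mds.ocs-storagecluster-cephfilesystem-b cache status
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_memory_limit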

Files from the case are located on supportshell under '/cases/02979903' (this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning) and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).

--- Additional comment from RHEL Program Management on 2021-07-26 20:30:49 UTC ---

This bug previously had no release flag set; the release flag 'ocs-4.8.0' has now been set to '?', so the bug is proposed to be fixed in the OCS 4.8.0 release. If this bug should be proposed for a different release, please manually remove the current proposed release flag and set a new one.

Note that the three acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since acks must be set against a release flag.

--- Additional comment from Mudit Agarwal on 2021-07-29 16:43:21 UTC ---

Because the KCS was suggested as part of https://bugzilla.redhat.com/show_bug.cgi?id=1944148, moving this to rook for initial triaging.

--- Additional comment from Travis Nielsen on 2021-08-02 15:47:15 UTC ---

Patrick, can someone from cephfs take a look at this health warning?

--- Additional comment from Patrick Donnelly on 2021-08-02 18:08:58 UTC ---

Can you verify the MDS cache size?

    ceph config dump

And also the state of the file system:

    ceph fs dump

And which MDS is reporting the warning:

    ceph health detail
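
(For convenience, a sketch that captures all three outputs to files from a toolbox/debug pod; the file names are illustrative:)

    ceph config dump    > /tmp/ceph_config_dump.txt
    ceph fs dump        > /tmp/ceph_fs_dump.txt
    ceph health detail  > /tmp/ceph_health_detail.txt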

--- Additional comment from  on 2021-08-02 21:04:21 UTC ---

Patrick,

ceph.txt (the requested output) has been attached. The cluster is reporting HEALTH_OK at the moment, so we can't see which MDS is reporting the warning.

--- Additional comment from  on 2021-08-02 21:04:53 UTC ---



--- Additional comment from Patrick Donnelly on 2021-08-02 21:08:00 UTC ---

If it reoccurs, please collect `ceph health detail`, `ceph fs dump`, and a perf dump of the mds `ceph daemon mds.<X> perf dump` (in a debug sidecar container).
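
A sketch of collecting that perf dump with the same oc exec pattern shown earlier in this bug (pod and daemon names follow the earlier examples; the output path is illustrative):

    oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- \
        ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump > /tmp/mds.b.perf.dump.txt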

--- Additional comment from  on 2021-08-10 23:03:39 UTC ---

Patrick, the issue has come back. Logs have been yanked and are on supportshell:

-rw-rwxrw-+ 1 yank yank 14783 Aug 10 21:02 0170-mds.b.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank  1479 Aug 10 21:02 0160-Ceph_fs_dump_(new).txt
-rw-rwxrw-+ 1 yank yank 15462 Aug 10 21:02 0150-mds.a.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank   222 Aug 10 21:02 0140-ceph_health_detail.txt

Comment 2 Patrick Donnelly 2023-12-18 18:42:06 UTC
*** Bug 2248566 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2024-02-08 18:12:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0747

