+++ This bug was initially created as a clone of Bug #2252126 +++

+++ This bug was initially created as a copy of Bug #1995906 +++

I am copying this bug because: 6.X series backport

+++ This bug was initially created as a clone of Bug #1986175 +++

Description of problem (please be as detailed as possible and provide log snippets):

Customer is running into the following error:

$ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

Even after applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}

Version of all relevant components (if applicable):

- ocs-operator.v4.6.6
- ceph versions:

    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11
    }

Additional info:

Trying to exec into the cephfs-b pod (standby MDS) and run a cache dump fails with the following:

# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
"error": "cache usage exceeds dump threshold"

Files from the case are located on supportshell under '/cases/02979903'; this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).

--- Additional comment from RHEL Program Management on 2021-07-26 20:30:49 UTC ---

This bug previously had no release flag set, so the release flag 'ocs-4.8.0' has now been set to '?', proposing it for the OCS 4.8.0 release. If this bug should be proposed for a different release, please manually remove the current proposed release flag and set a new one. Note that the three acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since acks must be set against a release flag.

--- Additional comment from Mudit Agarwal on 2021-07-29 16:43:21 UTC ---

Because the KCS was suggested as part of https://bugzilla.redhat.com/show_bug.cgi?id=1944148, moving this to rook for initial triaging.

--- Additional comment from Travis Nielsen on 2021-08-02 15:47:15 UTC ---

Patrick, can someone from cephfs take a look at this health warning?

--- Additional comment from Patrick Donnelly on 2021-08-02 18:08:58 UTC ---

Can you verify the MDS cache size?

    ceph config dump

And also the state of the file system:

    ceph fs dump

And which MDS is reporting the warning:

    ceph health detail

--- Additional comment from on 2021-08-02 21:04:21 UTC ---

Patrick, ceph.txt (the requested output) has been attached. The cluster is reporting HEALTH_OK at the moment, so we can't see which MDS is reporting the warning.
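[Editor's note] The `config get` output above can be sanity-checked programmatically when triaging across many MDS daemons. A minimal sketch, assuming only that the daemon prints the value as a JSON string the way it does in the capture above (the helper name and sample string are ours, not from the case files):

```python
import json

# Output of `ceph daemon mds.<name> config get mds_cache_trim_threshold`,
# as captured in the report above.
raw = '{ "mds_cache_trim_threshold": "262144" }'

def trim_threshold_matches(config_json: str, expected: int) -> bool:
    """Return True if the daemon's trim threshold equals `expected`.

    Ceph prints the value as a quoted string, so cast before comparing.
    """
    value = int(json.loads(config_json)["mds_cache_trim_threshold"])
    return value == expected

# The KCS article suggests 256K, i.e. 256 * 1024 = 262144.
print(trim_threshold_matches(raw, 256 * 1024))  # → True
```

Both MDS daemons in the report already show the suggested 262144, which is why the recurring warning points at something beyond the KCS workaround.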
--- Additional comment from on 2021-08-02 21:04:53 UTC ---

--- Additional comment from Patrick Donnelly on 2021-08-02 21:08:00 UTC ---

If it reoccurs, please collect `ceph health detail`, `ceph fs dump`, and a perf dump of the mds `ceph daemon mds.<X> perf dump` (in a debug sidecar container).

--- Additional comment from on 2021-08-10 23:03:39 UTC ---

Patrick, the issue has come back. Logs have been yanked and are on supportshell:

-rw-rwxrw-+ 1 yank yank 14783 Aug 10 21:02 0170-mds.b.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank  1479 Aug 10 21:02 0160-Ceph_fs_dump_(new).txt
-rw-rwxrw-+ 1 yank yank 15462 Aug 10 21:02 0150-mds.a.perf.dump_(new).txt
-rw-rwxrw-+ 1 yank yank   222 Aug 10 21:02 0140-ceph_health_detail.txt

--- Additional comment from Patrick Donnelly on 2023-11-29 22:34:31 IST ---

https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/435

--- Additional comment from Patrick Donnelly on 2023-12-19 00:12:06 IST ---
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745
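[Editor's note] For anyone triaging a recurrence against the perf dumps requested earlier in this report, a minimal sketch of pulling cache-relevant counters out of the JSON. The counter names (`mds_mem.ino`, `mds_mem.dn`) and the sample values are our assumption based on Nautilus-era MDS perf dumps, not taken from the case files; verify them against your own dump:

```python
import json

# Trimmed example shaped like `ceph daemon mds.<name> perf dump` output.
# ASSUMPTION: section/counter names mirror a Nautilus-era MDS; the numbers
# are invented for illustration only.
sample = json.dumps({
    "mds_mem": {"ino": 500000, "dn": 520000},
})

def cache_counters(perf_dump_json: str) -> dict:
    """Extract the inode/dentry counts relevant to an oversized-cache triage."""
    mem = json.loads(perf_dump_json).get("mds_mem", {})
    return {key: mem.get(key) for key in ("ino", "dn")}

print(cache_counters(sample))  # → {'ino': 500000, 'dn': 520000}
```

Comparing these counts between the two dumps in the case (0150 and 0170) would show which daemon is holding the oversized cache.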