Bug 1998166 - 1 MDSs report oversized cache keeps reappearing
Summary: 1 MDSs report oversized cache keeps reappearing
Keywords:
Status: CLOSED DUPLICATE of bug 1995906
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 4.2
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-26 13:54 UTC by Patrick Donnelly
Modified: 2021-08-26 13:59 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-26 13:58:46 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 48673 0 None None None 2021-08-26 13:54:21 UTC
Red Hat Issue Tracker RHCEPH-962 0 None None None 2021-08-26 13:56:02 UTC

Description Patrick Donnelly 2021-08-26 13:54:22 UTC
This bug was initially created as a copy of Bug #1986175

I am copying this bug because: 

standby-replay bug with memory usage


Description of problem (please be as detailed as possible and provide log
snippets):
Customer is running into the following error:
 $ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

Even after applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}


[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}
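For reference, the KB workaround can be applied cluster-wide roughly as below. This is a sketch only: on an OCS/Rook deployment the setting may instead need to go through the rook-config-override ConfigMap rather than the mon config store, and the daemon name is taken from the transcript above.

```shell
# Sketch: persistently set the MDS cache trim threshold to 256K (262144)
# via the centralized config store (available on Nautilus and later).
ceph config set mds mds_cache_trim_threshold 262144

# Verify the running daemon picked up the new value.
ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
```

A value set with `ceph daemon ... config set` applies only to the running process and is lost on restart, which is one way the warning could "keep reappearing" after a pod restart.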

Version of all relevant components (if applicable):
- ocs-operator.v4.6.6
- ceph versions
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11
    }
Additional info:
Trying to exec into the cephfs-b pod (standby MDS) and running a dump cache fails with the following:
# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
"error": "cache usage exceeds dump threshold"

Files from the case are located on supportshell under '/cases/02979903' (this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning) and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).
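The dump failure above is the MDS refusing to dump because the cache is larger than its dump threshold (mds_dump_cache_threshold upstream). Assuming that option behaves here as it does in upstream Ceph, the limit can be raised temporarily for the daemon before retrying; this is an illustrative sketch, not verified against this exact build:

```shell
# Sketch: raise the per-daemon dump threshold (here to 2 GiB, an arbitrary
# value large enough for an oversized cache), then retry the dump.
ceph daemon mds.ocs-storagecluster-cephfilesystem-b config set mds_dump_cache_threshold 2147483648
ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
```

Note that dumping a multi-gigabyte cache can itself consume significant memory and time on the MDS, so this is best done only while gathering diagnostics.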

Comment 1 Patrick Donnelly 2021-08-26 13:58:46 UTC

*** This bug has been marked as a duplicate of bug 1995906 ***

