Bug 1998166 - 1 MDSs report oversized cache keeps reappearing
Summary: 1 MDSs report oversized cache keeps reappearing
Keywords:
Status: CLOSED DUPLICATE of bug 1995906
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 4.2
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-26 13:54 UTC by Patrick Donnelly
Modified: 2021-08-26 13:59 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-26 13:58:46 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 48673 0 None None None 2021-08-26 13:54:21 UTC
Red Hat Issue Tracker RHCEPH-962 0 None None None 2021-08-26 13:56:02 UTC

Description Patrick Donnelly 2021-08-26 13:54:22 UTC
This bug was initially created as a copy of Bug #1986175

I am copying this bug because: 

standby-replay bug with memory usage


Description of problem (please be as detailed as possible and provide log
snippets):
Customer is running into the following error:
 $ cat 0070-ceph_status.txt
  cluster:
    id:     676bfd6a-a4db-4545-a8b7-fcb3babc1c89
    health: HEALTH_WARN
            1 MDSs report oversized cache

Even after applying the steps described in https://access.redhat.com/solutions/5920011 (mainly setting mds_cache_trim_threshold to 256K), the problem keeps reappearing:

[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c45469c8gzzcp
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}


[root@ocpbspapp1 ~]# oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-f6d85c4d9trh9
sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_trim_threshold
{
    "mds_cache_trim_threshold": "262144"
}
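For reference, the KB workaround can be applied cluster-wide roughly as below. This is a sketch only: on an OCS/Rook deployment the setting may instead need to go through the rook-config-override ConfigMap rather than the mon config store, and the daemon name is taken from the transcript above.

```shell
# Sketch: persistently set the MDS cache trim threshold to 256K (262144)
# via the centralized config store (available on Nautilus and later).
ceph config set mds mds_cache_trim_threshold 262144

# Verify the running daemon picked up the new value.
ceph daemon mds.ocs-storagecluster-cephfilesystem-a config get mds_cache_trim_threshold
```

A value set with `ceph daemon ... config set` applies only to the running process and is lost on restart, which is one way the warning could "keep reappearing" after a pod restart.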

Version of all relevant components (if applicable):
- ocs-operator.v4.6.6
- ceph versions
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 11
    }
Additional info:
Trying to exec into the cephfs-b pod (standby MDS) and running a dump cache fails with the following:
# oc exec -n openshift-storage <rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-pod> -- ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
"error": "cache usage exceeds dump threshold"

Files from the case are located on supportshell under '/cases/02979903' (this includes a recent dump (0060-mds-report.tar.gz) from earlier this morning) and an OCS must-gather (0050-must-gather.local.6519639462001087910.tar.gz).
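The dump failure above is the MDS refusing to dump because the cache is larger than its dump threshold (mds_dump_cache_threshold upstream). Assuming that option behaves here as it does in upstream Ceph, the limit can be raised temporarily for the daemon before retrying; this is an illustrative sketch, not verified against this exact build:

```shell
# Sketch: raise the per-daemon dump threshold (here to 2 GiB, an arbitrary
# value large enough for an oversized cache), then retry the dump.
ceph daemon mds.ocs-storagecluster-cephfilesystem-b config set mds_dump_cache_threshold 2147483648
ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump cache > /tmp/mds.b.dump.cache
```

Note that dumping a multi-gigabyte cache can itself consume significant memory and time on the MDS, so this is best done only while gathering diagnostics.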

Comment 1 Patrick Donnelly 2021-08-26 13:58:46 UTC

*** This bug has been marked as a duplicate of bug 1995906 ***

