1944148 – [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.7
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	OCS 4.7.1
Assignee:	Sébastien Han
QA Contact:	Petr Balogh
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1938134 1951348

Reported:	2021-03-29 12:10 UTC by Geo Jose
Modified:	2024-10-01 17:48 UTC
CC List:	17 users
Fixed In Version:	4.7.1-403.ci
Doc Type:	Bug Fix
Doc Text:	Previously, Rook did not apply `mds_cache_memory_limit` on upgrade. As a result, clusters originally deployed with OpenShift Container Storage 4.2 that did not have that option applied were not updated with the correct value, which is typically half the size of the pod's memory limit. Therefore, MDSs in standby-replay could report an oversized cache.
Clone Of:
Clones:	1951348
Environment:
Last Closed:	2021-06-15 16:50:37 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift rook pull 223	None	open	Bug 1944148: ceph: always apply config flags for mds and rgw	2021-04-21 08:16:48 UTC
Github	red-hat-storage ocs-ci pull 4453	None	closed	Testcase to check mds cache memory limit post ocs upgrade	2021-06-30 06:02:54 UTC
Github	rook rook pull 7681	None	open	ceph: always apply config flags for mds and rgw	2021-04-19 14:12:59 UTC
Red Hat Knowledge Base (Solution)	5920011	None	None	None	2021-03-31 16:50:34 UTC
Red Hat Product Errata	RHBA-2021:2449	None	None	None	2021-06-15 16:50:53 UTC

Internal Links: 2063374

Comment 1 RHEL Program Management 2021-03-29 12:10:48 UTC

Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 14 Sébastien Han 2021-04-13 14:27:27 UTC

Patrick, if mds_cache_memory_limit is 1GB, this means that the pod memory limit is 2GB.
As far as I can tell, the ocs-operator resources have been set to 8GB as far back as release-4.2: https://github.com/openshift/ocs-operator/blob/release-4.2/pkg/controller/defaults/resources.go

So I'm not sure why that pod got such low memory allocated.
Rook simply looks up the memory limit and applies a 50% ratio to it.

Here are some notes from our code:

	// MDS cache memory limit should be set to 50-60% of RAM reserved for the MDS container
	// MDS uses approximately 125% of the value of mds_cache_memory_limit in RAM.
	// Eventually we will tune this automatically: http://tracker.ceph.com/issues/36663
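The arithmetic behind these notes can be sketched as follows. This is only an illustration of the 50% ratio and the approximate 125% RAM figure quoted above, not Rook's actual code; the helper names `mds_cache_memory_limit` and `approx_mds_ram` are hypothetical.

```python
GIB = 1024 ** 3


def mds_cache_memory_limit(pod_memory_limit_bytes: int) -> int:
    """Hypothetical helper: apply the 50% ratio to the pod memory limit."""
    return pod_memory_limit_bytes // 2


def approx_mds_ram(cache_limit_bytes: int) -> int:
    """Hypothetical helper: the MDS uses roughly 125% of the cache limit in RAM."""
    return cache_limit_bytes * 5 // 4


# 2 GiB is the pod limit implied by a 1GB cache limit; 8 GiB is the ocs-operator default.
for pod_gib in (2, 8):
    limit = mds_cache_memory_limit(pod_gib * GIB)
    print(
        f"pod limit {pod_gib} GiB: mds_cache_memory_limit={limit} bytes, "
        f"approx. MDS RAM use={approx_mds_ram(limit)} bytes"
    )
```

With an 8GB pod limit the ratio gives 4294967296 bytes, while 1GB (1073741824 bytes) corresponds to a 2GB pod limit.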

Comment 19 Sébastien Han 2021-04-14 07:01:10 UTC

mds_cache_memory_limit should be in the "ceph config dump" output.
1GB seems to be the default value of mds_cache_memory_limit.

Could you look at the audit logs (from the mons) and grep for "mds_cache_memory_limit"? I don't know why, but it seems that mds_cache_memory_limit was removed.
Thanks.

Comment 30 Sébastien Han 2021-04-21 08:18:15 UTC

Mudit, looks like the doc text is filled already.

Comment 31 Mudit Agarwal 2021-04-21 08:24:18 UTC

(In reply to Sébastien Han from comment #30)
> Mudit, looks like the doc text is filled already.

That was filled in by the Ceph folks when the initial issue was reported.
They fixed it, so the doc text type was "Bug Fix". But this is now a Rook issue and we have decided not to fix it in 4.7, so we should provide the doc text as a "Known issue".

If the existing doc text is still relevant then it's OK, but it talks about the Ceph fix.

Comment 33 Sébastien Han 2021-05-24 15:36:37 UTC

The doc text was updated, so I am removing my needinfo.

Comment 34 Travis Nielsen 2021-05-25 21:19:27 UTC

Merged: https://github.com/openshift/rook/pull/223

Comment 43 Sébastien Han 2021-06-02 16:18:18 UTC

Mudit, I edited the doc_text.

Comment 44 Petr Balogh 2021-06-03 19:40:24 UTC

Started with OCP 4.3 and OCS 4.2 as the initial deployment, then upgraded OCS and OCP one version at a time.
Once on OCS 4.6.4, I upgraded OCP to 4.7 and then OCS directly to the 4.7.1-403.ci internal build.

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.7.1-403.ci      OpenShift Container Storage   4.7.1-403.ci   ocs-operator.v4.6.4             Succeeded

$ oc rsh -n openshift-storage rook-ceph-tools-784547f7c7-qxfz7
sh-4.4# ceph config dump|grep mds_cache_memory_limit
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296

This looks OK, so I will mark the bug as verified.
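The same check can be scripted. Below is a minimal sketch that parses the two lines of output shown above and compares each value with the expected limit. It assumes the value is the field immediately following the option name and that the expected limit is half of an 8 GiB pod memory limit; `parse_cache_limits` is a hypothetical helper, not part of ocs-ci or Ceph.

```python
# Assumption: 8 GiB pod memory limit, so the expected cache limit is 4 GiB.
EXPECTED_LIMIT = 4 * 1024 ** 3  # 4294967296

# Lines copied from the "ceph config dump | grep mds_cache_memory_limit" output above.
SAMPLE_OUTPUT = """\
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296
"""


def parse_cache_limits(dump_text: str) -> dict:
    """Map each daemon name to its mds_cache_memory_limit value in bytes."""
    limits = {}
    for line in dump_text.splitlines():
        fields = line.split()
        if "mds_cache_memory_limit" not in fields:
            continue
        idx = fields.index("mds_cache_memory_limit")
        # First field is the daemon; the value follows the option name.
        limits[fields[0]] = int(fields[idx + 1])
    return limits


limits = parse_cache_limits(SAMPLE_OUTPUT)
mismatches = {who: value for who, value in limits.items() if value != EXPECTED_LIMIT}
if limits and not mismatches:
    print("all MDS daemons have the expected mds_cache_memory_limit")
else:
    print(f"unexpected values: {mismatches}")
```

On a live cluster the sample text would be replaced with the real command output captured from the toolbox pod.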

Comment 49 errata-xmlrpc 2021-06-15 16:50:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2449
