Bug 1944148 - [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.1
Assignee: Sébastien Han
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks: 1938134 1951348
 
Reported: 2021-03-29 12:10 UTC by Geo Jose
Modified: 2021-10-01 03:56 UTC (History)

Fixed In Version: 4.7.1-403.ci
Doc Type: Bug Fix
Doc Text:
Previously, Rook did not apply `mds_cache_memory_limit` during upgrades. As a result, OpenShift Container Storage 4.2 clusters that did not have the option applied were not updated with the correct value, which is typically half of the pod's memory limit. MDS daemons in standby-replay could therefore report an oversized cache.
Clone Of:
: 1951348 (view as bug list)
Environment:
Last Closed: 2021-06-15 16:50:37 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift rook pull 223 | 0 | None | open | Bug 1944148: ceph: always apply config flags for mds and rgw | 2021-04-21 08:16:48 UTC
Github | red-hat-storage ocs-ci pull 4453 | 0 | None | closed | Testcase to check mds cache memory limit post ocs upgrade | 2021-06-30 06:02:54 UTC
Github | rook rook pull 7681 | 0 | None | open | ceph: always apply config flags for mds and rgw | 2021-04-19 14:12:59 UTC
Red Hat Knowledge Base (Solution) | 5920011 | 0 | None | None | None | 2021-03-31 16:50:34 UTC
Red Hat Product Errata | RHBA-2021:2449 | 0 | None | None | None | 2021-06-15 16:50:53 UTC

Internal Links: 2063374

Comment 1 RHEL Program Management 2021-03-29 12:10:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 14 Sébastien Han 2021-04-13 14:27:27 UTC
Patrick, if mds_cache_memory_limit is 1GB, this means that the pod memory limit is 2GB.
As far as I can tell, the ocs-operator resources have been set to 8GB as far back as release-4.2: https://github.com/openshift/ocs-operator/blob/release-4.2/pkg/controller/defaults/resources.go

So I'm not sure why that pod was allocated so little memory.
Rook simply looks up the pod's memory limit and applies a 50% ratio to it.

Here are some notes from our code:

	// MDS cache memory limit should be set to 50-60% of RAM reserved for the MDS container
	// MDS uses approximately 125% of the value of mds_cache_memory_limit in RAM.
	// Eventually we will tune this automatically: http://tracker.ceph.com/issues/36663
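To make the arithmetic in those notes concrete, here is a minimal sketch. It is not Rook's actual code: the function names are hypothetical, and it assumes "GB" means GiB (1024^3 bytes), a fixed 50% ratio, and the approximate 125% RAM figure from the comment above.

```python
GIB = 1024 ** 3  # assumption: sizes in this thread are binary gigabytes


def cache_limit_for_pod(pod_memory_limit_bytes: int, ratio_percent: int = 50) -> int:
    """Hypothetical helper: cache limit as a percentage of the pod memory limit."""
    return pod_memory_limit_bytes * ratio_percent // 100


def expected_mds_ram(cache_limit_bytes: int, usage_percent: int = 125) -> int:
    """Hypothetical helper: rough RAM use, about 125% of the cache limit."""
    return cache_limit_bytes * usage_percent // 100


if __name__ == "__main__":
    for pod_gib in (2, 8):
        limit = cache_limit_for_pod(pod_gib * GIB)
        print(pod_gib, "GiB pod ->", limit, "byte cache limit,",
              expected_mds_ram(limit), "bytes expected RAM")
```

Under these assumptions, a 2GB pod limit gives a 1GB cache limit, and an 8GB pod limit gives 4294967296 bytes.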

Comment 19 Sébastien Han 2021-04-14 07:01:10 UTC
mds_cache_memory_limit should be in the "ceph config dump" output.
1GB seems to be the default value of mds_cache_memory_limit.

Could you look at the audit logs (from the mons) and grep for "mds_cache_memory_limit"? I don't know why, but it seems that mds_cache_memory_limit was removed.
Thanks.

Comment 30 Sébastien Han 2021-04-21 08:18:15 UTC
Mudit, looks like the doc text is filled already.

Comment 31 Mudit Agarwal 2021-04-21 08:24:18 UTC
(In reply to Sébastien Han from comment #30)
> Mudit, looks like the doc text is filled already.

That was filled in by the Ceph folks when the initial issue was reported.
They fixed it, so the doc text type was "Bug Fix", but this is now a Rook issue and we have decided not to fix it in 4.7, so we should provide the doc text as a "Known Issue".

If the existing doc text is still relevant then it's OK, but it talks about the Ceph fix.

Comment 33 Sébastien Han 2021-05-24 15:36:37 UTC
Doc text was updated, so removing my needinfo.

Comment 34 Travis Nielsen 2021-05-25 21:19:27 UTC
Merged: https://github.com/openshift/rook/pull/223

Comment 43 Sébastien Han 2021-06-02 16:18:18 UTC
Mudit, I edited the doc_text.

Comment 44 Petr Balogh 2021-06-03 19:40:24 UTC
Started with OCP 4.3 and OCS 4.2 as the initial deployment and continued by upgrading OCS and OCP one version at a time.
Once on OCS 4.6.4, I upgraded OCP to 4.7 and then OCS directly to the 4.7.1-403.ci internal build.

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.7.1-403.ci      OpenShift Container Storage   4.7.1-403.ci   ocs-operator.v4.6.4             Succeeded

$ oc rsh -n openshift-storage rook-ceph-tools-784547f7c7-qxfz7
sh-4.4# ceph config dump|grep mds_cache_memory_limit
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296

So it looks OK and I will mark this as verified.
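A check like the one above could be scripted roughly as follows. This is only a sketch, not the ocs-ci test case: the function names are hypothetical, the sample text is copied from the output in this comment, and the expected value assumes an 8GB pod limit with the 50% ratio.

```python
SAMPLE_DUMP = """\
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296
"""

OPTION = "mds_cache_memory_limit"


def parse_mds_cache_limits(config_dump: str) -> dict:
    """Map each daemon name to its mds_cache_memory_limit value in bytes."""
    limits = {}
    for line in config_dump.splitlines():
        fields = line.split()
        if OPTION not in fields:
            continue
        idx = fields.index(OPTION)
        # The value is taken to be the column right after the option name.
        if idx + 1 < len(fields):
            limits[fields[0]] = int(fields[idx + 1])
    return limits


def daemons_with_unexpected_limit(limits: dict, expected: int) -> list:
    """Return the daemons whose limit differs from the expected value."""
    return sorted(name for name, value in limits.items() if value != expected)


if __name__ == "__main__":
    expected = 8 * 1024 ** 3 // 2  # assumption: 8GiB pod limit, 50% ratio
    found = parse_mds_cache_limits(SAMPLE_DUMP)
    print(found)
    print("mismatches:", daemons_with_unexpected_limit(found, expected))
```

An empty result from parse_mds_cache_limits would correspond to the original symptom, where the option was missing from "ceph config dump" and the MDS fell back to its default.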

Comment 49 errata-xmlrpc 2021-06-15 16:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2449

