Bug 1944148

Summary: [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Geo Jose <gjose>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Petr Balogh <pbalogh>
Severity: high
Priority: high
Version: 4.7
CC: asriram, bkunal, ceph-eng-bugs, edonnell, etamir, madam, muagarwa, musoni, nravinas, ocs-bugs, pbalogh, pdonnell, shan, sweil, tdesala, tnielsen, vereddy
Target Milestone: ---
Keywords: AutomationBackLog, ZStream
Target Release: OCS 4.7.1
Hardware: All
OS: All
Fixed In Version: 4.7.1-403.ci
Doc Type: Bug Fix
Doc Text:
Previously, Rook did not apply `mds_cache_memory_limit` on upgrades. As a result, OpenShift Container Storage clusters deployed at version 4.2, which never had that option applied, were not updated with the correct value, which is typically half the size of the MDS pod's memory limit. MDSs in standby-replay could therefore report an oversized cache.
Clone Of:
Clones: 1951348 (view as bug list)
Last Closed: 2021-06-15 16:50:37 UTC
Type: Bug
Bug Blocks: 1938134, 1951348    

Comment 1 RHEL Program Management 2021-03-29 12:10:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 14 Sébastien Han 2021-04-13 14:27:27 UTC
Patrick, if mds_cache_memory_limit is 1GB, this means that the pod memory limit is 2GB.
As far as I can tell, the ocs-operator resource limits have been set to 8GB as far back as release-4.2: https://github.com/openshift/ocs-operator/blob/release-4.2/pkg/controller/defaults/resources.go

So I'm not sure why that pod got such low memory allocated.
Rook simply looks up the memory limit and applies a 50% ratio to it.

Here are some notes from our code:

	// MDS cache memory limit should be set to 50-60% of RAM reserved for the MDS container
	// MDS uses approximately 125% of the value of mds_cache_memory_limit in RAM.
	// Eventually we will tune this automatically: http://tracker.ceph.com/issues/36663
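As a rough sketch of the heuristic those comments describe (the function name and structure here are illustrative, not Rook's actual code), the calculation amounts to taking 50% of the pod's memory limit:

```go
package main

import "fmt"

// mdsCacheMemoryLimit is an illustrative helper (not Rook's real function):
// it applies the 50% ratio described above. The MDS daemon uses roughly 125%
// of mds_cache_memory_limit in resident memory, so halving the pod limit
// leaves headroom below the container's memory limit.
func mdsCacheMemoryLimit(podMemoryLimitBytes uint64) uint64 {
	return podMemoryLimitBytes / 2
}

func main() {
	// Assuming the 8GB pod limit mentioned above, the cache limit comes
	// out to 4294967296 bytes (4 GiB), which matches the value later seen
	// in the "ceph config dump" output in comment 44.
	fmt.Println(mdsCacheMemoryLimit(8 << 30)) // prints 4294967296
}
```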

Comment 19 Sébastien Han 2021-04-14 07:01:10 UTC
mds_cache_memory_limit should be in the "ceph config dump" output.
1GB seems to be the default value of mds_cache_memory_limit.

Could you look at the audit logs (from the mons) and grep for "mds_cache_memory_limit"? I don't know why, but it seems that mds_cache_memory_limit was removed.
Thanks.

Comment 30 Sébastien Han 2021-04-21 08:18:15 UTC
Mudit, looks like the doc text is filled already.

Comment 31 Mudit Agarwal 2021-04-21 08:24:18 UTC
(In reply to Sébastien Han from comment #30)
> Mudit, looks like the doc text is filled already.

That was filled in by the Ceph folks when the initial issue was reported.
They fixed it, so the doc text type was "Bug Fix". However, this is now a Rook issue and we have decided not to fix it in 4.7, so we should provide the doc text as a "Known Issue".

If the existing doc text is still relevant then it's OK, but it talks about the Ceph fix.

Comment 33 Sébastien Han 2021-05-24 15:36:37 UTC
Doc text was updated, so removing my needinfo.

Comment 34 Travis Nielsen 2021-05-25 21:19:27 UTC
Merged: https://github.com/openshift/rook/pull/223

Comment 43 Sébastien Han 2021-06-02 16:18:18 UTC
Mudit, I edited the doc_text.

Comment 44 Petr Balogh 2021-06-03 19:40:24 UTC
Started with OCP 4.3 / OCS 4.2 as the initial deployment and continued by upgrading OCS and OCP one version at a time.
When I was on OCS 4.6.4, I upgraded OCP to 4.7 and then OCS directly to the 4.7.1-403.ci internal build.

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.7.1-403.ci      OpenShift Container Storage   4.7.1-403.ci   ocs-operator.v4.6.4             Succeeded

oc rsh -n openshift-storage rook-ceph-tools-784547f7c7-qxfz7
sh-4.4# ceph config dump|grep mds_cache_memory_limit
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296

So this looks OK, and I will mark it as verified.

Comment 49 errata-xmlrpc 2021-06-15 16:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2449