Bug 1951348 - [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Summary: [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.5
Assignee: Sébastien Han
QA Contact: Mugdha Soni
URL:
Whiteboard:
Depends On: 1944148
Blocks:
 
Reported: 2021-04-20 01:57 UTC by Mudit Agarwal
Modified: 2021-08-25 09:25 UTC
CC List: 16 users

Fixed In Version: 4.6.5-411.ci
Doc Type: Bug Fix
Doc Text:
Previously, clusters originally deployed with OpenShift Container Storage 4.2 were not updated with the correct cache value because rook did not apply the `mds_cache_memory_limit` argument during upgrades, so MDS daemons in standby-replay could report an oversized cache. With this update, the `mds_cache_memory_limit` argument is applied during upgrades and the MDS daemons operate normally.
Clone Of: 1944148
Environment:
Last Closed: 2021-06-17 15:46:42 UTC
Embargoed:




Links:
* GitHub openshift/rook pull 224 (open): Bug 1944148: ceph: always apply config flags for mds and rgw (last updated 2021-04-21 08:29:12 UTC)
* GitHub red-hat-storage/ocs-ci pull 4453 (closed): Testcase to check mds cache memory limit post ocs upgrade (last updated 2021-06-30 06:08:18 UTC)
* Red Hat Product Errata RHSA-2021:2479 (last updated 2021-06-17 15:47:05 UTC)

Comment 2 Sébastien Han 2021-05-17 13:16:59 UTC
Steps to reproduce:

* Deploy 4.2
* See that mds_cache_memory_limit is not set in the ceph config centralized mon store (ceph config dump | grep mds_cache_memory_limit)
* Upgrade all the way up to the 4.6.5 release and see that mds_cache_memory_limit is set with a value of 4GB; a sketch of these checks follows below
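
A minimal sketch of the before/after checks, run from the rook-ceph toolbox pod (illustrative session; prompts and output will vary per cluster):

# Before the upgrade (cluster originally deployed on 4.2): the grep prints nothing,
# the option is absent from the centralized mon store and the MDS daemons run with
# the Ceph default of 1 GiB.
sh-4.4# ceph config dump | grep mds_cache_memory_limit

# After upgrading to 4.6.5: rook applies the flag, so both MDS daemons should show
# a 4 GiB (4294967296 byte) limit and the cache-size health warning should clear.
sh-4.4# ceph config dump | grep mds_cache_memory_limit
sh-4.4# ceph health detail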

Thanks!

Comment 4 Mudit Agarwal 2021-05-24 05:37:38 UTC
This can be backported now.

Comment 5 Sébastien Han 2021-05-24 07:03:22 UTC
Backported

Comment 10 Mugdha Soni 2021-06-04 08:15:01 UTC
Initially deployed OCP 4.3 with OCS 4.2, then upgraded OCS and OCP one version at a time up to OCS 4.6.5 with OCP 4.6.32. The cluster was healthy after the multiple upgrades.

[root@localhost ocs4_2]# oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.6.5-411.ci      OpenShift Container Storage   4.6.5-411.ci   ocs-operator.v4.6.4             Succeeded

[root@localhost ocs4_2]# oc version
Client Version: 4.5.6
Server Version: 4.6.32
Kubernetes Version: v1.20.0+a0b09eb

sh-4.4# ceph config dump
WHO       MASK LEVEL    OPTION                             VALUE    RO 
global         advanced bluestore_warn_on_legacy_statfs    false       
global         advanced mon_allow_pool_delete              true        
global         advanced mon_pg_warn_min_per_osd            0           
global         advanced osd_pool_default_pg_autoscale_mode on          
global         advanced rbd_default_features               3           
  mgr          advanced mgr/balancer/active                true        
  mgr          advanced mgr/balancer/mode                  upmap       
  mgr          advanced mgr/orchestrator_cli/orchestrator  rook     *  
    osd.0      advanced osd_delete_sleep                   2.000000    
    osd.0      advanced osd_recovery_sleep                 0.100000    
    osd.0      advanced osd_snap_trim_sleep                2.000000    
    osd.1      advanced osd_delete_sleep                   2.000000    
    osd.1      advanced osd_recovery_sleep                 0.100000    
    osd.1      advanced osd_snap_trim_sleep                2.000000    
    osd.2      advanced osd_delete_sleep                   2.000000    
    osd.2      advanced osd_recovery_sleep                 0.100000    
    osd.2      advanced osd_snap_trim_sleep                2.000000    

mds_cache_memory_limit was not set in the ceph config centralized mon store; without it, the MDS daemons fall back to Ceph's built-in default of 1 GiB, which is the 1GB limit shown in the reported health warning.
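
For completeness, the limit an MDS is actually running with can be read through its admin socket (typically from inside the MDS pod, where the daemon's socket lives). The session below is hypothetical; the daemon name is taken from this cluster, but the output is only illustrative of the 1 GiB Ceph default:

sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_memory_limit
{
    "mds_cache_memory_limit": "1073741824"
}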
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sh-4.4# ceph status
  cluster:
    id:     3aa16ada-a02c-4861-a57f-1d720cfc6f4e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 10h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 10h), 3 in (since 2d)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   3 pools, 72 pgs
    objects: 7.65k objects, 29 GiB
    usage:   89 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     72 active+clean
 
  io:
    client:   1.2 KiB/s rd, 60 KiB/s wr, 2 op/s rd, 6 op/s wr

sh-4.4# ceph health detail
HEALTH_OK

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sh-4.4# ceph fs status
ocs-storagecluster-cephfilesystem - 2 clients
=================================
+------+----------------+-------------------------------------+---------------+-------+-------+
| Rank |     State      |                 MDS                 |    Activity   |  dns  |  inos |
+------+----------------+-------------------------------------+---------------+-------+-------+
|  0   |     active     | ocs-storagecluster-cephfilesystem-a | Reqs:    0 /s |   15  |   18  |
| 0-s  | standby-replay | ocs-storagecluster-cephfilesystem-b | Evts:    0 /s |    5  |    8  |
+------+----------------+-------------------------------------+---------------+-------+-------+
+--------------------------------------------+----------+-------+-------+
|                    Pool                    |   type   |  used | avail |
+--------------------------------------------+----------+-------+-------+
| ocs-storagecluster-cephfilesystem-metadata | metadata |  672k |  455G |
|  ocs-storagecluster-cephfilesystem-data0   |   data   | 48.0k |  455G |
+--------------------------------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

sh-4.4# ceph config dump|grep mds_cache_memory_limit
    mds.ocs-storagecluster-cephfilesystem-a      basic    mds_cache_memory_limit             4294967296                          
    mds.ocs-storagecluster-cephfilesystem-b      basic    mds_cache_memory_limit             4294967296                          

The mds_cache_memory_limit is set to a value of 4GB, which looks good to me.
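
As a quick sanity check on the unit conversion (not part of the original verification run), the dumped value is exactly 4 GiB:

sh-4.4# echo $((4294967296 / 1024 / 1024 / 1024))
4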

Hence, moving this bug to Verified.

Thanks 
Mugdha Soni

Comment 21 errata-xmlrpc 2021-06-17 15:46:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2479

