Bug 1951348 - [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Summary: [GSS][CephFS] health warning "MDS cache is too large (3GB/1GB); 0 inodes in use by clients, 0 stray files" for the standby-replay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.5
Assignee: Sébastien Han
QA Contact: Mugdha Soni
URL:
Whiteboard:
Depends On: 1944148
Blocks:
 
Reported: 2021-04-20 01:57 UTC by Mudit Agarwal
Modified: 2021-08-25 09:25 UTC
CC List: 16 users

Fixed In Version: 4.6.5-411.ci
Doc Type: Bug Fix
Doc Text:
Previously, clusters originally deployed with OpenShift Container Storage 4.2 were not updated with the correct cache value because rook did not apply the `mds_cache_memory_limit` argument during upgrades, so MDS daemons in standby-replay could report an oversized cache. With this update, the `mds_cache_memory_limit` argument is applied during upgrades and the MDS daemons operate normally.
Clone Of: 1944148
Environment:
Last Closed: 2021-06-17 15:46:42 UTC
Embargoed:




Links:
* GitHub openshift/rook pull 224 (open): Bug 1944148: ceph: always apply config flags for mds and rgw (last updated 2021-04-21 08:29:12 UTC)
* GitHub red-hat-storage/ocs-ci pull 4453 (closed): Testcase to check mds cache memory limit post ocs upgrade (last updated 2021-06-30 06:08:18 UTC)
* Red Hat Product Errata RHSA-2021:2479 (last updated 2021-06-17 15:47:05 UTC)

Comment 2 Sébastien Han 2021-05-17 13:16:59 UTC
Steps to reproduce:

* Deploy 4.2
* See that mds_cache_memory_limit is not set in the ceph config centralized mon store (ceph config dump | grep mds_cache_memory_limit)
* Upgrade all the way up to the 4.6.5 release and see that mds_cache_memory_limit is set with a value of 4GB; a sketch of these checks follows below
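
A minimal sketch of the before/after checks, run from the rook-ceph toolbox pod (illustrative session; prompts and output will vary per cluster):

# Before the upgrade (cluster originally deployed on 4.2): the grep prints nothing,
# the option is absent from the centralized mon store and the MDS daemons run with
# the Ceph default of 1 GiB.
sh-4.4# ceph config dump | grep mds_cache_memory_limit

# After upgrading to 4.6.5: rook applies the flag, so both MDS daemons should show
# a 4 GiB (4294967296 byte) limit and the cache-size health warning should clear.
sh-4.4# ceph config dump | grep mds_cache_memory_limit
sh-4.4# ceph health detail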

Thanks!

Comment 4 Mudit Agarwal 2021-05-24 05:37:38 UTC
This can be backported now.

Comment 5 Sébastien Han 2021-05-24 07:03:22 UTC
Backported

Comment 10 Mugdha Soni 2021-06-04 08:15:01 UTC
Initially deployed OCP 4.3 with OCS 4.2, then upgraded OCS and OCP one version at a time up to OCS 4.6.5 with OCP 4.6.32. The cluster was healthy after the multiple upgrades.

[root@localhost ocs4_2]# oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.6.5-411.ci      OpenShift Container Storage   4.6.5-411.ci   ocs-operator.v4.6.4             Succeeded

[root@localhost ocs4_2]# oc version
Client Version: 4.5.6
Server Version: 4.6.32
Kubernetes Version: v1.20.0+a0b09eb

sh-4.4# ceph config dump
WHO       MASK LEVEL    OPTION                             VALUE    RO 
global         advanced bluestore_warn_on_legacy_statfs    false       
global         advanced mon_allow_pool_delete              true        
global         advanced mon_pg_warn_min_per_osd            0           
global         advanced osd_pool_default_pg_autoscale_mode on          
global         advanced rbd_default_features               3           
  mgr          advanced mgr/balancer/active                true        
  mgr          advanced mgr/balancer/mode                  upmap       
  mgr          advanced mgr/orchestrator_cli/orchestrator  rook     *  
    osd.0      advanced osd_delete_sleep                   2.000000    
    osd.0      advanced osd_recovery_sleep                 0.100000    
    osd.0      advanced osd_snap_trim_sleep                2.000000    
    osd.1      advanced osd_delete_sleep                   2.000000    
    osd.1      advanced osd_recovery_sleep                 0.100000    
    osd.1      advanced osd_snap_trim_sleep                2.000000    
    osd.2      advanced osd_delete_sleep                   2.000000    
    osd.2      advanced osd_recovery_sleep                 0.100000    
    osd.2      advanced osd_snap_trim_sleep                2.000000    

mds_cache_memory_limit was not set in the ceph config centralized mon store; without it, the MDS daemons fall back to Ceph's built-in default of 1 GiB, which is the 1GB limit shown in the reported health warning.
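
For completeness, the limit an MDS is actually running with can be read through its admin socket (typically from inside the MDS pod, where the daemon's socket lives). The session below is hypothetical; the daemon name is taken from this cluster, but the output is only illustrative of the 1 GiB Ceph default:

sh-4.4# ceph daemon mds.ocs-storagecluster-cephfilesystem-b config get mds_cache_memory_limit
{
    "mds_cache_memory_limit": "1073741824"
}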
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sh-4.4# ceph status
  cluster:
    id:     3aa16ada-a02c-4861-a57f-1d720cfc6f4e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: a(active, since 10h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 10h), 3 in (since 2d)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   3 pools, 72 pgs
    objects: 7.65k objects, 29 GiB
    usage:   89 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     72 active+clean
 
  io:
    client:   1.2 KiB/s rd, 60 KiB/s wr, 2 op/s rd, 6 op/s wr

sh-4.4# ceph health detail
HEALTH_OK

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sh-4.4# ceph fs status
ocs-storagecluster-cephfilesystem - 2 clients
=================================
+------+----------------+-------------------------------------+---------------+-------+-------+
| Rank |     State      |                 MDS                 |    Activity   |  dns  |  inos |
+------+----------------+-------------------------------------+---------------+-------+-------+
|  0   |     active     | ocs-storagecluster-cephfilesystem-a | Reqs:    0 /s |   15  |   18  |
| 0-s  | standby-replay | ocs-storagecluster-cephfilesystem-b | Evts:    0 /s |    5  |    8  |
+------+----------------+-------------------------------------+---------------+-------+-------+
+--------------------------------------------+----------+-------+-------+
|                    Pool                    |   type   |  used | avail |
+--------------------------------------------+----------+-------+-------+
| ocs-storagecluster-cephfilesystem-metadata | metadata |  672k |  455G |
|  ocs-storagecluster-cephfilesystem-data0   |   data   | 48.0k |  455G |
+--------------------------------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

sh-4.4# ceph config dump|grep mds_cache_memory_limit
    mds.ocs-storagecluster-cephfilesystem-a      basic    mds_cache_memory_limit             4294967296                          
    mds.ocs-storagecluster-cephfilesystem-b      basic    mds_cache_memory_limit             4294967296                          

The mds_cache_memory_limit is set to a value of 4GB, which looks good to me.
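
As a quick sanity check on the unit conversion (not part of the original verification run), the dumped value is exactly 4 GiB:

sh-4.4# echo $((4294967296 / 1024 / 1024 / 1024))
4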

Hence, moving this bug to Verified.

Thanks 
Mugdha Soni

Comment 21 errata-xmlrpc 2021-06-17 15:46:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2479

