Bug 2107110

Summary: [GSS] MDS is behind on trimming and it's degraded
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Priya Pandey <prpandey>
Component: ceph
Sub component: CephFS
Assignee: Patrick Donnelly <pdonnell>
QA Contact: avdhoot <asagare>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: medium
Priority: medium
CC: andbartl, asagare, bhull, bkunal, bniver, gfarnum, g.wiersma, hnallurv, hyelloji, joboyer, mduasope, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdonnell, sheggodu, tnielsen, vshankar, vshrivas, vumrao
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-10 15:27:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2109703, 2130125
Bug Blocks:

Description Priya Pandey 2022-07-14 11:23:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The MDS is behind on trimming and the filesystem is degraded.

--------------------------------------
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (514/256) max_segments: 256, num_segments: 514
--------------------------------------
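
For context, a minimal sketch of how one might inspect the trimming state reported above from the rook-ceph-tools pod. The namespace and tools deployment name are assumptions based on a typical ODF install; the MDS daemon name is taken from the health output.

--------------------------------------
# Assumption: standard ODF namespace and rook-ceph-tools deployment names.
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Overall filesystem and MDS rank state.
ceph fs status ocs-storagecluster-cephfilesystem

# Configured trimming limit (num_segments above this raises MDS_TRIM).
ceph config get mds mds_log_max_segments

# Journal/segment counters for the active MDS; on older releases this may
# need to be run against the MDS admin socket instead of via 'ceph tell'.
ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump mds_log
--------------------------------------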

- The customer recently upgraded the cluster from v4.8.11 to v4.9.7; the MDS has been having issues since then.


- The Ceph status was healthy before the upgrade; the current status is as follows:

--------------------------------------
  cluster:
    id:     6e9995b1-8e3f-4bfe-b883-a92d1dfeb68d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 clients failing to respond to cache pressure
 
  services:
    mon: 3 daemons, quorum b,f,i (age 25h)
    mgr: a(active, since 22h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 9M)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 273 pgs
    objects: 6.27M objects, 315 GiB
    usage:   1.9 TiB used, 10 TiB / 12 TiB avail
    pgs:     273 active+clean
--------------------------------------
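
The "volumes: 0/1 healthy, 1 recovering" line usually means the active MDS rank is not in up:active (e.g. stuck in up:replay or up:rejoin). A hedged sketch of commands, run from the toolbox pod, that would confirm which state the rank is in:

--------------------------------------
# Per-rank MDS state (active, replay, rejoin, ...) and standby daemons.
ceph fs status ocs-storagecluster-cephfilesystem
ceph mds stat

# Full MDSMap, including rank states and any damaged/failed ranks.
ceph fs dump
--------------------------------------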

Version of all relevant components (if applicable):
v4.9.7


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

- PVC provisioning from CephFS is not working.

- Also, some pods are failing to mount CephFS volumes (see the triage sketch below).
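
A hedged sketch of the OpenShift-side checks that typically accompany this kind of triage. The PVC and namespace names are placeholders, and the csi-cephfsplugin pod/deployment and container names are assumptions based on a standard ODF install.

--------------------------------------
# Events on a stuck PVC (placeholder names).
oc describe pvc <pvc-name> -n <app-namespace>

# CephFS CSI provisioner and node-plugin pods.
oc get pods -n openshift-storage | grep csi-cephfsplugin

# Provisioner-side errors for failed PVC creation.
oc logs -n openshift-storage deploy/csi-cephfsplugin-provisioner -c csi-provisioner
--------------------------------------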

Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A


Actual results:

- The MDS is behind on trimming and the filesystem is degraded.


Expected results:

- The MDS should keep up with journal trimming and the filesystem should stay healthy.

Additional info:

Provided in the following comments.

Comment 4 Gerben Wiersma 2022-07-14 12:17:28 UTC
Latest CEPH health detail output:

ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (515/256) max_segments: 256, num_segments: 515
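
For reference only, a hedged sketch of how the client sessions flagged above could be listed and, only if support advises (eviction blocklists the client), evicted using the client_id from the health detail:

--------------------------------------
# List client sessions held by the active MDS.
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls

# Evict one of the clients failing to respond to cache pressure
# (client_id taken from the health detail above).
ceph tell mds.ocs-storagecluster-cephfilesystem-a client evict id=19984715
--------------------------------------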

Comment 8 Varsha Shrivastava 2022-07-15 09:52:09 UTC
Team,

The customer is eagerly looking for a fix on priority, owing to downtime since last Tuesday. Can you please help us with a tentative update?

Comment 42 Venky Shankar 2022-08-26 02:33:15 UTC
Clearing my NI as per https://bugzilla.redhat.com/show_bug.cgi?id=2107110#c41

Comment 102 Red Hat Bugzilla 2023-12-08 04:29:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days