Bug 2107110

Summary: [GSS] MDS is behind on trimming and it's degraded
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Priya Pandey <prpandey>
Component: ceph
Sub component: CephFS
Assignee: Patrick Donnelly <pdonnell>
QA Contact: avdhoot <asagare>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: medium
Priority: medium
CC: andbartl, asagare, bhull, bkunal, bniver, gfarnum, g.wiersma, hnallurv, hyelloji, joboyer, mduasope, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdonnell, sheggodu, tnielsen, vshankar, vshrivas, vumrao
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-10 15:27:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2109703, 2130125
Bug Blocks:

Description Priya Pandey 2022-07-14 11:23:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The MDS is behind on trimming and the filesystem is degraded.

--------------------------------------
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (514/256) max_segments: 256, num_segments: 514
--------------------------------------
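
For context, a minimal sketch of how one might inspect the trimming state reported above from the rook-ceph-tools pod. The namespace and tools deployment name are assumptions based on a typical ODF install; the MDS daemon name is taken from the health output.

--------------------------------------
# Assumption: standard ODF namespace and rook-ceph-tools deployment names.
oc rsh -n openshift-storage deploy/rook-ceph-tools

# Overall filesystem and MDS rank state.
ceph fs status ocs-storagecluster-cephfilesystem

# Configured trimming limit (num_segments above this raises MDS_TRIM).
ceph config get mds mds_log_max_segments

# Journal/segment counters for the active MDS; on older releases this may
# need to be run against the MDS admin socket instead of via 'ceph tell'.
ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump mds_log
--------------------------------------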

- The customer recently upgraded the cluster from v4.8.11 to v4.9.7; the MDS has been having issues since then.


- The Ceph status was healthy before the upgrade; the current status is as follows:

--------------------------------------
  cluster:
    id:     6e9995b1-8e3f-4bfe-b883-a92d1dfeb68d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 clients failing to respond to cache pressure
 
  services:
    mon: 3 daemons, quorum b,f,i (age 25h)
    mgr: a(active, since 22h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 9M)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 273 pgs
    objects: 6.27M objects, 315 GiB
    usage:   1.9 TiB used, 10 TiB / 12 TiB avail
    pgs:     273 active+clean
--------------------------------------
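
The "volumes: 0/1 healthy, 1 recovering" line usually means the active MDS rank is not in up:active (e.g. stuck in up:replay or up:rejoin). A hedged sketch of commands, run from the toolbox pod, that would confirm which state the rank is in:

--------------------------------------
# Per-rank MDS state (active, replay, rejoin, ...) and standby daemons.
ceph fs status ocs-storagecluster-cephfilesystem
ceph mds stat

# Full MDSMap, including rank states and any damaged/failed ranks.
ceph fs dump
--------------------------------------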

Version of all relevant components (if applicable):
v4.9.7


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

- PVC provisioning from CephFS is not working.

- Also, some pods are failing to mount CephFS volumes (see the triage sketch below).
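
A hedged sketch of the OpenShift-side checks that typically accompany this kind of triage. The PVC and namespace names are placeholders, and the csi-cephfsplugin pod/deployment and container names are assumptions based on a standard ODF install.

--------------------------------------
# Events on a stuck PVC (placeholder names).
oc describe pvc <pvc-name> -n <app-namespace>

# CephFS CSI provisioner and node-plugin pods.
oc get pods -n openshift-storage | grep csi-cephfsplugin

# Provisioner-side errors for failed PVC creation.
oc logs -n openshift-storage deploy/csi-cephfsplugin-provisioner -c csi-provisioner
--------------------------------------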

Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A


Actual results:

- The MDS is behind on trimming and the filesystem is degraded.


Expected results:

- The MDS should keep up with journal trimming and the filesystem should stay healthy.

Additional info:

Provided in the following comments.

Comment 4 Gerben Wiersma 2022-07-14 12:17:28 UTC
Latest CEPH health detail output:

ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (515/256) max_segments: 256, num_segments: 515
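
For reference only, a hedged sketch of how the client sessions flagged above could be listed and, only if support advises (eviction blocklists the client), evicted using the client_id from the health detail:

--------------------------------------
# List client sessions held by the active MDS.
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls

# Evict one of the clients failing to respond to cache pressure
# (client_id taken from the health detail above).
ceph tell mds.ocs-storagecluster-cephfilesystem-a client evict id=19984715
--------------------------------------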

Comment 8 Varsha Shrivastava 2022-07-15 09:52:09 UTC
Team,

The customer is eagerly looking for a fix on priority, owing to downtime since last Tuesday. Can you please help us with a tentative update?

Comment 42 Venky Shankar 2022-08-26 02:33:15 UTC
Clearing my NI as per https://bugzilla.redhat.com/show_bug.cgi?id=2107110#c41

Comment 102 Red Hat Bugzilla 2023-12-08 04:29:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days