Bug 2107110 - [GSS] MDS is behind on trimming and it's degraded [NEEDINFO]
Summary: [GSS] MDS is behind on trimming and it's degraded
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Patrick Donnelly
QA Contact: avdhoot
URL:
Whiteboard:
Depends On: 2109703 2130125
Blocks:
 
Reported: 2022-07-14 11:23 UTC by Priya Pandey
Modified: 2023-08-09 16:37 UTC (History)
21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-10 15:27:49 UTC
Embargoed:
mduasope: needinfo? (nojha)




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 56577 0 None None None 2022-07-15 21:00:18 UTC
Red Hat Bugzilla 2109876 0 unspecified NEW [GSS] mds.0.log _replay journaler got error -2, aborting 2023-05-31 15:28:08 UTC

Internal Links: 2108228 2109876

Description Priya Pandey 2022-07-14 11:23:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- MDS is behind on trimming and it's degraded.

--------------------------------------
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (514/256) max_segments: 256, num_segments: 514
--------------------------------------
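
For reference, the trim backlog and the journal segment limit can be inspected with the commands below. This is a minimal sketch, run from the rook-ceph toolbox pod, with the MDS daemon name taken from the warning above; exact pod and daemon names may differ per deployment.

--------------------------------------
# Filesystem and MDS rank state (e.g. up:active, up:replay)
ceph fs status ocs-storagecluster-cephfilesystem

# Journal trim limit the warning compares against (max_segments: 256)
ceph tell mds.ocs-storagecluster-cephfilesystem-a config get mds_log_max_segments

# Journal perf counters, including the current number of segments
ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump mds_log
--------------------------------------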

- The customer recently upgraded the cluster from v4.8.11 to v4.9.7; since then, the MDS has been having issues.


- The ceph status was healthy before the upgrade; the current status is as follows:

--------------------------------------
  cluster:
    id:     6e9995b1-8e3f-4bfe-b883-a92d1dfeb68d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 clients failing to respond to cache pressure
 
  services:
    mon: 3 daemons, quorum b,f,i (age 25h)
    mgr: a(active, since 22h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 9M)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 273 pgs
    objects: 6.27M objects, 315 GiB
    usage:   1.9 TiB used, 10 TiB / 12 TiB avail
    pgs:     273 active+clean
--------------------------------------
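
Since the status above shows the volume as recovering, it can also help to look at the recent log output of the MDS pods themselves. A minimal sketch, assuming a standard ODF deployment in the openshift-storage namespace; the pod name placeholder must be substituted:

--------------------------------------
# Locate the MDS pods
oc get pods -n openshift-storage | grep mds

# Recent log output of the active MDS (substitute the real pod name)
oc logs -n openshift-storage <mds-a-pod-name> --tail=200
--------------------------------------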

Version of all relevant components (if applicable):
v4.9.7


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- The PVC provisioning from cephfs is not working.

- Also, some pods are failing to mount the cephfs.
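
The provisioning and mount failures can be narrowed down on the OpenShift side as well. A minimal sketch, assuming the default ODF namespace and CephFS CSI pod names; the PVC and namespace placeholders are illustrative:

--------------------------------------
# CephFS CSI provisioner and node-plugin pods
oc get pods -n openshift-storage | grep csi-cephfsplugin

# Events on a stuck PVC usually show the underlying provisioning or mount error
oc describe pvc <pvc-name> -n <app-namespace>
--------------------------------------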

Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A


Actual results:

- The MDS is behind on trimming and the filesystem is degraded.


Expected results:

- The MDS should not be behind on trimming and the filesystem should not be degraded.

Additional info:

In the next comments

Comment 4 Gerben Wiersma 2022-07-14 12:17:28 UTC
Latest CEPH health detail output:

ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (515/256) max_segments: 256, num_segments: 515
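
For the MDS_CLIENT_RECALL part of the warning, the client sessions can be listed to see how many caps the two flagged clients (IDs 19984715 and 19995583) are holding. A minimal sketch, run from the rook-ceph toolbox pod; the daemon name is taken from the output above:

--------------------------------------
# List client sessions; the num_caps field shows how many capabilities each client holds
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls
--------------------------------------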

Comment 8 Varsha Shrivastava 2022-07-15 09:52:09 UTC
Team,

The customer is eagerly looking for a fix on priority, owing to downtime since last Tuesday. Can you please provide a tentative update?

Comment 42 Venky Shankar 2022-08-26 02:33:15 UTC
Clearing my NI as per https://bugzilla.redhat.com/show_bug.cgi?id=2107110#c41

