Bug 2107110 - [GSS] MDS is behind on trimming and it's degraded
Summary: [GSS] MDS is behind on trimming and it's degraded
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Patrick Donnelly
QA Contact: avdhoot
URL:
Whiteboard:
Depends On: 2109703 2130125
Blocks:
 
Reported: 2022-07-14 11:23 UTC by Priya Pandey
Modified: 2023-12-08 04:29 UTC
CC: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-10 15:27:49 UTC
Embargoed:




Links:
Ceph Project Bug Tracker 56577 (last updated 2022-07-15 21:00:18 UTC)
Red Hat Bugzilla 2109876 - CLOSED - [GSS] mds.0.log _replay journaler got error -2, aborting (last updated 2023-09-29 11:28:00 UTC)

Internal Links: 2108228 2109876

Description Priya Pandey 2022-07-14 11:23:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- MDS is behind on trimming and it's degraded.

--------------------------------------
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (514/256) max_segments: 256, num_segments: 514
--------------------------------------
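
- For context, the MDS_TRIM warning above means the MDS journal holds more segments (num_segments: 514) than the configured limit (max_segments: 256). A minimal sketch for inspecting the current limit and filesystem state from the ODF toolbox pod follows; the openshift-storage namespace and rook-ceph-tools deployment name are the usual ODF defaults and may differ on this cluster, and the commented config change is illustrative only and should not be applied without guidance from support:

--------------------------------------
# Open a shell in the Rook/Ceph toolbox pod (assumes the default
# openshift-storage namespace and rook-ceph-tools deployment).
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inspect the trimming limit and the overall MDS/filesystem state.
ceph config get mds mds_log_max_segments
ceph fs status ocs-storagecluster-cephfilesystem
ceph health detail

# Illustrative only (do not run without support guidance): raise the
# segment limit temporarily while the MDS catches up on trimming.
# ceph config set mds mds_log_max_segments 512
--------------------------------------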

- The customer recently upgraded the cluster from v4.8.11 to v4.9.7; the MDS has been having issues since the upgrade.


- The Ceph cluster was healthy before the upgrade; the current status is as follows:

--------------------------------------
  cluster:
    id:     6e9995b1-8e3f-4bfe-b883-a92d1dfeb68d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 clients failing to respond to cache pressure
 
  services:
    mon: 3 daemons, quorum b,f,i (age 25h)
    mgr: a(active, since 22h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 9M)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 273 pgs
    objects: 6.27M objects, 315 GiB
    usage:   1.9 TiB used, 10 TiB / 12 TiB avail
    pgs:     273 active+clean
--------------------------------------
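
- For reference, the "volumes: 0/1 healthy, 1 recovering" line together with FS_DEGRADED suggests the MDS rank has not reached the up:active state (for example, it may be stuck in up:replay or up:rejoin). A quick way to check the rank state from the toolbox pod (same assumptions as above about namespace and toolbox name) could be:

--------------------------------------
# Show the MDS rank state (up:replay / up:rejoin / up:active) and
# metadata pool usage for the affected filesystem.
ceph fs status ocs-storagecluster-cephfilesystem

# Compact summary of MDS daemon states.
ceph mds stat
--------------------------------------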

Version of all relevant components (if applicable):
v4.9.7


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- PVC provisioning from CephFS is not working.

- Also, some pods are failing to mount their CephFS volumes (see the triage sketch below).
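
- A hedged triage sketch for the OpenShift side is shown here; the PVC name, application namespace, and CSI pod/container names are placeholders and may differ per release:

--------------------------------------
# Events on a stuck PVC (placeholders: application namespace and PVC name).
oc -n <app-namespace> describe pvc <pvc-name>

# CephFS CSI provisioner pods in the ODF namespace.
oc -n openshift-storage get pods | grep csi-cephfsplugin-provisioner

# Provisioner sidecar logs (container names can vary by release).
oc -n openshift-storage logs <provisioner-pod-name> -c csi-provisioner
--------------------------------------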

Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Can this issue reproducible?
N/A

Can this issue reproduce from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A


Actual results:

- The MDS is behind on trimming and the filesystem is degraded.


Expected results:

- The MDS should keep up with journal trimming and the filesystem should report healthy.

Additional info:

Provided in the follow-up comments.

Comment 4 Gerben Wiersma 2022-07-14 12:17:28 UTC
Latest ceph health detail output:

ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (515/256) max_segments: 256, num_segments: 515
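
For reference, the two client_ids in the MDS_CLIENT_RECALL warning can be matched against the MDS session list from the toolbox pod. The sketch below assumes the active MDS is ocs-storagecluster-cephfilesystem-a as reported above; eviction is shown commented out and should only be attempted under support guidance:

--------------------------------------
# List client sessions on the active MDS and match the client_ids
# reported above (19984715 and 19995583).
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls

# Illustrative only (do not run without support guidance): evict a
# session that is holding caps and not responding to cache pressure.
# ceph tell mds.ocs-storagecluster-cephfilesystem-a session evict id=19984715
--------------------------------------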

Comment 8 Varsha Shrivastava 2022-07-15 09:52:09 UTC
Team,

The customer is eagerly looking for a fix on priority owing to the downtime since last Tuesday. Can you please help us with a tentative update?

Comment 42 Venky Shankar 2022-08-26 02:33:15 UTC
Clearing my NI as per https://bugzilla.redhat.com/show_bug.cgi?id=2107110#c41

Comment 102 Red Hat Bugzilla 2023-12-08 04:29:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

