Bug 2107110 - [GSS] MDS is behind on trimming and it's degraded [NEEDINFO]
Summary: [GSS] MDS is behind on trimming and it's degraded
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Patrick Donnelly
QA Contact: avdhoot
URL:
Whiteboard:
Depends On: 2109703 2130125
Blocks:
 
Reported: 2022-07-14 11:23 UTC by Priya Pandey
Modified: 2023-08-09 16:37 UTC (History)
21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-10 15:27:49 UTC
Embargoed:
mduasope: needinfo? (nojha)




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 56577 0 None None None 2022-07-15 21:00:18 UTC
Red Hat Bugzilla 2109876 0 unspecified NEW [GSS] mds.0.log _replay journaler got error -2, aborting 2023-05-31 15:28:08 UTC

Internal Links: 2108228 2109876

Description Priya Pandey 2022-07-14 11:23:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- MDS is behind on trimming and it's degraded.

--------------------------------------
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (514/256) max_segments: 256, num_segments: 514
--------------------------------------
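
For reference, the trim backlog and the journal segment limit can be inspected with the commands below. This is a minimal sketch, run from the rook-ceph toolbox pod, with the MDS daemon name taken from the warning above; exact pod and daemon names may differ per deployment.

--------------------------------------
# Filesystem and MDS rank state (e.g. up:active, up:replay)
ceph fs status ocs-storagecluster-cephfilesystem

# Journal trim limit the warning compares against (max_segments: 256)
ceph tell mds.ocs-storagecluster-cephfilesystem-a config get mds_log_max_segments

# Journal perf counters, including the current number of segments
ceph tell mds.ocs-storagecluster-cephfilesystem-a perf dump mds_log
--------------------------------------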

- The customer recently upgraded the cluster from v4.8.11 to v4.9.7; since then, the MDS has been having issues.


- The ceph status was healthy before the upgrade; the current status is as follows:

--------------------------------------
  cluster:
    id:     6e9995b1-8e3f-4bfe-b883-a92d1dfeb68d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 clients failing to respond to cache pressure
 
  services:
    mon: 3 daemons, quorum b,f,i (age 25h)
    mgr: a(active, since 22h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 9M)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   11 pools, 273 pgs
    objects: 6.27M objects, 315 GiB
    usage:   1.9 TiB used, 10 TiB / 12 TiB avail
    pgs:     273 active+clean
--------------------------------------
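
Since the status above shows the volume as recovering, it can also help to look at the recent log output of the MDS pods themselves. A minimal sketch, assuming a standard ODF deployment in the openshift-storage namespace; the pod name placeholder must be substituted:

--------------------------------------
# Locate the MDS pods
oc get pods -n openshift-storage | grep mds

# Recent log output of the active MDS (substitute the real pod name)
oc logs -n openshift-storage <mds-a-pod-name> --tail=200
--------------------------------------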

Version of all relevant components (if applicable):
v4.9.7


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- The PVC provisioning from cephfs is not working.

- Also, some pods are failing to mount the cephfs.
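
The provisioning and mount failures can be narrowed down on the OpenShift side as well. A minimal sketch, assuming the default ODF namespace and CephFS CSI pod names; the PVC and namespace placeholders are illustrative:

--------------------------------------
# CephFS CSI provisioner and node-plugin pods
oc get pods -n openshift-storage | grep csi-cephfsplugin

# Events on a stuck PVC usually show the underlying provisioning or mount error
oc describe pvc <pvc-name> -n <app-namespace>
--------------------------------------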

Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A


Actual results:

- The MDS is behind on trimming and the filesystem is degraded.


Expected results:

- The MDS should not be behind on trimming and the filesystem should not be degraded.

Additional info:

In the next comments

Comment 4 Gerben Wiersma 2022-07-14 12:17:28 UTC
Latest CEPH health detail output:

ceph health detail
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pqcn01w3354.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19984715
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Client pwcn01w3359.isl.belastingdienst.nl:csi-cephfs-node failing to respond to cache pressure client_id: 19995583
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-a(mds.0): Behind on trimming (515/256) max_segments: 256, num_segments: 515
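
For the MDS_CLIENT_RECALL part of the warning, the client sessions can be listed to see how many caps the two flagged clients (IDs 19984715 and 19995583) are holding. A minimal sketch, run from the rook-ceph toolbox pod; the daemon name is taken from the output above:

--------------------------------------
# List client sessions; the num_caps field shows how many capabilities each client holds
ceph tell mds.ocs-storagecluster-cephfilesystem-a session ls
--------------------------------------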

Comment 8 Varsha Shrivastava 2022-07-15 09:52:09 UTC
Team,

The customer is eagerly looking for a fix on priority, owing to downtime since last Tuesday. Can you please provide a tentative update?

Comment 42 Venky Shankar 2022-08-26 02:33:15 UTC
Clearing my NI as per https://bugzilla.redhat.com/show_bug.cgi?id=2107110#c41

