Bug 2272099 - MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Summary: MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 7.0z2
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On: 2243105
Blocks:
 
Reported: 2024-03-28 20:02 UTC by Patrick Donnelly
Modified: 2024-10-31 04:25 UTC
CC List: 7 users

Fixed In Version: ceph-18.2.0-181.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Previously, the MDS might not queue the next client request for replay in the up:client-replay state, causing the MDS to hang in that state. With this fix, the next client replay request is queued automatically as part of request cleanup and the MDS proceeds with failover recovery normally.
Clone Of:
Environment:
Last Closed: 2024-05-07 12:11:43 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 63418 0 None None None 2024-03-28 20:02:58 UTC
Red Hat Issue Tracker RHCEPH-8688 0 None None None 2024-03-28 20:04:50 UTC
Red Hat Product Errata RHBA-2024:2743 0 None None None 2024-05-07 12:11:46 UTC

Description Patrick Donnelly 2024-03-28 20:02:59 UTC
This bug was initially created as a copy of Bug #2243105

I am copying this bug because: 

7.0z2 backport

Description of problem:  MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".

The site is hitting the issue detailed below. Based on what I've learned in the past, I had them gather some MDS data, and it's extremely strange that there are no ops in flight, only completed ops. I'm not sure what that is indicating.

I also scanned every file in the must-gather for anything indicating SELinux relabeling was at play; it seems that is NOT the case.

The data is loaded in SS under case 03632353.

====
drwxrwxrwx+ 3 yank yank     59 Oct 10 13:15  0040-odf-must-gather-2.tar.gz
drwxrwxrwx+ 3 yank yank     26 Oct 10 15:29  0050-ceph-debug-logs.tar.gz
====

Attachment 0050 [details] is a tar.gz with ops-in-flight, session ls, and perf dump output captured every few seconds. Again, I see no ops in flight, but certain clients with many completed ops.
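
For reference, a minimal sketch of the kind of capture loop behind that attachment (not the exact script used; the MDS name, output file names, and 5-second interval are assumptions, and the ceph daemon commands must be run where the MDS admin socket is reachable):

# Sketch only: replace <name> with the active MDS,
# e.g. ocs-storagecluster-cephfilesystem-b.
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    ceph daemon mds.<name> dump_ops_in_flight > ops_in_flight.$ts.json
    ceph daemon mds.<name> session ls         > session_ls.$ts.json
    ceph daemon mds.<name> perf dump          > perf_dump.$ts.json
    sleep 5
done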

I'll apologize in advance; I feel like I've missed something obvious.

BR
Manny


-bash 5.1 $ cat ceph_health_detail 
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
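
For context when reading the MDS_TRIM line above: the warning fires when the journal segment count (num_segments, here 2026) stays well above the trim target mds_log_max_segments (here 256). Two example checks, as a sketch (the MDS name is a placeholder; run where the MDS admin socket is reachable):

# Sketch only: replace <name> with the MDS named in the warning.
ceph daemon mds.<name> config get mds_log_max_segments   # current trim target
ceph daemon mds.<name> perf dump mds_log                 # journal/segment counters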


-bash 5.1 $ cat ceph_versions 
{
    "mon": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 9
    }
}

Searching for set extended attributes (while not zero, these are extremely low):

-bash 5.1 $ find ./ -type f -exec zgrep -ic setxatt {} \; | grep -v ^0
456
332
362
45
10
4893
6601
1
5644
558
474
894
256
504
672
515
1127
619
940
1102
302
751
680
359
6617
406
470
908
545
664
282
928
520
644
534
312
912
630
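
The -exec ... \; form above runs zgrep on one file at a time, so only the per-file counts are printed and the file names are lost. A small variant (a sketch, same search string) that keeps each file name next to its count and drops files with zero matches:

find ./ -type f -exec sh -c 'printf "%s: %s\n" "$1" "$(zgrep -ic setxatt "$1")"' sh {} \; | grep -v ': 0$'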




Version-Release number of selected component (if applicable):   RHCS 5.3z4


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Amarnath 2024-04-16 06:01:07 UTC
Hi Patrick,

Could you please confirm if this BZ needs to be added to the 7.0z2 release notes? If so, please provide the doc type and the doc text.

Regards,
Amarnath

Comment 9 Amarnath 2024-04-18 07:27:56 UTC
Hi Patrick,

Doc Text picked from backport BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2243105

Regards,
Amarnath

Comment 10 Venky Shankar 2024-04-18 08:02:26 UTC
(In reply to Amarnath from comment #9)
> Hi Patrick,
> 
> Doc Text picked from backport BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=2243105

Where is it updated then?

Comment 11 Amarnath 2024-04-21 00:49:12 UTC
Hi Venky,

The doc team updated the Release Notes directly.

Regards,
Amarnath

Comment 12 errata-xmlrpc 2024-05-07 12:11:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743

Comment 18 Red Hat Bugzilla 2024-10-31 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

