Bug 2294715 - [RHCS 7.1z1] MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Summary: [RHCS 7.1z1] MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1z1
Assignee: Patrick Donnelly
QA Contact: Amarnath
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-06-28 13:24 UTC by Patrick Donnelly
Modified: 2024-08-07 11:20 UTC
CC List: 6 users

Fixed In Version: ceph-18.2.1-207.el9cp
Doc Type: Bug Fix
Doc Text:
Previously, the MDS might not queue the next client request for replay while in the up:client-replay state, causing the MDS to hang in that state. With this fix, the next client replay request is queued automatically as part of request cleanup, and the MDS proceeds with failover recovery normally.
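A minimal way to observe the stuck recovery described above, assuming shell access to the cluster with the admin keyring; the file system name is the one from this case:

    ceph fs status ocs-storagecluster-cephfilesystem   # rank 0 state: a stuck failover sits in clientreplay instead of reaching active
    ceph health detail                                 # details for FS_DEGRADED, MDS_TRIM and MDS_CLIENT_RECALL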
Clone Of:
Environment:
Last Closed: 2024-08-07 11:20:46 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHCEPH-9259 | 0 | None | None | None | 2024-06-28 13:26:08 UTC
Red Hat Product Errata | RHBA-2024:5080 | 0 | None | None | None | 2024-08-07 11:20:49 UTC

Description Patrick Donnelly 2024-06-28 13:24:03 UTC
This bug was initially created as a copy of Bug #2272099

I am copying this bug because: 

a commit that was backported to 7.0 was lost and is not present in 7.1

This bug was initially created as a copy of Bug #2243105

I am copying this bug because: 

7.0z2 backport

Description of problem:  MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".

The site is hitting the issue detailed below. Based on what I've learned in the past, I had them gather some MDS data, and it's extremely strange that there are no ops in flight, only completed ops. I'm not sure what that is indicating.

I also scanned every file in the must-gather (MG) for anything indicating SELinux relabeling was at play; it seems that is NOT the case.

The data is loaded in SS under case 03632353.

====
drwxrwxrwx+ 3 yank yank     59 Oct 10 13:15  0040-odf-must-gather-2.tar.gz
drwxrwxrwx+ 3 yank yank     26 Oct 10 15:29  0050-ceph-debug-logs.tar.gz
====

Attachment 0050 is a tar.gz with ops-in-flight, session ls, and perf dump output captured every few seconds. Again, I see no ops in flight, but certain clients with many completed ops.
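For reference, a collection loop of that shape might look like the sketch below; it assumes it runs where the MDS admin socket is reachable, and the daemon name is the one reported in ceph health detail:

    # capture MDS diagnostics every few seconds (sketch only)
    while sleep 5; do
        ts=$(date +%Y%m%d-%H%M%S)
        ceph daemon mds.ocs-storagecluster-cephfilesystem-b dump_ops_in_flight > ops-in-flight-$ts.json
        ceph daemon mds.ocs-storagecluster-cephfilesystem-b session ls         > session-ls-$ts.json
        ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump          > perf-dump-$ts.json
    done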

I'll apologize in advance; I feel like I've missed something obvious.

BR
Manny


-bash 5.1 $ cat ceph_health_detail 
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
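To cross-check the numbers in these warnings, something like the following should work (a sketch; the perf counter and option names are the upstream ones and may differ slightly between releases):

    ceph config get mds mds_log_max_segments                                  # trim threshold (256 here)
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b perf dump mds_log     # "seg" = current journal segments (2026 here)
    ceph daemon mds.ocs-storagecluster-cephfilesystem-b session ls            # per-session num_caps for clients 31330220 and 34512838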


-bash 5.1 $ cat ceph_versions 
{
    "mon": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 9
    }
}

Searching for set extended attribute (setxattr) operations; the per-file counts below are non-zero but extremely low (a variant of this search that also prints the matching file names is sketched after the list):

-bash 5.1 $ find ./ -type f -exec zgrep -ic setxatt {} \; | grep -v ^0
456
332
362
45
10
4893
6601
1
5644
558
474
894
256
504
672
515
1127
619
940
1102
302
751
680
359
6617
406
470
908
545
664
282
928
520
644
534
312
912
630
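A possible variant of the same search that also prints the matching file name next to each non-zero count (nothing beyond the original pattern is assumed):

    find ./ -type f -exec sh -c 'c=$(zgrep -ic setxatt "$1"); [ "$c" -gt 0 ] && echo "$c $1"' _ {} \;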




Version-Release number of selected component (if applicable):   RHCS 5.3z4


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 errata-xmlrpc 2024-08-07 11:20:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security and bug fix update) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5080

