Bug 2294715

Summary: [RHCS 7.1z1] MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Patrick Donnelly <pdonnell>
Component: CephFS    Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA QA Contact: Amarnath <amk>
Severity: medium Docs Contact: Akash Raj <akraj>
Priority: unspecified    
Version: 5.3    CC: akraj, ceph-eng-bugs, cephqe-warriors, gfarnum, julpark, tserlin
Target Milestone: ---   
Target Release: 7.1z1   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: ceph-18.2.1-207.el9cp Doc Type: Bug Fix
Doc Text:
Previously, the MDS might not queue the next client request for replay in the up:client-replay state. Due to this, the MDS would hang in the up:client-replay state. With this fix, the next client replay request is queued automatically as part of request cleanup, and the MDS proceeds with failover recovery normally.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-08-07 11:20:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Patrick Donnelly 2024-06-28 13:24:03 UTC
This bug was initially created as a copy of Bug #2272099

I am copying this bug because: 

a commit that was backported to 7.0 was lost and is not in 7.1

This bug was initially created as a copy of Bug #2243105

I am copying this bug because: 

7.0z2 backport

Description of problem:  MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".

The site is hitting the issue detailed below. Based on what I've learned in the past, I had them gather some MDS data, and it's extremely strange that there are no ops in flight, only completed ops. I'm not sure what that indicates.

I also scanned every file in the must-gather for anything indicating SELinux relabeling was at play; it seems that is NOT the case.

The data is loaded in SS under case 03632353.

====
drwxrwxrwx+ 3 yank yank     59 Oct 10 13:15  0040-odf-must-gather-2.tar.gz
drwxrwxrwx+ 3 yank yank     26 Oct 10 15:29  0050-ceph-debug-logs.tar.gz
====

Attachment 0050 [details] is a tar.gz with ops-in-flight, session ls, and perf dump output captured every few seconds. Again, I see no ops in flight, but certain clients with many completed ops.
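A capture loop like the one behind that attachment can be sketched as follows (the daemon name, round count, interval, and output directory are assumptions to adjust per cluster; the admin-socket commands are the standard ones named above):

```shell
# Sketch of a periodic MDS diagnostic capture (daemon name, rounds, and
# interval are assumptions; run where the MDS admin socket is reachable).
collect_mds_diag() {
    local mds="$1" rounds="${2:-10}" pause="${3:-5}" outdir="${4:-.}"
    local i ts
    for i in $(seq 1 "$rounds"); do
        ts=$(date +%Y%m%d-%H%M%S)
        # Admin-socket queries via the ceph CLI
        ceph daemon "$mds" dump_ops_in_flight > "$outdir/ops-$ts-$i.json"
        ceph daemon "$mds" session ls         > "$outdir/sessions-$ts-$i.json"
        ceph daemon "$mds" perf dump          > "$outdir/perf-$ts-$i.json"
        sleep "$pause"
    done
}

# Example:
# collect_mds_diag mds.ocs-storagecluster-cephfilesystem-b 10 5 /tmp/mds-diag
```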

I'll apologize in advance; I feel like I've missed something obvious.

BR
Manny


-bash 5.1 $ cat ceph_health_detail 
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
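As a rough gauge of the backlog, the MDS_TRIM line above can be parsed with a small filter (a sketch; the message format is assumed from the output shown, and trim_backlog is a hypothetical helper, not a Ceph tool):

```shell
# Extract NUM/MAX from a "Behind on trimming (NUM/MAX)" health line and
# report how far past the limit the journal has grown.
trim_backlog() {
    sed -n 's/.*Behind on trimming (\([0-9]*\)\/\([0-9]*\)).*/\1 \2/p' |
        awk '{printf "%d segments over the %d limit (%.1fx)\n", $1 - $2, $2, $1 / $2}'
}

# Example:
# ceph health detail | trim_backlog
# -> 1770 segments over the 256 limit (7.9x)
```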


-bash 5.1 $ cat ceph_versions 
{
    "mon": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 9
    }
}

Searching the logs for set-extended-attribute (setxattr) operations (while not zero, these counts are extremely low):

-bash 5.1 $ find ./ -type f -exec zgrep -ic setxatt {} \; | grep -v ^0
456
332
362
45
10
4893
6601
1
5644
558
474
894
256
504
672
515
1127
619
940
1102
302
751
680
359
6617
406
470
908
545
664
282
928
520
644
534
312
912
630




Version-Release number of selected component (if applicable):   RHCS 5.3z4


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 errata-xmlrpc 2024-08-07 11:20:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5080