
Bug 2272099

Summary: MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Patrick Donnelly <pdonnell>
Component: CephFS Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA QA Contact: Hemanth Kumar <hyelloji>
Severity: high Docs Contact:
Priority: medium    
Version: 5.3 CC: amk, ceph-eng-bugs, cephqe-warriors, dwalveka, rpollack, tserlin, vshankar
Target Milestone: ---   
Target Release: 7.0z2   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: ceph-18.2.0-181.el9cp Doc Type: If docs needed, set a value
Doc Text:
Previously, the MDS might not queue the next client request for replay in the up:client-replay state, which caused the MDS to hang in that state. With this fix, the next client replay request is queued automatically as part of request cleanup and the MDS proceeds with failover recovery normally.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-05-07 12:11:43 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2243105    
Bug Blocks:    
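The failure mode in the doc text can be sketched as a toy state machine. This is illustrative Python only; `ReplayQueue`, `_dispatch`, and `_cleanup` are hypothetical names, not Ceph's actual C++ MDS internals. The point it models is that request cleanup must queue the next client replay request, or replay stalls after the first one:

```python
from collections import deque

class ReplayQueue:
    """Toy model of the up:client-replay fix: cleanup of a finished
    request dispatches the next queued replay request, so the whole
    backlog drains and failover recovery can proceed."""

    def __init__(self, requests):
        self.pending = deque(requests)
        self.replayed = []

    def _dispatch(self, req):
        # "Replay" the request, then run cleanup for it.
        self.replayed.append(req)
        self._cleanup()

    def _cleanup(self):
        # With the fix: cleanup automatically queues the next client
        # replay request. Without this step, only the first request
        # would ever be replayed and the MDS would hang in this state.
        if self.pending:
            self._dispatch(self.pending.popleft())

    def replay_all(self):
        # Kick off the first request; cleanup chains the rest.
        if self.pending:
            self._dispatch(self.pending.popleft())
        return self.replayed
```

Running `ReplayQueue(["req1", "req2", "req3"]).replay_all()` drains all three requests in order, mirroring the fixed behavior.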

Description Patrick Donnelly 2024-03-28 20:02:59 UTC
This bug was initially created as a copy of Bug #2243105

I am copying this bug because: 

7.0z2 backport

Description of problem:  MDS: "1 MDSs behind on trimming" and "2 clients failing to respond to cache pressure".

The site is hitting the issue detailed below. I had them gather some MDS data based on what I've learned in the past, and it's extremely strange: there are no ops in flight, only completed ops. I'm not sure what that indicates.

I also scanned every file in the must-gather (MG) for anything indicating SELinux relabeling was at play; it appears that is NOT the case.

The data is loaded in SS under case 03632353.

====
drwxrwxrwx+ 3 yank yank     59 Oct 10 13:15  0040-odf-must-gather-2.tar.gz
drwxrwxrwx+ 3 yank yank     26 Oct 10 15:29  0050-ceph-debug-logs.tar.gz
====

Attachment 0050 [details] is a tar.gz containing ops-in-flight, session ls, and perf dump output captured every few seconds. Again, I see no ops in flight, but certain clients have many completed ops.
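A capture like attachment 0050 could be reproduced with a small loop along these lines. This is a sketch: `sample_mds` is a hypothetical helper, though `ceph daemon <name> ops`, `session ls`, and `perf dump` are real admin-socket commands. The command runner is injectable so the loop can be exercised without a live cluster:

```python
import json
import subprocess
import time

def sample_mds(daemon, rounds=3, interval=5, run=None):
    """Grab ops-in-flight, session list, and perf counters from an MDS
    admin socket every few seconds, like the captures in attachment 0050."""
    if run is None:
        run = lambda cmd: subprocess.check_output(cmd, text=True)
    samples = []
    for _ in range(rounds):
        snap = {}
        for section in ("ops", "session ls", "perf dump"):
            # e.g. `ceph daemon mds.<name> perf dump`
            out = run(["ceph", "daemon", daemon] + section.split())
            snap[section] = json.loads(out)
        samples.append(snap)
        time.sleep(interval)
    return samples
```

Passing a stub `run` (and `interval=0`) lets the loop be dry-run without a cluster.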

I'll apologize in advance: I feel like I've missed something obvious.

BR
Manny


-bash 5.1 $ cat ceph_health_detail 
HEALTH_WARN 1 filesystem is degraded; 2 clients failing to respond to cache pressure; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] MDS_CLIENT_RECALL: 2 clients failing to respond to cache pressure
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-103-131:csi-cephfs-node failing to respond to cache pressure client_id: 31330220
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Client ip-10-2-114-50:csi-cephfs-node failing to respond to cache pressure client_id: 34512838
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming (2026/256) max_segments: 256, num_segments: 2026
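The MDS_TRIM numbers above can be pulled out of the detail line programmatically (a sketch; `trim_backlog` is a hypothetical helper, not part of any Ceph tooling):

```python
import re

def trim_backlog(detail_line):
    """Extract (num_segments, max_segments, ratio) from an MDS_TRIM
    health detail line; returns None if the line doesn't match."""
    m = re.search(r"max_segments:\s*(\d+),\s*num_segments:\s*(\d+)", detail_line)
    if m is None:
        return None
    max_seg = int(m.group(1))
    num_seg = int(m.group(2))
    return num_seg, max_seg, num_seg / max_seg

line = ("mds.ocs-storagecluster-cephfilesystem-b(mds.0): Behind on trimming "
        "(2026/256) max_segments: 256, num_segments: 2026")
num, cap, ratio = trim_backlog(line)
print(f"{num} segments against a cap of {cap}: {ratio:.1f}x over")
# → 2026 segments against a cap of 256: 7.9x over
```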


-bash 5.1 $ cat ceph_versions 
{
    "mon": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 9
    }
}
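Since mixed daemon versions can complicate MDS debugging, it's worth confirming from the `ceph versions` JSON that the whole cluster runs a single build, as it does here (a sketch; `distinct_versions` is a hypothetical helper, and the sample below abbreviates the output above):

```python
import json

def distinct_versions(report):
    """Return the set of distinct version strings across daemon types,
    skipping the 'overall' rollup."""
    found = set()
    for daemon_type, counts in report.items():
        if daemon_type == "overall":
            continue
        found.update(counts)
    return found

# Abbreviated sample mirroring the output above.
sample = json.loads("""{
    "mon": {"ceph version 16.2.10-187.el8cp (...) pacific (stable)": 3},
    "mds": {"ceph version 16.2.10-187.el8cp (...) pacific (stable)": 2},
    "overall": {"ceph version 16.2.10-187.el8cp (...) pacific (stable)": 9}
}""")
print(len(distinct_versions(sample)))  # → 1, i.e. a uniform cluster
```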

Searching for set extended attributes (while not zero, these are extremely low):

-bash 5.1 $ find ./ -type f -exec zgrep -ic setxatt {} \; | grep -v ^0
456
332
362
45
10
4893
6601
1
5644
558
474
894
256
504
672
515
1127
619
940
1102
302
751
680
359
6617
406
470
908
545
664
282
928
520
644
534
312
912
630




Version-Release number of selected component (if applicable):   RHCS 5.3z4


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Amarnath 2024-04-16 06:01:07 UTC
Hi Patrick,

Could you please confirm if this BZ needs to be added to the 7.0z2 release notes? If so, please provide the doc type and the doc text.

Regards,
Amarnath

Comment 9 Amarnath 2024-04-18 07:27:56 UTC
Hi Patrick,

Doc Text picked from backport BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2243105

Regards,
Amarnath

Comment 10 Venky Shankar 2024-04-18 08:02:26 UTC
(In reply to Amarnath from comment #9)
> Hi Patrick,
> 
> Doc Test Picked from backport BZ :
> https://bugzilla.redhat.com/show_bug.cgi?id=2243105

Where is it updated then?

Comment 11 Amarnath 2024-04-21 00:49:12 UTC
Hi Venky,

The doc team updated the Release Notes directly.

Regards,
Amarnath

Comment 12 errata-xmlrpc 2024-05-07 12:11:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743

Comment 18 Red Hat Bugzilla 2024-10-31 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 120 days.