Bug 2238663

Summary: mds: blocklist clients with "bloated" session metadata
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Venky Shankar <vshankar>
Component: CephFS Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA QA Contact: sumr
Severity: urgent Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified    
Version: 5.3 CC: akraj, assingh, ceph-eng-bugs, cephqe-warriors, gconsalv, mcaldeir, ngangadh, pdonnell, sumr, tserlin, vdas
Target Milestone: ---   
Target Release: 7.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-18.2.0-111.el9cp Doc Type: Bug Fix
Doc Text:
.Blocklist and evict client for large session metadata
Previously, large client metadata buildup in the MDS would sometimes cause the MDS to switch to read-only mode. With this fix, the client that is causing the buildup is blocklisted and evicted, allowing the MDS to work as expected.
Story Points: ---
Clone Of:
: 2238665 2238666 Environment:
Last Closed: 2023-12-13 15:23:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2238665, 2238666    

Description Venky Shankar 2023-09-13 04:01:10 UTC
If a session's "completed_requests" vector grows too large, the encoded session can reach a size at which the OSD rejects sessionmap object updates with "Message too long" (error 90), and the MDS then forces the file system read-only.

2023-07-10 13:53:30.529 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:30.529 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:30.530 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:30.534 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:30.534 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:30.534 7f8fed08b700  0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
2023-07-10 13:53:35.635 7f8fe687e700 -1 mds.0.2679609 unhandled write error (90) Message too long, force readonly...
2023-07-10 13:53:35.635 7f8fe687e700  1 mds.0.cache force file system read-only
2023-07-10 13:53:35.635 7f8fe687e700  0 log_channel(cluster) log [WRN] : force file system read-only
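For triage, warnings like the ones above can be scanned to find which client is accumulating completed requests. The following is a minimal sketch, assuming the log line format is exactly as quoted in this report; the function name stuck_clients is hypothetical and not part of Ceph.

```python
import re

# Pattern assumed from the warning lines quoted in this bug report.
WARN_RE = re.compile(
    r"client\.(?P<client>\d+) does not advance its oldest_client_tid "
    r"\((?P<tid>\d+)\), (?P<count>\d+) completed requests recorded in session"
)


def stuck_clients(lines):
    """Return {client_id: largest completed-request count seen} for matching warnings."""
    worst = {}
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            client_id = int(match.group("client"))
            count = int(match.group("count"))
            worst[client_id] = max(worst.get(client_id, 0), count)
    return worst
```

Feeding it the lines of an MDS log (for example, an open file object) yields one entry per client that triggered the warning, which can help decide which session to inspect or evict manually.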

Proposed fix: if a session's encoded size exceeds a configurable threshold (maybe 16MB), evict the client.
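The idea can be sketched roughly as follows. This is an illustration only, not the MDS implementation: the Session class, the per-entry size of 16 bytes, and the constant name are all assumptions made for the example, and the 16 MiB limit simply mirrors the "maybe 16MB" figure suggested here.

```python
from dataclasses import dataclass

# Hypothetical threshold, mirroring the "maybe 16MB" suggested in this report.
MAX_ENCODED_SESSION_BYTES = 16 * 1024 * 1024

# Assumed cost per completed-request entry; the real encoding differs.
ASSUMED_BYTES_PER_ENTRY = 16


@dataclass
class Session:
    """Toy stand-in for an MDS client session (not the Ceph class)."""
    client_id: int
    completed_request_count: int

    def encoded_size(self):
        # Rough estimate: entry count times the assumed per-entry size.
        return self.completed_request_count * ASSUMED_BYTES_PER_ENTRY


def sessions_to_evict(sessions, limit=MAX_ENCODED_SESSION_BYTES):
    """Return the client ids whose estimated encoded size exceeds the limit."""
    return [s.client_id for s in sessions if s.encoded_size() > limit]
```

Under these assumptions, the session from the log above (5905929 completed requests) would be selected for eviction, while a session with a handful of completed requests would not.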

Note for QE: the steps to reproduce can be taken from the test case here: https://github.com/ceph/ceph/pull/52944/commits/84df4b3d0c9e767a74cf5af80e8138239992df2c#diff-1da45c7534a9accb30d17e5abf05f55ca5cc0df3a7fe826049c0fe23154a7d63R225

Comment 1 RHEL Program Management 2023-09-13 04:01:20 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Gianluca Consalvi 2023-09-14 08:19:24 UTC
Hello, we have a customer who is facing this issue, and it is blocking their ongoing upgrade. We kindly request a backport to OCP 4.12 and an ETA for it.

Comment 22 errata-xmlrpc 2023-12-13 15:23:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

Comment 23 Red Hat Bugzilla 2024-06-16 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days