Bug 2228635

Summary: (mds.1): 3 slow requests are blocked
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Scott Nipp <snipp>
Component: CephFS Assignee: Xiubo Li <xiubli>
Status: ASSIGNED --- QA Contact: Hemanth Kumar <hyelloji>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1 CC: ceph-eng-bugs, cephqe-warriors, gfarnum, mcaldeir, ngangadh, pdonnell, vumrao, xiubli
Target Milestone: ---   
Target Release: 7.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug

Description Scott Nipp 2023-08-02 22:37:34 UTC
Description of problem:
The user is seeing multiple "(mds.1): 3 slow requests are blocked" warnings per day. These do not clear until the MDS is manually failed.

Here is the pattern observed in the output of ceph tell mds.1 dump_blocked_ops
(file names follow Month.Day/Occurrence/mds.1.dump_blocked_ops.txt, e.g. 7.29/1/mds.1.dump_blocked_ops.txt):

7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(client.61253774:12217861 unlink #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 2023-07-29T08:07:03.058452+0000 caller_uid=842788, caller_gid=667140{})",
7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10267 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",
7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10268 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",

7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(client.61253774:12863103 unlink #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 2023-07-29T21:31:59.137464+0000 caller_uid=842788, caller_gid=667140{})",
7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:48248 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 #0x612/200092a27c9 caller_uid=0, caller_gid=0{})",
7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:48249 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 #0x612/200092a27c9 caller_uid=0, caller_gid=0{})",
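
The pattern above (one client unlink plus two mds.1 renames blocked on the same dentry) can be checked across many captures with a small script. The following is only a sketch: it assumes the grep-style lines shown above as input, and the regular expression and the names DESC_RE, group_blocked_ops and SAMPLE are illustrative, not part of any Ceph tooling.

```python
import re
from collections import defaultdict

# Matches the "description" strings shown above:
#   client_request(<requester>:<tid> <op> #<inode>/<name> ...)
DESC_RE = re.compile(
    r"client_request\((?P<requester>[^:\s]+):(?P<tid>\d+) "
    r"(?P<op>\w+) (?P<target>#0x[0-9a-f]+/\S+)"
)


def group_blocked_ops(lines):
    """Group blocked ops by the first path in each description.

    Returns a dict mapping target path -> list of (requester, op) tuples,
    in the order the lines were seen. Lines that do not match are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        m = DESC_RE.search(line)
        if m:
            groups[m["target"]].append((m["requester"], m["op"]))
    return dict(groups)


# The three lines from the first occurrence on 7.29, copied from this report.
SAMPLE = [
    '7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(client.61253774:12217861 unlink #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 2023-07-29T08:07:03.058452+0000 caller_uid=842788, caller_gid=667140{})",',
    '7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10267 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",',
    '7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10268 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",',
]

if __name__ == "__main__":
    for target, ops in group_blocked_ops(SAMPLE).items():
        print(target, ops)
```

A target that shows one client unlink together with MDS-originated renames matches the pattern reported here.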

Version-Release number of selected component (if applicable):
RHCS 6.1 (17.2.6-70.el9cp)

How reproducible:
Occurs 1-3 times per day for the customer.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 RHEL Program Management 2023-08-02 22:37:43 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Scott Nipp 2023-08-02 22:44:27 UTC
We have already requested the following from the customer:

Can you get us an SOS report from your lead MON node and also upload the MDS logs from every node hosting an MDS instance?

Can you also list the blocked ops and in-flight ops and redirect that output to a file? Please attach that file to the case as well.
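
One possible way to capture both listings to files is sketched below. It assumes the ceph CLI is available on the host with a keyring that can reach the MDS, and that mds.1 is the affected daemon. The commands dump_blocked_ops and dump_ops_in_flight are existing MDS tell commands. The output file names and the helper names build_commands and collect_ops are only illustrative.

```python
import subprocess
from pathlib import Path

# MDS "tell" commands for blocked and in-flight operations.
OPS_COMMANDS = ("dump_blocked_ops", "dump_ops_in_flight")


def build_commands(mds_id):
    """Map an output file name (hypothetical naming) to the ceph argv to run."""
    return {
        f"mds.{mds_id}.{cmd}.txt": ["ceph", "tell", f"mds.{mds_id}", cmd]
        for cmd in OPS_COMMANDS
    }


def collect_ops(mds_id, out_dir="."):
    """Run each command and write its stdout to a file under out_dir.

    Raises subprocess.CalledProcessError if a ceph command fails.
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in build_commands(mds_id).items():
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        path = out / filename
        path.write_text(result.stdout)
        written.append(path)
    return written


# Example (not executed here): collect_ops("1", "7.29/1")
```

Running the collection while the slow request warning is active, before the MDS is failed, should capture the blocked ops being reported.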

Comment 3 Scott Nipp 2023-08-02 22:45:26 UTC
Please let us know if there is anything else you would like us to obtain from the customer for this BZ.