Bug 2228635

Summary: (mds.1): 3 slow requests are blocked
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Scott Nipp <snipp>
Component: CephFS
Assignee: Xiubo Li <xiubli>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 6.1
CC: akraj, bkunal, ceph-eng-bugs, cephqe-warriors, gfarnum, mcaldeir, ngangadh, pdonnell, tserlin, vumrao, xiubli
Target Milestone: ---
Target Release: 7.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-18.2.0-46.el9cp
Doc Type: Bug Fix
Doc Text:
.Deadlocks no longer occur between unlink and reintegration requests
Previously, while fixing an async dirop bug, earlier commits introduced a regression that caused deadlocks between unlink and reintegration requests. With this fix, those commits are reverted, and the deadlock between unlink and reintegration requests no longer occurs.
Story Points: ---
Clone Of:
Clones: 2233131 (view as bug list)
Environment:
Last Closed: 2023-12-13 15:21:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2233131, 2237662    

Description Scott Nipp 2023-08-02 22:37:34 UTC
Description of problem:
The user is getting multiple "(mds.1): 3 slow requests are blocked" warnings per day. These will not clear until the MDS is manually failed.

Here is the pattern observed from ceph tell mds.1 dump_blocked_ops
(7.29/1/mds.1.dump_blocked_ops.txt == Month/Day/Occurrence/mds.1.dump_blocked_ops.txt):

7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(client.61253774:12217861 unlink #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 2023-07-29T08:07:03.058452+0000 caller_uid=842788, caller_gid=667140{})",
7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10267 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",
7.29/1/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:10268 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_823_20230729040702 #0x60e/2000922403e caller_uid=0, caller_gid=0{})",

7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(client.61253774:12863103 unlink #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 2023-07-29T21:31:59.137464+0000 caller_uid=842788, caller_gid=667140{})",
7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:48248 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 #0x612/200092a27c9 caller_uid=0, caller_gid=0{})",
7.29/2/mds.1.dump_blocked_ops.txt:            "description": "client_request(mds.1:48249 rename #0x100013e7b0d/krb5cc_wss_zswdll1p_17005_20230729173158 #0x612/200092a27c9 caller_uid=0, caller_gid=0{})",
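
For reference, a minimal sketch of how this data can be collected and how the condition is cleared, assuming MDS rank 1 as in the dumps above (the output file names are only examples, not taken from the case):

# Dump the blocked and in-flight operations for MDS rank 1
ceph tell mds.1 dump_blocked_ops > mds.1.dump_blocked_ops.txt
ceph tell mds.1 dump_ops_in_flight > mds.1.dump_ops_in_flight.txt

# Extract just the request descriptions, as in the listings above
grep '"description"' mds.1.dump_blocked_ops.txt

# Workaround used so far: the blocked requests only clear once the MDS is manually failed
ceph mds fail 1

Each occurrence shows the same pattern: a client unlink request stuck together with two internal mds.1 rename requests on the same dentry, matching the unlink vs. reintegration deadlock described in the Doc Text.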

Version-Release number of selected component (if applicable):
RHCS 6.1 (17.2.6-70.el9cp)

How reproducible:
Occurring for customer 1-3 times per day.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 RHEL Program Management 2023-08-02 22:37:43 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Scott Nipp 2023-08-02 22:44:27 UTC
We have already requested from the customer...

Can you get us an SOS report from your lead MON node and also upload the MDS logs from every node hosting an MDS instance?

Can you also list the blocked ops and in-flight ops and redirect that output to a file? Please attach that file to the case as well.
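
For illustration, the requested collection could look roughly like the following; the MDS daemon name and file names are placeholders, not values from this case:

# On the lead MON node: generate an SOS report
sos report

# On each node hosting an MDS instance: capture the daemon log
# (daemon name is a placeholder; list the real names with "ceph orch ps --daemon-type mds")
cephadm logs --name mds.cephfs.hostX.xxxxxx > ceph-mds.log

# Blocked and in-flight ops redirected to a single file for attachment to the case
ceph tell mds.1 dump_blocked_ops > mds.1.ops.txt
ceph tell mds.1 dump_ops_in_flight >> mds.1.ops.txt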

Comment 3 Scott Nipp 2023-08-02 22:45:26 UTC
Please let us know if there is anything additional you would like for us to obtain from the customer for this BZ.

Comment 33 Scott Nipp 2023-08-18 14:57:08 UTC
So BofA is still experiencing occasional occurrences of slow/blocked ops on clusters that have been upgraded to 6.1z1.

In their PVCEPH cluster they had another occurrence @ Thu Aug 17 04:50:13 EDT 2023.  They provided the following files uploaded to SupportShell in case 03578367.
ceph-mds.root.host3.wnboxv.log <-- mds.1 before fail @ Thu Aug 17 04:50:13 EDT 2023
ceph-mds.root.host7.oqqvka.log <-- mds.1 after fail @ Thu Aug 17 04:50:13 EDT 2023
mds.1.1692262213.failed.tar.gz <-- taken before mds.1 fail @ Thu Aug 17 04:50:13 EDT 2023
pvceph.ceph.config.dump.mds.txt

In their PTCEPH cluster they are reporting 4 occurrences since 8/6/2023.  Here is a snapshot of those files in SupportShell:
$ yank 03590519
Authenticating the user using the OIDC device authorization grant ...

The SSO authentication is successful

Initializing yank for case 03590519 ...
Retrieving attachments listing for case 03590519 ...

|   IDX |  PRFX  | FILENAME                                   |   SIZE (KB) | DATE                 | SOURCE   |   CACHED |
|-------|--------|--------------------------------------------|-------------|----------------------|----------|----------|
|     1 |  0010  | mds.1.1692238512.failed.tar.gz             |       33.62 | 2023-08-17 15:22 UTC | S3       |      No  |
|     2 |  0020  | ceph-mds.root.host4.duplag.log-20230817.gz |    12417.59 | 2023-08-17 15:22 UTC | S3       |      No  |
|     3 |  0030  | ceph-mds.root.host0.djvost.log-20230817.gz |   149948.99 | 2023-08-17 15:22 UTC | S3       |      No  |

Comment 49 Manny 2023-09-07 10:30:17 UTC
See KCS article #7031927, (https://access.redhat.com/solutions/7031927)

Comment 68 errata-xmlrpc 2023-12-13 15:21:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780