Bug 2177864

Summary: [5.2 ceph cluster] Both MDS are stuck in replay
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Brett Hull <bhull>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact:
Priority: high
Version: 5.2
CC: ceph-eng-bugs, cephqe-warriors, gfarnum, snipp, vshankar
Target Milestone: ---
Target Release: 6.1z1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-12 01:42:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Brett Hull 2023-03-13 17:53:02 UTC
Description of problem: 
Both MDS rank 0 and rank 1 are stuck in replay, and the ranks keep failing over to the standby MDS daemons.
Prior to the issue:
1) Both MDS were active.
2) Kicked off an MDS scrub via (a status check is sketched below):
   a) ceph tell mds.0 scrub start / recursive,repair
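
(For reference, a minimal sketch of checking scrub progress on a rank through the same tell interface; rank 0 is used only to match the command above:)

    # check scrub progress on rank 0
    ceph tell mds.0 scrub status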

We attempted to pause the MDS scrubs.

This MDS paused quickly: ceph-mds.root.host7.oqqvka.asok
This one is still trying to pause: ceph-mds.root.host3.rdnzhn.asok
We tried to abort after the pause had been running for 10 minutes;
now it is just stuck in PAUSING+ABORTING (the pause/abort commands are sketched after the status output below):
{
    "status": "PAUSING+ABORTING (2959255 inodes in the stack)",
    "scrubs": {
        "382d27c9-2a0f-472e-bc68-3149557ff890": {
            "path": "/",
            "tag": "382d27c9-2a0f-472e-bc68-3149557ff890",
            "options": "recursive,repair"
        }
    }
}
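
(For reference, a hedged sketch of the scrub pause/abort tell commands involved; the rank number is an assumption, not taken from the report:)

    # pause in-flight scrub operations on the given rank
    ceph tell mds.0 scrub pause
    # abort in-flight scrub operations; here the state stayed PAUSING+ABORTING
    ceph tell mds.0 scrub abort
    # resume would normally clear a pause once the state settles
    ceph tell mds.0 scrub resume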

The pausing/aborting would not clear on its own.
We ended up failing both MDS; after that, all MDS scrubbing had stopped.
We kicked off the MDS scrub again, and it finished for both MDS (the fail and re-scrub commands are sketched below).
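
(A hedged sketch of those recovery steps; the file system name "root" is taken from the ceph fs status output below, and the exact rank arguments used at the time are assumptions:)

    # fail each stuck rank so a standby takes it over and replays the journal
    ceph mds fail root:0
    ceph mds fail root:1
    # re-issue the scrub once both ranks are active again
    ceph tell mds.0 scrub start / recursive,repair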

We also ran an omap compaction for each SSD (an example command is sketched below).
We have mounted 70 clients and restarted their rsyncs.
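
(A minimal sketch of per-OSD omap/RocksDB compaction, assuming the compaction was run per OSD on the SSD-backed devices; the OSD IDs are placeholders:)

    # compact the omap/RocksDB store on a given OSD (IDs are placeholders)
    ceph tell osd.0 compact
    ceph tell osd.1 compact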

So we are back up and appear stable.
ceph fs status
root - 70 clients
====
RANK  STATE             MDS               ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  root.host1.amvgfe  Reqs:  667 /s  8258k  8250k   230k   283k
 1    active  root.host3.rdnzhn  Reqs: 2610 /s  4333k  4333k  28.7k   376k
    POOL       TYPE     USED  AVAIL
cephfs.meta  metadata  45.0G  12.8T
cephfs.data    data     128T   244T
      STANDBY MDS
root.host7.oqqvka
root.host5.lbrqru
MDS version: ceph version 16.2.8-85.el8cp (0bdc6db9a80af40dd496b05674a938d406a9f6f5) pacific (stable)

Version-Release number of selected component (if applicable):
MDS version: ceph version 16.2.8-85.el8cp 

How reproducible:
This is not reproducible on demand, but we have seen it twice so far:
1) 21-Feb-2023 - both MDS were stuck in replay.
2) 07-Mar-2023 - one MDS was stuck in replay:
   a) mds.1 was stuck in replay for approx. 15 minutes.
   b) It then went to reconnect, rejoin, active.
   c) Finished; see the status-check sketch below.
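
(For reference, a minimal sketch of the status commands that show a rank moving through replay -> reconnect -> rejoin -> active; the file system name "root" is taken from the ceph fs status output above:)

    # MDS map and per-rank states for the file system
    ceph fs status root
    ceph mds stat
    # cluster-level health warnings while a rank is stuck
    ceph health detail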


Steps to Reproduce: N/A

Actual results:
MDS stuck in replay (Ceph-only cluster).

Expected results:
MDS do not become stuck.

Additional info:
Similar, if not the same, behavior was reported in BZ 2107110, which was closed with not enough data.

This causes CephFS to be unavailable; users are impacted.