Bug 2177864

Summary: [5.2 ceph cluster] Both MDS are stuck in replay
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Brett Hull <bhull>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact:
Priority: high
Version: 5.2
CC: ceph-eng-bugs, cephqe-warriors, gfarnum, snipp, vshankar
Target Milestone: ---
Target Release: 6.1z1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-12 01:42:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Brett Hull 2023-03-13 17:53:02 UTC
Description of problem: 
Both MDS rank 0 and rank 1 are stuck in replay, and the ranks keep failing over to the standby MDS daemons.
Prior to the issue:
1) Both MDS were active.
2) Kicked off an MDS scrub via (a status check is sketched below):
   a) ceph tell mds.0 scrub start / recursive,repair
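
(For reference, a minimal sketch of checking scrub progress on a rank through the same tell interface; rank 0 is used only to match the command above:)

    # check scrub progress on rank 0
    ceph tell mds.0 scrub status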

We attempted to pause the MDS scrubs.

This MDS paused quickly: ceph-mds.root.host7.oqqvka.asok
This one is still trying to pause: ceph-mds.root.host3.rdnzhn.asok
We tried to abort after the pause had been running for 10 minutes;
now it is just stuck in PAUSING+ABORTING (the pause/abort commands are sketched after the status output below):
{
    "status": "PAUSING+ABORTING (2959255 inodes in the stack)",
    "scrubs": {
        "382d27c9-2a0f-472e-bc68-3149557ff890": {
            "path": "/",
            "tag": "382d27c9-2a0f-472e-bc68-3149557ff890",
            "options": "recursive,repair"
        }
    }
}
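
(For reference, a hedged sketch of the scrub pause/abort tell commands involved; the rank number is an assumption, not taken from the report:)

    # pause in-flight scrub operations on the given rank
    ceph tell mds.0 scrub pause
    # abort in-flight scrub operations; here the state stayed PAUSING+ABORTING
    ceph tell mds.0 scrub abort
    # resume would normally clear a pause once the state settles
    ceph tell mds.0 scrub resume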

The pausing/aborting would not clear on its own.
We ended up failing both MDS; after that, all MDS scrubbing had stopped.
We kicked off the MDS scrub again, and it finished for both MDS (the fail and re-scrub commands are sketched below).
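
(A hedged sketch of those recovery steps; the file system name "root" is taken from the ceph fs status output below, and the exact rank arguments used at the time are assumptions:)

    # fail each stuck rank so a standby takes it over and replays the journal
    ceph mds fail root:0
    ceph mds fail root:1
    # re-issue the scrub once both ranks are active again
    ceph tell mds.0 scrub start / recursive,repair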

We also ran an omap compaction for each SSD (an example command is sketched below).
We have mounted 70 clients and restarted their rsyncs.
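
(A minimal sketch of per-OSD omap/RocksDB compaction, assuming the compaction was run per OSD on the SSD-backed devices; the OSD IDs are placeholders:)

    # compact the omap/RocksDB store on a given OSD (IDs are placeholders)
    ceph tell osd.0 compact
    ceph tell osd.1 compact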

So we are back up and appear stable.
ceph fs status
root - 70 clients
====
RANK  STATE             MDS               ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  root.host1.amvgfe  Reqs:  667 /s  8258k  8250k   230k   283k
 1    active  root.host3.rdnzhn  Reqs: 2610 /s  4333k  4333k  28.7k   376k
    POOL       TYPE     USED  AVAIL
cephfs.meta  metadata  45.0G  12.8T
cephfs.data    data     128T   244T
      STANDBY MDS
root.host7.oqqvka
root.host5.lbrqru
MDS version: ceph version 16.2.8-85.el8cp (0bdc6db9a80af40dd496b05674a938d406a9f6f5) pacific (stable)

Version-Release number of selected component (if applicable):
MDS version: ceph version 16.2.8-85.el8cp 

How reproducible:
This is not reproducible on demand, but we have seen it twice so far:
1) 21-Feb-2023 - both MDS were stuck in replay.
2) 07-Mar-2023 - one MDS was stuck in replay:
   a) mds.1 was stuck in replay for approx. 15 minutes.
   b) It then went to reconnect, rejoin, active.
   c) Finished; see the status-check sketch below.
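
(For reference, a minimal sketch of the status commands that show a rank moving through replay -> reconnect -> rejoin -> active; the file system name "root" is taken from the ceph fs status output above:)

    # MDS map and per-rank states for the file system
    ceph fs status root
    ceph mds stat
    # cluster-level health warnings while a rank is stuck
    ceph health detail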


Steps to Reproduce: N/A

Actual results:
MDS stuck in replay (Ceph-only cluster).

Expected results:
MDS do not become stuck.

Additional info:
Similar, if not the same, behavior was reported in BZ 2107110, which was closed with not enough data.

This causes CephFS to be unavailable; users are impacted.