Bug 2177864 - [5.2 ceph cluster] Both MDS are stuck in replay
Summary: [5.2 ceph cluster] Both MDS are stuck in replay
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.1z1
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-13 17:53 UTC by Brett Hull
Modified: 2023-07-12 01:42 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-12 01:42:56 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-6257 (last updated 2023-03-13 17:53:53 UTC)

Description Brett Hull 2023-03-13 17:53:02 UTC
Description of problem:
Both MDS ranks 0 and 1 are stuck in replay; the ranks keep failing over to the standby MDS.
Prior to the issue:
1) Both MDS were active.
2) Kicked off an MDS scrub via:
   a) ceph tell mds.0 scrub start / recursive,repair

We attempted to pause the MDS scrubs (commands are sketched after the status output below).

One MDS paused quickly (ceph-mds.root.host7.oqqvka.asok); the other (ceph-mds.root.host3.rdnzhn.asok) was still trying to pause. After the pause had been running for 10 minutes we tried to abort, and now it is just stuck in PAUSING+ABORTING:
{
    "status": "PAUSING+ABORTING (2959255 inodes in the stack)",
    "scrubs": {
        "382d27c9-2a0f-472e-bc68-3149557ff890": {
            "path": "/",
            "tag": "382d27c9-2a0f-472e-bc68-3149557ff890",
            "options": "recursive,repair"
        }
    }
}
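
For reference, a minimal sketch of the CephFS scrub control commands involved here, run against rank 0 as in the description above; the exact invocations used at the time are not recorded in this report, and the JSON above looks like the output of the scrub status command:

# query the scrub state on an MDS rank
ceph tell mds.0 scrub status

# pause / resume all ongoing scrubs on that rank
ceph tell mds.0 scrub pause
ceph tell mds.0 scrub resume

# abort all ongoing scrubs on that rank
ceph tell mds.0 scrub abort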

The pausing/aborting state would not clear on its own.
We ended up failing both MDS; after that, all MDS scrubbing had stopped.
We kicked off the MDS scrub again, and it finished for both MDS.
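
A minimal sketch of that recovery sequence, assuming the ranks were failed with ceph mds fail (the exact commands used are not recorded here):

# fail both active ranks of the 'root' filesystem so the standbys take over
ceph mds fail root:0
ceph mds fail root:1

# once the ranks are active again, restart the scrub
ceph tell mds.0 scrub start / recursive,repair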

We also compacted the omap on each SSD OSD.
We have remounted the 70 clients and restarted their rsyncs.
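
The report does not state how the omap compaction was done; one common approach, assumed here, is an online RocksDB compaction on each OSD backing the metadata pool:

# trigger an online compaction on one OSD; repeat for each SSD-backed OSD
ceph tell osd.<id> compact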

So we are back up and appear stable.
ceph fs status
root - 70 clients
====
RANK  STATE             MDS               ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  root.host1.amvgfe  Reqs:  667 /s  8258k  8250k   230k   283k
 1    active  root.host3.rdnzhn  Reqs: 2610 /s  4333k  4333k  28.7k   376k
    POOL       TYPE     USED  AVAIL
cephfs.meta  metadata  45.0G  12.8T
cephfs.data    data     128T   244T
      STANDBY MDS
root.host7.oqqvka
root.host5.lbrqru
MDS version: ceph version 16.2.8-85.el8cp (0bdc6db9a80af40dd496b05674a938d406a9f6f5) pacific (stable)

Version-Release number of selected component (if applicable):
MDS version: ceph version 16.2.8-85.el8cp 

How reproducible:
Not reliably reproducible, but we have seen it twice so far (state checks are sketched after this list):
1) 21-Feb-2023 - both MDS were stuck in replay.
2) 07-Mar-2023 - one MDS stuck in replay:
   a) mds.1 was stuck in replay for approx. 15 minutes,
   b) then went to reconnect, rejoin, active,
   c) and finished.
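
During windows like these, the MDS states can be watched with standard status commands (a sketch; the daemon name is one from this cluster):

# cluster-wide view of MDS ranks and their states (active / replay / reconnect / rejoin)
ceph fs status
ceph health detail

# per-daemon state via the admin socket (run on the MDS host; inside the MDS container on cephadm deployments)
ceph daemon mds.root.host3.rdnzhn status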


Steps to Reproduce: N/A

Actual results:
MDS stuck in replay (Ceph-only cluster).

Expected results:
MDS do not become stuck.

Additional info:
Similar, if not the same, behavior was reported in BZ 2107110, which was closed due to insufficient data.

This causes CephFS to be unavailable; users are impacted.

