The issue was reported by an upstream community user. The cluster had two filesystems, and the active MDS of each filesystem was stuck in 'up:replay'. This persisted for around two days. Later, one of the active MDS daemons (still stuck in up:replay) crashed with the stack trace below.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
 8: clone()

The upstream Ceph tracker for this issue is https://tracker.ceph.com/issues/58489.
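Note that the assertion fails because mds_wipe_sessions defaults to false: journal replay reached a code path that is only tolerated when that debug option is enabled, so the daemon aborted. Before the crash, the stuck state is visible through the standard Ceph CLI. A minimal sketch of the commands one might use to observe it, assuming a default deployment with an admin keyring available (output shapes will vary by release):

  # Per-filesystem view of MDS ranks; a rank stuck in up:replay
  # is shown in the STATE column as "replay"
  $ ceph fs status

  # Compact summary of all MDS daemons and their states
  $ ceph mds stat

  # Cluster health detail, including MDS-related warnings such as
  # a filesystem being degraded while a rank replays its journal
  $ ceph health detail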
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623